Vectorize softmax and logsoftmax #7375
Conversation
Force-pushed 6ed9c6f to 2ce4e9c.
First single core timings
Force-pushed 4695f10 to e25f291.
Updated branch timings for softmax
This will probably conflict with #7275, which reorganizes the Unary/Reduce op prologues so that the CUDA-only functions live in the …
What's the plan for enabling vectorization for these functions with NEON (128-bit wide) or AVX-512 (512-bit wide)? Vec256 seems pretty AVX2-specific. Is the idea that you'd implement it N different times, once for each bit width?
@ajtulloch I think in that case we could just keep two 128-bit wide registers as a single value (similar to how Vec256 is implemented when AVX is disabled). AVX-512 is a completely different thing, but it's unclear whether we want to go down that path, considering all the issues with downclocking.
@ajtulloch The code can definitely handle different bit widths since it only uses Vec::size, and you can ifdef on CPU_CAPABILITY_, but it's not clear whether that'll immediately be a good idea without investigating case by case. As for ARM, the plan for now is to only write kernels specific to CPU. What do you think would be a good strategy?
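To make the Vec::size point concrete, here is a minimal, self-contained sketch (the Vec type below is a stand-in, not the actual ATen Vec256 API): a kernel written only against Vec::size and loadu/store compiles unchanged whatever the underlying register width is.

```cpp
// Hypothetical stand-in for a Vec256-style wrapper; names below are not the
// actual ATen API. The point: the kernel only depends on Vec::size plus
// loadu/store (with an optional element count), so the same code works whether
// the wrapper holds one 256-bit register, two 128-bit ones, or plain scalars.
#include <algorithm>
#include <cstdint>

template <typename T, int N>
struct Vec {
  static constexpr int size = N;            // number of lanes
  T values[N];
  static Vec loadu(const T* ptr, int count = N) {
    Vec v{};
    std::copy(ptr, ptr + count, v.values);  // partial load for the tail
    return v;
  }
  void store(T* ptr, int count = N) const {
    std::copy(values, values + count, ptr);
  }
  Vec operator+(const Vec& other) const {
    Vec r;
    for (int i = 0; i < N; ++i) r.values[i] = values[i] + other.values[i];
    return r;
  }
};

// Width-agnostic elementwise add: nothing here hard-codes 256 bits.
template <typename VecT, typename T>
void add_kernel(T* out, const T* a, const T* b, int64_t n) {
  int64_t d = 0;
  for (; d + VecT::size <= n; d += VecT::size)
    (VecT::loadu(a + d) + VecT::loadu(b + d)).store(out + d);
  if (d < n)  // partial tail, as in the PR's map/reduce helpers
    (VecT::loadu(a + d, n - d) + VecT::loadu(b + d, n - d)).store(out + d, n - d);
}
```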
Force-pushed 549cdf4 to 2e39a2f.
I have a character RNN locally that produces NaNs (inf loss) when using this diff, yet all test_nn tests succeed. So let's not merge this even if all tests pass.
@apaszke AVX2 also has issues with downclocking; it's not safe to just unconditionally enable it, btw. @cpuhrsch ARM CPUs are CPUs too, so I assume you meant you only want to handle Intel CPUs? It seems like the right abstraction is basically the Eigen Packet abstraction, since that's essentially what Vec256 is a special case of (a Packet is basically a C++ wrapper around a generic SIMD type, so Vec256 is essentially a Packet<T, 256>). If this is generalized to more than AVX2, then IMO that makes sense.
@ajtulloch Yes, that's correct: x86 CPUs. I'm associating ARM with mobile, and somehow that turned into "not CPU". There's also something to be said for splitting ARM and x86 into two separate categories (of course they can still share code), because you'd probably want the ATen library to be very small for mobile. We can do something like in this PR, but that can probably also be done at the CMake level.
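As a rough illustration of the two ideas above (keeping two narrower registers as a single value, and an Eigen-style Packet<T, N> generalization), here is a purely hypothetical sketch; none of these names come from ATen or Eigen.

```cpp
// Hypothetical: a 256-bit packet emulated as two 128-bit halves. Packet128
// stands in for a real 128-bit SIMD wrapper (e.g. a NEON register).
template <typename T>
struct Packet128 {
  static constexpr int size = 16 / static_cast<int>(sizeof(T));
  T values[16 / sizeof(T)];
  Packet128 operator*(const Packet128& o) const {
    Packet128 r;
    for (int i = 0; i < size; ++i) r.values[i] = values[i] * o.values[i];
    return r;
  }
};

template <typename T>
struct Packet256 {
  Packet128<T> lo, hi;                 // two narrower registers as one value
  static constexpr int size = 2 * Packet128<T>::size;
  Packet256 operator*(const Packet256& o) const {
    return {lo * o.lo, hi * o.hi};     // every op just runs on both halves
  }
};
```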
Force-pushed c0fad45 to 86cd5d1.
apaszke left a comment:
I think quite a bit of this code could be simplified, using a higher-level vec_reduce helper. We actually have something like this in a CUDA implementation, and it generally ends up being much more readable.
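A guess at the rough shape of such a vec_reduce helper (hypothetical signature and names, not actual ATen or CUDA code): fold full Vec-width chunks with a vector op, spill once for the horizontal step, and finish the tail with the scalar op.

```cpp
// Hypothetical vec_reduce helper. Vec is assumed to expose a static constexpr
// `size`, `loadu`, and `store`, as in the Vec256 wrappers.
#include <cstdint>

template <typename scalar_t, typename Vec, typename VecOp, typename ScalarOp>
scalar_t vec_reduce(VecOp vec_op, ScalarOp scalar_op,
                    const scalar_t* data, int64_t size) {
  int64_t d = 0;
  scalar_t acc;
  if (size >= Vec::size) {
    Vec vacc = Vec::loadu(data);                       // first full chunk
    for (d = Vec::size; d + Vec::size <= size; d += Vec::size)
      vacc = vec_op(vacc, Vec::loadu(data + d));       // fold full chunks
    scalar_t buf[Vec::size];
    vacc.store(buf);                                   // horizontal step
    acc = buf[0];
    for (int i = 1; i < Vec::size; ++i) acc = scalar_op(acc, buf[i]);
  } else {
    acc = data[0];                                     // row shorter than one Vec
    d = 1;
  }
  for (; d < size; ++d)                                // scalar tail
    acc = scalar_op(acc, data[d]);
  return acc;
}
```

With something like this, the per-row max or sum in the kernel could become a single call over (row pointer, dim_size) instead of an explicit chunking loop at each site.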
Force-pushed 19d3d9c to 4154a9b.
One key remaining concern is whether it's important to accumulate sums into doubles. On my local example (character RNN) this appears to make no difference. If someone needs the extra precision, they could also upcast everything to doubles.
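For illustration only (not the PR's code), here is the precision trade-off in miniature: accumulating the same float terms into a float accumulator versus a double accumulator.

```cpp
// Toy comparison of float vs. double accumulation (illustrative only; the
// size of the difference depends entirely on the data).
#include <cstdio>

int main() {
  const int n = 20000;                  // e.g. one softmax row with 20k classes
  float  fsum = 0.0f;
  double dsum = 0.0;
  for (int i = 0; i < n; ++i) {
    float term = 1.0f / static_cast<float>(i + 1);
    fsum += term;                       // rounds to a 24-bit mantissa every step
    dsum += term;                       // keeps ~53 bits until the final cast
  }
  std::printf("float acc = %.7f, double acc = %.7f\n",
              fsum, static_cast<float>(dsum));
  return 0;
}
```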
I'm seeing a 4.5x speedup on a single core and 1.8x on 10 threads using this code and these commands.
Force-pushed 4a87cf0 to 6add1a7.
Deciding whether to accumulate into doubles or floats from a perf perspective. I'm using the following analysis to investigate whether this algorithm is compute-bound. If it is, we should accumulate into floats by default and advise users with numerical stability issues to upcast to doubles, as that case will be rare.

My development machine has two NUMA nodes with the following layout. Any kernel that is to some extent memory-bound will see heavy penalties from being scheduled on node 0 CPUs while having its memory bound to node 1, since going over the QPI link is much slower than reading from local memory.

I'm using the following code to get some measurements; it triggers the vectorized branches. The size of the input tensor is chosen to be roughly representative of a typical (larger) workload: a batch size of 128 and 20k classes. There are two investigations: 1) set the CPU frequency to different values, 2) bind memory to a different NUMA node.

For 1) we can observe … For 2) we can observe …

In conclusion, we see an increase in wall-time (3.2s to 6.0s and 3.4s to 6.1s on the two memory nodes) proportional to the decrease in CPU frequency (2.2GHz to 1.2GHz), but only a very small increase in wall-time when allocating memory on a NUMA node away from the CPUs in use. This leads me to believe that the kernel is compute-bound on a single core of my particular machine.

I've also compared the wall-time of this kernel with master. Even when accumulating into doubles, we still get a 2x runtime improvement (on a single core) over the current implementation. With floats we get around a ~5.5x improvement.
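A rough sketch of the kind of measurement described above (hypothetical; the actual script, numactl invocations, and cpufreq settings used are not reproduced here): time repeated log-softmax calls over a 128 x 20000 float tensor, and run the binary under different NUMA bindings, e.g. `numactl --cpunodebind=0 --membind=1 ./bench`.

```cpp
// Hypothetical timing harness, assuming the at::log_softmax(Tensor, int64_t)
// entry point; not the author's actual benchmark code.
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

int main() {
  at::Tensor input = at::randn({128, 20000}, at::kFloat);  // batch x classes
  at::Tensor out = at::log_softmax(input, /*dim=*/1);      // warm-up
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 1000; ++i)
    out = at::log_softmax(input, /*dim=*/1);
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  std::printf("wall time: %.3f s\n", elapsed.count());
  return 0;
}
```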
apaszke left a comment:
Looks nice. I have some final minor comments and then it should be good to go. I'm not sure where the grad formula for softmax is coming from, since it doesn't seem to match what we do e.g. for CUDA.
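For reference, the standard closed forms (which is what the question about the grad formula refers to; this is textbook math, not a claim about what either implementation does), with $y = \mathrm{softmax}(x)$ and $\hat y = \mathrm{log\_softmax}(x)$:

```latex
% softmax backward
\frac{\partial L}{\partial x_i}
  = y_i \left( \frac{\partial L}{\partial y_i}
             - \sum_j \frac{\partial L}{\partial y_j}\, y_j \right)

% log_softmax backward
\frac{\partial L}{\partial x_i}
  = \frac{\partial L}{\partial \hat y_i}
  - \exp(\hat y_i) \sum_j \frac{\partial L}{\partial \hat y_j}
```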
    Vec output_vec = vec_fun(data_vec, data_vec2);
    output_vec.store(output_data + d, size - d);
  }
}
case 6:
  return blend<63>(a, b);
case 7:
  return blend<127>(a, b);
#include "intrinsics.h"
#include "vec256_base.h"

// TODO: Add tests for partial loads
case 6:
  return blend<63>(a, b);
case 7:
  return blend<127>(a, b);
int64_t i = ii + j;
scalar_t* input_data = input_data_base + i * dim_size;
max_input_arr[j] = vec256::reduce_all<scalar_t>(
    [](Vec& x, Vec& y) { return vec256::max(x, y); },
      Vec::loadu(max_input_arr + j, loop_end - j) +
      Vec::loadu(tmp_sum_scalar + j, loop_end - j).log();
  tmp_sum_scalar_vec.store(tmp_sum_scalar + j, loop_end - j);
}
static tbb::affinity_partitioner ap;

template <int64_t size>
inline int64_t _leftover(int64_t x, int64_t y) {
      dim_size);
} else {
  vec256::map2(
      [sum](Vec x, Vec y) { return (x - Vec(sum)) * y; },
    [](Vec x, Vec y) { return x + y; },
    grad_data,
    output_data,
    dim_size);
This reverts commit 0585394.
This PR uses Vec256 to vectorize the softmax and logsoftmax layers. This comes in four steps:
* log_softmax
* softmax
* log_softmax_backward
* softmax_backward

Commits:
* Vectorized Softmax and LogSoftmax
* Abstractions
* Style
* Remove <limits> for Kernel
* Perf investigations
* Last cleanups

I will remove the SLEEF commit once the corresponding PR lands.