[ATEN][CUDA] Reduce register pressure in radix_sort_pairs to improve torch.sort performance #167094
YyWangCS wants to merge 4 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167094
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 69a1c26 with merge base 0b4dd08. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: cuda"

cc @peterbell10
ngimel left a comment:
It's interesting that the compiler is not doing it by itself. Thanks!
// the data, so we can save template instantiations by reinterpreting
// it as an opaque type.
// We use native integer types for 1/2/4/8-byte values to reduce
// register usage in CUDA kernels. For sizes > 8 fall back to char array.
If only uint128_t existed in the C++ standard. Clang/GCC have it as an extension, I think, sigh...
Actually, it occurred to me: does uint4_t exist in CUDA for float4? I guess it wouldn't even matter in this case for register usage, would it?
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#host-compiler-extensions Apparently CUDA supports it; should we add it?
In radix_sort_pairs there is a static check that the size of value_t is 1, 2, 4, or 8, so I only handle these four cases and, for safety, use char[N] as the fallback. I think we currently do not need to add float4, as this value type is mostly used for index types, which are int64 or int32.
using opaque_t = detail::OpaqueType<sizeof(value_t)>;
static_assert(sizeof(value_t) <= 8 && (sizeof(value_t) & (sizeof(value_t) - 1)) == 0,
"This size of value_t is not instantiated. Please instantiate it in cub.cu"
" and modify this check.");
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
I have tested with CUDA 12.6 and CUDA 12.8 on various GPUs (A100, H100, H20), and they all have this issue. The performance table in the PR is based on CUDA 12.8; as 12.8 is still one of the main versions, I am submitting this PR as a fix. From the PTX of CUDA 12.6 I explicitly notice the following when using char[8]: However, when I use uint64_t, there is only one instruction. Besides, the opaque type is used in other kernels too; there are separate OpaqueType definitions in IndexKernel.cu, Shape.cu, and ScatterGatherKernel.cu. I have tested some relevant ops like
Summary
This PR improves torch.sort and torch.unique performance by 15% to 50% on NVIDIA GPUs by optimizing CUDA register allocation in radix sort operations.

The key change: specialize OpaqueType<N> to use native integer types (uint8_t, uint16_t, uint32_t, uint64_t) for common sizes (1, 2, 4, 8 bytes) instead of char data[N]. This enables more efficient register allocation while preserving the template deduplication strategy.

The following table shows the speedup on various input shapes and GPUs. Sorting is performed on the last dimension, and the baseline torch version is 2.9.0.
Analysis
torch.sort and torch.unique use radix_sort_pairs, which internally calls cub::DeviceRadixSort::SortPairs. Since values are only copied (never compared), we cast them to OpaqueType<sizeof(value_t)> to minimize template instantiations. For example, both int32 and float32 values map to the same OpaqueType<4>.
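To make the call pattern concrete, here is a minimal sketch under my own assumptions (the wrapper name radix_sort_pairs_sketch and the temp-storage handling are illustrative, not the actual ATen code) of how the values are reinterpreted as an opaque type before calling cub::DeviceRadixSort::SortPairs:

```cpp
#include <cub/cub.cuh>
#include <cstdint>

// Pre-PR shape of the opaque type: an N-byte aggregate of chars.
template <int N>
struct OpaqueType { char data[N]; };

template <typename key_t, typename value_t>
void radix_sort_pairs_sketch(const key_t* keys_in, key_t* keys_out,
                             const value_t* values_in, value_t* values_out,
                             int num_items, cudaStream_t stream = nullptr) {
  // Values are only moved, never compared, so e.g. int32 and float32 values
  // can both be handled by the single OpaqueType<4> instantiation.
  using opaque_t = OpaqueType<sizeof(value_t)>;
  auto* v_in  = reinterpret_cast<const opaque_t*>(values_in);
  auto* v_out = reinterpret_cast<opaque_t*>(values_out);

  size_t temp_bytes = 0;  // first call only queries the temp-storage size
  cub::DeviceRadixSort::SortPairs(nullptr, temp_bytes, keys_in, keys_out,
                                  v_in, v_out, num_items,
                                  0, sizeof(key_t) * 8, stream);
  void* temp = nullptr;
  cudaMalloc(&temp, temp_bytes);
  cub::DeviceRadixSort::SortPairs(temp, temp_bytes, keys_in, keys_out,
                                  v_in, v_out, num_items,
                                  0, sizeof(key_t) * 8, stream);
  cudaFree(temp);
}
```

With the pre-PR definition above, OpaqueType<8> is just char data[8], which is where the register pressure described next comes from.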
The Problem

The previous char data[N] implementation causes inefficient register allocation. Here is one cause I found in the SASS code, for 8-byte types:

- char data[8]: the compiler may allocate 8 registers (one per byte)
- uint64_t data: the compiler allocates 2 registers (standard 64-bit handling)

This happens because the compiler doesn't recognize char[8] as a cohesive 64-bit value and treats each byte independently, which increases register pressure and reduces GPU occupancy.
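As a standalone illustration of the effect (my own toy example, not code from this PR), the two structs below wrap the same 8-byte payload; ByteBlob8 has alignment 1, so the compiler cannot assume the copy can be merged into a single 64-bit access, while WordBlob8 is a plain 64-bit scalar:

```cpp
#include <cstdint>

struct ByteBlob8 { char data[8]; };   // analogous to the old OpaqueType<8>
struct WordBlob8 { uint64_t data; };  // analogous to the new OpaqueType<8>

__global__ void copy_bytes(const ByteBlob8* in, ByteBlob8* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];  // may be lowered to several byte-sized loads/stores
}

__global__ void copy_words(const WordBlob8* in, WordBlob8* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];  // a single 64-bit load/store pair
}
```

Compiling with nvcc -Xptxas -v reports the registers used per kernel, which is a quick way to compare the two lowerings; the exact counts depend on the compiler version.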
From Nsight Compute: when using char data[8], registers per thread is 166 and the corresponding theoretical occupancy is 18.75%; when using native uint64_t, registers per thread is 80 and the corresponding theoretical occupancy is 37.5%.
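These two occupancy figures are consistent with a back-of-the-envelope check, assuming per-SM limits of 65,536 registers and 2,048 threads and a 384-thread sort block (these limits are my assumptions, not values reported in the PR): at 166 registers/thread, one block needs 384 × 166 ≈ 63.7K registers, so only one block fits per SM and occupancy is 384 / 2048 = 18.75%; at 80 registers/thread, two blocks need 768 × 80 ≈ 61.4K registers, so two blocks fit and occupancy is 768 / 2048 = 37.5%.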
The Solution

Specialize OpaqueType<N> for common sizes using native integer types.
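The PR's actual diff is not reproduced here; a minimal sketch of the specialization pattern it describes looks like this (member and namespace names may differ from the real ATen code):

```cpp
#include <cstdint>

// Generic fallback keeps the old behavior for sizes other than 1/2/4/8.
template <int N>
struct OpaqueType { char data[N]; };

// Native integer members for the common sizes, so the compiler treats the
// value as one scalar instead of N independent bytes.
template <> struct OpaqueType<1> { uint8_t  data; };
template <> struct OpaqueType<2> { uint16_t data; };
template <> struct OpaqueType<4> { uint32_t data; };
template <> struct OpaqueType<8> { uint64_t data; };
```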
This preserves the template deduplication strategy (all 8-byte types still use the same OpaqueType<8> instantiation) while enabling better register allocation.

Testing & Compatibility
Testing:
✅ Correctness tests pass for various input types (bfloat16, int32, float32, int64), shapes, and dimensions (1, 2, 3)
✅ Register usage reduction verified with NSight Compute
✅ Linter passes
Compatibility:
✅ No API/ABI changes
✅ Template instantiation count unchanged
Reference
For a detailed analysis, please refer to my previous blog post: Performance Optimization of torch.sort on GPU