
[ATEN][CUDA] Reduce register pressure in radix_sort_pairs to improve torch.sort performance#167094

Closed
YyWangCS wants to merge 4 commits into pytorch:main from YyWangCS:wyy/opt_sort_occupancy

Conversation


@YyWangCS YyWangCS commented Nov 5, 2025

Summary

This PR improves torch.sort and torch.unique performance by 15% to 50% on NVIDIA GPUs by optimizing CUDA register allocation in radix sort operations.

The key change: specialize OpaqueType<N> to use native integer types (uint8_t, uint16_t, uint32_t, uint64_t) for common sizes (1, 2, 4, 8 bytes) instead of char data[N]. This enables more efficient register allocation while preserving the template deduplication strategy.

The following table shows the speedup for various input shapes and GPUs. Sorting is performed along the last dimension; the baseline torch version is 2.9.0.

| GPU | Input shape | Input dtype | Before (ms) | After (ms) | Speedup |
|------|-----------|-------|------|------|-------|
| H100 | (16, 1e6) | int32 | 1.61 | 1.37 | 1.18× |
| H100 | (1, 1e8)  | int32 | 6.6  | 5.0  | 1.3×  |
| H20  | (16, 1e6) | int64 | 3.57 | 3.03 | 1.18× |
| H20  | (1, 1e8)  | int64 | 19.3 | 13.0 | 1.48× |

Analysis

torch.sort and torch.unique use radix_sort_pairs, which internally calls cub::DeviceRadixSort::SortPairs. Since values are only copied (never compared), we cast them to OpaqueType<sizeof(value_t)> to minimize template instantiations. For example, both int32 and float32 values map to the same OpaqueType<4>.

The Problem

The previous char data[N] implementation causes inefficient register allocation. Here is one cause I identified in the SASS code. For 8-byte types:

  • char data[8]: Compiler may allocate 8 registers (one per byte)

  • uint64_t data: Compiler allocates 2 registers (standard 64-bit handling)

This happens because the compiler doesn't recognize char[8] as a cohesive 64-bit value, treating each byte independently, which increases register pressure and reduces GPU occupancy.

From Nsight Compute: with char data[8], register usage is 166 registers per thread, for a theoretical occupancy of 18.75%; with native uint64_t, it is 80 registers per thread, for a theoretical occupancy of 37.5%.

The Solution

Specialize OpaqueType<N> for common sizes using native integer types:

```cpp
// Before
template <int N> struct alignas(N) OpaqueType { char data[N]; };

// After
template <int N> struct alignas(N) OpaqueType { char data[N]; }; // fallback
template <> struct alignas(1) OpaqueType<1> { uint8_t data; };
template <> struct alignas(2) OpaqueType<2> { uint16_t data; };
template <> struct alignas(4) OpaqueType<4> { uint32_t data; };
template <> struct alignas(8) OpaqueType<8> { uint64_t data; };
```

This preserves the template deduplication strategy (all 8-byte types still use the same OpaqueType<8> instantiation) while enabling better register allocation.

Testing & Compatibility

Testing:

✅ Correctness tests pass for various input types (bfloat16, int32, float32, int64), shapes, and dimensions (1, 2, 3)
✅ Register usage reduction verified with NSight Compute
✅ Linter passes

Compatibility:

✅ No API/ABI changes
✅ Template instantiation count unchanged

Reference

For a detailed analysis, please refer to my earlier blog post: Performance Optimization of torch.sort on GPU

@pytorch-bot

pytorch-bot bot commented Nov 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167094

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 69a1c26 with merge base 0b4dd08:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


YyWangCS commented Nov 5, 2025

@pytorchbot label "release notes: cuda"

@pytorch-bot pytorch-bot bot added the release notes: cuda release notes category label Nov 5, 2025

YyWangCS commented Nov 5, 2025

cc @peterbell10

@Skylion007 Skylion007 requested a review from ngimel November 5, 2025 18:00

@ngimel ngimel left a comment


It's interesting that the compiler is not doing this by itself. Thanks!

```cpp
// the data, so we can save template instantiations by reinterpreting
// it as an opaque type.
// We use native integer types for 1/2/4/8-byte values to reduce
// register usage in CUDA kernels. For sizes > 8 fall back to char array.
```

If only uint128_t existed in the C++ standard. Clang/GCC have it as an extension, I think, sigh...


@Skylion007 Skylion007 Nov 5, 2025


Actually, it occurred to me: does uint4_t exist in CUDA for float4? I guess it wouldn't even matter in this case for register usage, would it?



@YyWangCS YyWangCS Nov 6, 2025


https://docs.nvidia.com/cuda/cuda-c-programming-guide/#host-compiler-extensions Apparently CUDA supports it; should we add it?

In radix_sort_pairs there is a static check that the size of value_t is 1, 2, 4, or 8, so I only handle these four cases and, for safety, use char[N] as the fallback. I think we currently do not need to add float4, as this value type is mostly used for index types, which are int64 or int32.

```cpp
  using opaque_t = detail::OpaqueType<sizeof(value_t)>;
  static_assert(sizeof(value_t) <= 8 && (sizeof(value_t) & (sizeof(value_t) - 1)) == 0,
                "This size of value_t is not instantiated. Please instantiate it in cub.cu"
                " and modify this check.");
```

@Skylion007 Skylion007 added module: performance Issues related to performance, either of kernel code or framework glue topic: performance topic category labels Nov 5, 2025
@Skylion007

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 5, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here


YyWangCS commented Nov 6, 2025

> It's interesting that the compiler is not doing this by itself. Thanks!

I have tested with CUDA 12.6 and CUDA 12.8 on various GPUs (A100, H100, H20), and they all exhibit this issue. The performance table in the PR is based on CUDA 12.8. Since 12.8 is still one of the mainstream versions, I am submitting this PR as a fix.

In the PTX generated by CUDA 12.6, I can see the following when using char[8]:

```
ld.global.v4.u16 {%rs2123, %rs2121, %rs2119, %rs2117}, [%rd5];
shr.u16  %rs2122, %rs2123, 8;
shr.u16  %rs2120, %rs2121, 8;
shr.u16  %rs2118, %rs2119, 8;
shr.u16  %rs2116, %rs2117, 8;
```

However, when I use uint64_t, there is only one instruction: ld.global.u64 %rd506, [%rd7]. In the Nsight Compute view, registers per thread are 80 with uint64_t versus 166 with char[8].

Additionally, the opaque type is used in other kernels too: there are separate OpaqueType definitions in IndexKernel.cu, Shape.cu, and ScatterGatherKernel.cu. I tested some relevant ops, such as torch.cat in Shape.cu, and found they are not register bound, so I did not change those OpaqueTypes.
cc @Skylion007
