
[ATEN][CUDA] Reduce register pressure in radix_sort_pairs to improve torch.sort performance#167094

Closed
YyWangCS wants to merge 4 commits into pytorch:main from YyWangCS:wyy/opt_sort_occupancy

Conversation


@YyWangCS YyWangCS commented Nov 5, 2025

Summary

This PR improves torch.sort and torch.unique performance by 15% to 50% on NVIDIA GPUs by optimizing CUDA register allocation in radix sort operations.

The key change: specialize OpaqueType<N> to use native integer types (uint8_t, uint16_t, uint32_t, uint64_t) for common sizes (1, 2, 4, 8 bytes) instead of char data[N]. This enables more efficient register allocation while preserving the template deduplication strategy.

The following table shows the speedup for various input shapes and GPUs. Sorting is performed along the last dimension; the baseline torch version is 2.9.0.

| GPU | Input shape | Input dtype | Before (ms) | After (ms) | Speedup |
|------|-----------|-------|------|------|-------|
| H100 | (16, 1e6) | int32 | 1.61 | 1.37 | 1.18× |
| H100 | (1, 1e8)  | int32 | 6.6  | 5.0  | 1.3×  |
| H20  | (16, 1e6) | int64 | 3.57 | 3.03 | 1.18× |
| H20  | (1, 1e8)  | int64 | 19.3 | 13.0 | 1.48× |

Analysis

torch.sort and torch.unique use radix_sort_pairs, which internally calls cub::DeviceRadixSort::SortPairs. Since values are only copied (never compared), we cast them to OpaqueType<sizeof(value_t)> to minimize template instantiations. For example, both int32 and float32 values map to the same OpaqueType<4>.

The Problem

The previous char data[N] implementation causes inefficient register allocation. Here is one cause I identified in the SASS code. For 8-byte types:

  • char data[8]: Compiler may allocate 8 registers (one per byte)

  • uint64_t data: Compiler allocates 2 registers (standard 64-bit handling)

This happens because the compiler doesn't recognize char[8] as a cohesive 64-bit value, treating each byte independently, which increases register pressure and reduces GPU occupancy.

From Nsight Compute: with char data[8], register usage is 166 registers per thread, for a theoretical occupancy of 18.75%; with native uint64_t, it is 80 registers per thread, for a theoretical occupancy of 37.5%.

The Solution

Specialize OpaqueType<N> for common sizes using native integer types:

```cpp
// Before
template <int N> struct alignas(N) OpaqueType { char data[N]; };

// After
template <int N> struct alignas(N) OpaqueType { char data[N]; }; // fallback
template <> struct alignas(1) OpaqueType<1> { uint8_t data; };
template <> struct alignas(2) OpaqueType<2> { uint16_t data; };
template <> struct alignas(4) OpaqueType<4> { uint32_t data; };
template <> struct alignas(8) OpaqueType<8> { uint64_t data; };
```

This preserves the template deduplication strategy (all 8-byte types still use the same OpaqueType<8> instantiation) while enabling better register allocation.

Testing & Compatibility

Testing:

✅ Correctness tests pass for various input types (bfloat16, int32, float32, int64), shapes, and dimensions (1, 2, 3)
✅ Register usage reduction verified with NSight Compute
✅ Linter passes

Compatibility:

✅ No API/ABI changes
✅ Template instantiation count unchanged

Reference

For a detailed analysis, please refer to my earlier blog post: Performance Optimization of torch.sort on GPU

@pytorch-bot

pytorch-bot bot commented Nov 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167094

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 69a1c26 with merge base 0b4dd08:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


YyWangCS commented Nov 5, 2025

@pytorchbot label "release notes: cuda"

@pytorch-bot pytorch-bot bot added the release notes: cuda release notes category label Nov 5, 2025

YyWangCS commented Nov 5, 2025

cc @peterbell10

@Skylion007 Skylion007 requested a review from ngimel November 5, 2025 18:00

@ngimel ngimel left a comment


It's interesting that the compiler is not doing this by itself. Thanks!

```cpp
// the data, so we can save template instantiations by reinterpreting
// it as an opaque type.
// We use native integer types for 1/2/4/8-byte values to reduce
// register usage in CUDA kernels. For sizes > 8 fall back to char array.
```

If only uint128_t existed in the C++ standard. Clang/GCC have it as an extension, I think, sigh...


@Skylion007 Skylion007 Nov 5, 2025


Actually, it occurred to me: does uint4_t exist in CUDA for float4? I guess it wouldn't even matter in this case for register usage, would it?



@YyWangCS YyWangCS Nov 6, 2025


https://docs.nvidia.com/cuda/cuda-c-programming-guide/#host-compiler-extensions Apparently CUDA supports it; should we add it?

In radix_sort_pairs there is a static check that the size of value_t is 1, 2, 4, or 8, so I only handle these four cases and, for safety, use char[N] as the fallback. I think we currently do not need to add float4, as this value type is mostly used for index types, which are int64 or int32.

```cpp
  using opaque_t = detail::OpaqueType<sizeof(value_t)>;
  static_assert(sizeof(value_t) <= 8 && (sizeof(value_t) & (sizeof(value_t) - 1)) == 0,
                "This size of value_t is not instantiated. Please instantiate it in cub.cu"
                " and modify this check.");
```

@Skylion007 Skylion007 added module: performance Issues related to performance, either of kernel code or framework glue topic: performance topic category labels Nov 5, 2025
@Skylion007

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 5, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here


YyWangCS commented Nov 6, 2025

> It's interesting that the compiler is not doing this by itself. Thanks!

I have tested with CUDA 12.6 and CUDA 12.8 on various GPUs (A100, H100, H20), and they all exhibit this issue. The performance table in the PR is based on CUDA 12.8. Since 12.8 is still one of the mainstream versions, I am submitting this PR as a fix.

In the PTX generated by CUDA 12.6, I can see the following when using char[8]:

```
ld.global.v4.u16 {%rs2123, %rs2121, %rs2119, %rs2117}, [%rd5];
shr.u16  %rs2122, %rs2123, 8;
shr.u16  %rs2120, %rs2121, 8;
shr.u16  %rs2118, %rs2119, 8;
shr.u16  %rs2116, %rs2117, 8;
```

However, when I use uint64_t, there is only one instruction: ld.global.u64 %rd506, [%rd7]. In the Nsight Compute view, registers per thread are 80 with uint64_t versus 166 with char[8].

Additionally, the opaque type is used in other kernels too: there are separate OpaqueType definitions in IndexKernel.cu, Shape.cu, and ScatterGatherKernel.cu. I tested some relevant ops, such as torch.cat in Shape.cu, and found they are not register bound, so I did not change those OpaqueTypes.
cc @Skylion007
