torch.topk: refactor global histogram/cumsum into a dedicated kernel to eliminate redundant memory access #164459
Closed
YyWangCS wants to merge 5 commits into pytorch:main
Conversation
… avoid redundant memory access
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164459
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit a5dc805 with merge base 39c340e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Contributor
Author
cc @ngimel
ngimel approved these changes Oct 3, 2025
Skylion007 approved these changes Oct 3, 2025
Contributor
Author
@eqy This PR has been approved by two reviewers; could you help merge it?
Collaborator
@pytorchbot merge
Collaborator
@YyWangCS You can use @pytorchbot to self-help.
Collaborator
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request on Oct 21, 2025:
…to eliminate redundant memory access (pytorch#164459)
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request on Dec 15, 2025:
…to eliminate redundant memory access (pytorch#164459)
# TLDR

This PR removes the regression in torch.topk introduced in torch 2.7.0 and delivers much better performance for large inputs.

The table below reports execution times on H20 for various input sizes with float32 data, extracting the top-100 values. Results indicate that this PR restores and improves performance, especially on large inputs.

| Input Shape    | torch 2.6.0 (ms) | torch 2.8.0 (ms) | 2.8.0 + this PR (ms) |
| -------------- | ---------------- | ---------------- | -------------------- |
| (1, 1B)        | 36.6             | 1564.1           | 25.6                 |
| (1, 100M)      | 3.56             | 17.4             | 2.54                 |
| (1, 1,000,000) | 0.135            | 0.145            | 0.098                |
| (512, 128000)  | 1.33             | 1.33             | 1.32                 |
| (8192, 128000) | 19.6             | 19.6             | 19.4                 |
# Background

After upgrading PyTorch from 2.6.0 to 2.7.0, we observed a significant GPU performance regression in `torch.topk` on NVIDIA GPUs. For instance, extracting the top-1000 largest values from one billion floats on an NVIDIA H20 increased from **36 ms** to **1.6 s**. Profiling with Nsight Compute indicates that the slowdown is caused by redundant memory accesses introduced in PR #145536.
# Analysis

`torch.topk` relies on **RadixSelect** to find the target values. Each radix pass requires computing a histogram of the input values. For large inputs, histogram computation is split into two stages (a sketch of this scheme follows at the end of this section):

1. **Local histogram**: Each CUDA block processes a subset of the input and writes its local histogram to global memory.
2. **Global reduction**: A single CUDA block reads all local histograms from global memory and reduces them into the final global histogram.

Before PR #145536, both stages ran inside a single kernel (`radixFindKthValues`), using a semaphore to ensure that all local histograms were completed before reduction.

In PR #145536, the global histogram computation was merged with subsequent top-k calculations into a single kernel (`computeBlockwiseKthCounts`) to avoid the semaphore. While this simplifies synchronization, it introduces **redundant memory reads**:

- `computeBlockwiseKthCounts` launches `numInputSlices * blocks_per_slice` blocks.
- For each row (slice), `blocks_per_slice` CUDA blocks redundantly reload the same local histograms from global memory.
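For illustration, here is a minimal sketch of the two-stage scheme described above, simplified to a single slice and an 8-bit digit. The kernel names, buffer layout, and radix width are assumptions made for exposition; they are not the actual PyTorch kernels.

```cpp
// Minimal two-stage histogram sketch (illustrative only; names and layout are
// assumptions, not the PyTorch implementation).
constexpr int RADIX_BITS = 8;
constexpr int RADIX_SIZE = 1 << RADIX_BITS;  // 256 digit bins

// Stage 1: each block histograms part of the input and writes its local
// histogram to global memory (one RADIX_SIZE-wide row per block).
__global__ void localHistogramSketch(const unsigned int* input, long n,
                                     int digitPos, unsigned int* localHist) {
  __shared__ unsigned int smem[RADIX_SIZE];
  for (int i = threadIdx.x; i < RADIX_SIZE; i += blockDim.x) smem[i] = 0;
  __syncthreads();

  for (long i = (long)blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (long)gridDim.x * blockDim.x) {
    unsigned int digit = (input[i] >> digitPos) & (RADIX_SIZE - 1);
    atomicAdd(&smem[digit], 1u);
  }
  __syncthreads();

  for (int i = threadIdx.x; i < RADIX_SIZE; i += blockDim.x)
    localHist[(long)blockIdx.x * RADIX_SIZE + i] = smem[i];
}

// Stage 2: a single block reads every local histogram back from global memory
// and reduces them into the final global histogram. Re-reading this buffer
// once per block per slice is the redundant traffic this PR removes.
__global__ void globalReduceSketch(const unsigned int* localHist,
                                   int numBlocks, unsigned int* globalHist) {
  for (int bin = threadIdx.x; bin < RADIX_SIZE; bin += blockDim.x) {
    unsigned int sum = 0;
    for (int b = 0; b < numBlocks; ++b)
      sum += localHist[(long)b * RADIX_SIZE + bin];
    globalHist[bin] = sum;
  }
}
```

In this sketch, stage 2 would be launched as a single block of at least 256 threads once every stage-1 block has finished; the real kernels also carry per-slice offsets and k-selection state, which are omitted here.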
# This PR

To address this inefficiency, we introduce the following optimizations:

1. **Dedicated kernel**: Refactor global histogram and cumsum computation into a separate GPU kernel, `computeDigitCumSum`.
2. **Loop unrolling**: Apply loop unrolling in `computeDigitCumSum` to speed up local histogram reads (see the sketch after this list).
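As a rough illustration of this idea, the following is a hedged sketch of what a dedicated histogram-plus-cumsum kernel could look like. The name `digitCumSumSketch`, the 256-bin layout, the unroll factor, and the single-slice-per-block simplification are assumptions for exposition, not the actual `computeDigitCumSum` implementation.

```cpp
// Illustrative one-block-per-slice kernel: sum the local histograms for a
// slice and produce an exclusive prefix sum over the digit bins. This is a
// sketch under assumed layouts, not the actual computeDigitCumSum kernel.
__global__ void digitCumSumSketch(const unsigned int* localHist,
                                  int blocksPerSlice,
                                  unsigned int* digitCumSum) {
  constexpr int RADIX_SIZE = 256;
  __shared__ unsigned int counts[RADIX_SIZE];
  const unsigned int* sliceHist =
      localHist + (long)blockIdx.x * blocksPerSlice * RADIX_SIZE;

  // Each thread owns one digit bin and sums it across the local histograms.
  // The unroll hint issues several independent loads per iteration, which is
  // the "loop unrolling" speedup referred to above.
  for (int bin = threadIdx.x; bin < RADIX_SIZE; bin += blockDim.x) {
    unsigned int sum = 0;
    #pragma unroll 4
    for (int b = 0; b < blocksPerSlice; ++b)
      sum += sliceHist[(long)b * RADIX_SIZE + bin];
    counts[bin] = sum;
  }
  __syncthreads();

  // Serial exclusive prefix sum over the 256 bins, kept on thread 0 for
  // clarity; a real kernel would likely use a parallel block-wide scan.
  if (threadIdx.x == 0) {
    unsigned int running = 0;
    for (int bin = 0; bin < RADIX_SIZE; ++bin) {
      unsigned int c = counts[bin];
      digitCumSum[(long)blockIdx.x * RADIX_SIZE + bin] = running;
      running += c;
    }
  }
}
```

Launched with one block per slice, each row of local histograms is read exactly once, rather than once per `blocks_per_slice` blocks as in the merged kernel.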
# Performance

We benchmarked torch.topk on NVIDIA H20 with float32 inputs, extracting the top-100 values across different input sizes. The results in the table below demonstrate that this PR effectively eliminates the performance regression introduced in 2.7.0 and delivers substantial improvements on large inputs.

| Input Shape    | torch 2.6.0 (ms) | torch 2.8.0 (ms) | 2.8.0 + this PR (ms) |
| -------------- | ---------------- | ---------------- | -------------------- |
| (1, 1B)        | 36.6             | 1564.1           | 25.6                 |
| (1, 100M)      | 3.56             | 17.4             | 2.54                 |
| (1, 1,000,000) | 0.135            | 0.145            | 0.098                |
| (512, 128000)  | 1.33             | 1.33             | 1.32                 |
| (8192, 128000) | 19.6             | 19.6             | 19.4                 |
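For reference, a timing harness along these lines can reproduce this kind of measurement. This is a sketch using the libtorch C++ API; the shape, k, warm-up count, and iteration count are assumptions rather than the exact benchmark behind the table.

```cpp
// Hedged benchmark sketch (libtorch C++ API); parameters are illustrative.
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  torch::NoGradGuard no_grad;
  auto x = torch::randn({1, 100000000},
                        torch::dtype(torch::kFloat32).device(torch::kCUDA));

  // Warm up so one-time setup costs do not skew the measurement.
  for (int i = 0; i < 3; ++i) torch::topk(x, /*k=*/100, /*dim=*/-1);
  torch::cuda::synchronize();

  constexpr int kIters = 20;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) torch::topk(x, /*k=*/100, /*dim=*/-1);
  torch::cuda::synchronize();
  auto end = std::chrono::steady_clock::now();

  double ms =
      std::chrono::duration<double, std::milli>(end - start).count() / kIters;
  std::cout << "torch.topk average: " << ms << " ms" << std::endl;
  return 0;
}
```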
In addition, I have verified the correctness of this PR with different inputs.
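One way to spot-check correctness is to compare the top-k values against a full sort. The snippet below is a hedged example of such a check, again using the libtorch C++ API; the shapes and k are assumptions, not the exact test cases used.

```cpp
// Hedged correctness spot-check: topk values vs. a full descending sort.
#include <torch/torch.h>

int main() {
  auto x = torch::randn({512, 128000},
                        torch::dtype(torch::kFloat32).device(torch::kCUDA));
  auto [vals, idxs] = torch::topk(x, /*k=*/100, /*dim=*/-1,
                                  /*largest=*/true, /*sorted=*/true);

  // Reference: sort each row in descending order and take the first k values.
  auto ref = std::get<0>(torch::sort(x, /*dim=*/-1, /*descending=*/true))
                 .narrow(/*dim=*/1, /*start=*/0, /*length=*/100);
  TORCH_CHECK(torch::equal(vals, ref),
              "topk values differ from sorted reference");

  // The returned indices must gather exactly the returned values.
  TORCH_CHECK(torch::equal(x.gather(/*dim=*/1, idxs), vals),
              "topk indices do not point at the reported values");
  return 0;
}
```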