[ATEN][CUDA] Reduce register pressure introduced by CUDA_KERNEL_ASSERT to improve torch.EmbeddingBag performance #167834
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167834
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit 3680b61 with merge base 226850c. UNSTABLE - the following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
To add the ciflow label … This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@eqy you can't move CUDA_KERNEL_ASSERT out of the main loop, because you'll IMA (illegal memory access) on the read in the cases where CUDA_KERNEL_ASSERT would have fired
@YyWangCS do you have the benchmark script you used for this? Interested in playing around with it
nvm, I vibecoded a benchmark script and tried out evaluating the check in the existing loop (with a branch to avoid the IMA), and it made things worse lol
Apparently branching itself is the problem |
I have tested the following solutions, with the baseline time of 278.6 µs shown in the 4th row of the table in the PR description:
This is why I chose the current solution. Besides, I also tested on different CUDA versions (12.6, 12.8, 13.0), and they all show the same performance issue.
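For reference, a minimal benchmark sketch along these lines (not the actual script used by either commenter) could look like the following, assuming the configuration from the PR description: a 5M × 128 table, 2048 bags, ~150 indices per bag, "random id" distribution.

```python
import torch

# Hypothetical benchmark configuration, taken from the PR description.
num_rows, dim, num_bags, avg_bag = 5_000_000, 128, 2048, 150

bag = torch.nn.EmbeddingBag(num_rows, dim, mode="sum").cuda()
# "random id" distribution: indices sampled uniformly from the full vocabulary.
indices = torch.randint(0, num_rows, (num_bags * avg_bag,), device="cuda")
offsets = torch.arange(0, indices.numel(), avg_bag, device="cuda")

# Warm up, then time the forward pass with CUDA events.
for _ in range(10):
    bag(indices, offsets)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    bag(indices, offsets)
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / iters * 1e3:.1f} us per forward")
```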
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[ATEN][CUDA] Reduce register pressure introduced by CUDA_KERNEL_ASSERT to improve torch.EmbeddingBag performance (pytorch#167834)

# Summary

This PR optimizes the CUDA kernels for `torch.nn.EmbeddingBag` by reducing GPU register pressure introduced by `CUDA_KERNEL_ASSERT`, which improves kernel occupancy and overall performance. The optimization separates input validation into a dedicated loop before the main processing loop, allowing the compiler to better optimize register allocation. Extensive testing on various GPUs and CUDA versions shows that `torch.nn.EmbeddingBag` performance improves by 29% to 111% with this PR.

# Performance Results

The following table shows the performance improvements for various input distributions and GPUs. All benchmarks use PyTorch 2.9.0 compiled with CUDA 12.8.

**Input Distribution Types (simulating recommendation system ID patterns):**

- **random id**: Randomly sampled embedding indices from the full vocabulary (uniform distribution)
- **one-hot**: One ID appears with very high frequency across all bags, simulating a popular item in recommendation systems
- **multi-hot**: Multiple IDs appear with high frequency across all bags, simulating multiple popular items in recommendation systems

**Test Configuration:**

- Embedding shape: `(5000000, 128)` (5M vocabulary size, 128-dimensional embeddings)
- Batch size: 2048 bags
- Average bag size: 150 indices per bag

| GPU | Input Distribution | Before (µs) | After (µs) | Speedup |
| ---- | ------------------ | ----------- | ---------- | ------- |
| H100 | random id | 162.4 | 105.9 | 1.53× |
| H100 | one-hot | 120.4 | 88.6 | 1.36× |
| H100 | multi-hot | 113.1 | 87.8 | 1.29× |
| H20 | random id | 278.6 | 132.2 | 2.11× |
| H20 | one-hot | 189.7 | 110.3 | 1.72× |
| H20 | multi-hot | 172.4 | 107.4 | 1.61× |

# Motivation

The original implementation performed bounds checking with `CUDA_KERNEL_ASSERT` inline within the main processing loop, which increased register pressure and limited GPU occupancy. NSight Compute analysis on H20 (PyTorch 2.9 compiled with CUDA 12.8) shows that removing the `CUDA_KERNEL_ASSERT` from the main loop with this PR increases overall occupancy from 50% to 75% (registers per thread drop from 52 to 40).

By separating validation into a dedicated loop, we:

1. **Reduce register pressure in the main loop**: The validation loop uses minimal registers, allowing the compiler to optimize the main processing loop independently with better register allocation.
2. **Maintain correctness**: All input validation is still performed, but in a more register-efficient manner.

# Changes

## Modified Kernels

1. **`EmbeddingBag_updateOutputKernel_max`**: Added separate validation loop before main processing
2. **`EmbeddingBag_updateOutputKernel_sum_mean`**: Added separate validation loop before main processing

## Key Implementation Details

- **Separate validation loop**: Input indices are validated in a dedicated loop that checks all indices before processing begins
- **No early exit**: The validation loop intentionally avoids using `break` for early exit, as benchmarking showed that early exit degrades performance, possibly due to increased branch divergence and reduced instruction-level parallelism
- **Consistent error messages**: Improved error message clarity for invalid input indices
- **Design choice: validation loop vs. separate kernel**: We considered removing `CUDA_KERNEL_ASSERT` entirely and performing bounds checking in a separate GPU kernel, which would achieve even better performance (e.g., on H20 with the random id distribution: 132.2 µs → 124.6 µs). However, this approach is harder to maintain, as it requires coordinating two separate kernel launches and managing additional kernel launch overhead. Instead, we chose the current approach of a separate validation loop within the same kernel, which provides a good balance between performance improvement and code maintainability.

## Code Changes

```cpp
// Separate validation loop reduces register pressure in the main loop below.
// No early exit (break) on invalid input, as benchmarking shows it degrades performance.
bool has_invalid_index = false;
for (int64_t emb = begin; emb < end; emb++) {
  index_t input_idx = input[emb];
  has_invalid_index = has_invalid_index || (input_idx < 0 || input_idx >= numRows);
}
CUDA_KERNEL_ASSERT(!has_invalid_index && "Invalid input index in EmbeddingBag: index out of range [0, numRows)");

// Main processing loop (now with reduced register pressure)
for (int64_t emb = begin; emb < end; emb++) {
  // ... processing logic ...
}
```
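As a user-visible illustration (not part of this PR's diff), the `CUDA_KERNEL_ASSERT` above is what surfaces in Python as a device-side assert when an index is out of range. A minimal sketch:

```python
import torch

# Small hypothetical table; any index >= num_embeddings (or < 0) is invalid.
bag = torch.nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="sum").cuda()
indices = torch.tensor([1, 2, 99], device="cuda")  # 99 is out of range
offsets = torch.tensor([0], device="cuda")

try:
    out = bag(indices, offsets)
    torch.cuda.synchronize()  # kernel asserts are asynchronous; sync to observe
except RuntimeError as e:
    # Expect "CUDA error: device-side assert triggered". The CUDA context is
    # unusable after a device-side assert, so run this in a throwaway process.
    print(e)
```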
# Testing & Compatibility

## Performance Testing

I conducted extensive performance testing across multiple configurations. All tests show significant performance improvements:

**Tested CUDA Versions:**

- CUDA 12.6, 12.8, 13.0

**Tested GPU Architectures:**

- A100, H20, H100

**Tested Input Configurations:**

- **Embedding shapes**: Various sizes including `[5000000, 128]` and `[128000, 4096]`
- **Embedding dtypes**: `torch.float32`, `torch.float16`
- **Input distributions**: Random indices, one-hot (high-frequency single ID), and multi-hot (high-frequency multiple IDs) patterns, simulating recommendation system workloads
- **Input sizes**: Average bag sizes of 150, 20, and 10 indices per bag

## Correctness Testing

- ✅ Correctness tests pass for various embedding dtypes (bfloat16, float32), shapes, and input distributions (a sketch of this style of check follows below)
- ✅ Register usage reduction verified with NSight Compute
- ✅ Linter passes
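A minimal sketch of this kind of correctness check (not the PR's actual test), comparing the CUDA kernels against the CPU reference across modes and dtypes, with hypothetical sizes:

```python
import torch

torch.manual_seed(0)
# Hypothetical sizes, smaller than the benchmark configuration.
num_rows, dim, num_bags, avg_bag = 128_000, 128, 64, 20

indices = torch.randint(0, num_rows, (num_bags * avg_bag,))
offsets = torch.arange(0, indices.numel(), avg_bag)

for mode in ("sum", "mean", "max"):
    for dtype in (torch.float32, torch.bfloat16):
        weight = torch.randn(num_rows, dim, dtype=dtype)
        ref = torch.nn.functional.embedding_bag(indices, weight, offsets, mode=mode)
        out = torch.nn.functional.embedding_bag(
            indices.cuda(), weight.cuda(), offsets.cuda(), mode=mode
        )
        # Loose tolerances to accommodate bfloat16 accumulation differences.
        torch.testing.assert_close(out.cpu(), ref, rtol=2e-2, atol=2e-2)
print("CPU and CUDA EmbeddingBag outputs match")
```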
## Compatibility

- ✅ No API/ABI changes

Pull Request resolved: pytorch#167834
Approved by: https://github.com/ngimel, https://github.com/eqy