[ATEN][CUDA] Reduce register pressure introduced by CUDA_KERNEL_ASSERT to improve torch.EmbeddingBag performance #167834
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167834
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit 3680b61 with merge base 226850c. UNSTABLE - the following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
To add the ciflow label … This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@eqy you can't move CUDA_KERNEL_ASSERT out of the main loop, because you'll IMA (illegal memory access) on the read in the cases where CUDA_KERNEL_ASSERT would have fired
@YyWangCS do you have the benchmark script you used for this? Interested in playing around with it
nvm, I vibecoded a benchmark script and tried out evaluating the check in the existing loop (with a branch to avoid the IMA), and it made things worse lol
Apparently branching itself is the problem |
I have tested the following solutions, with the baseline time of 278.6 µs shown in the 4th row of the table in the PR description:
This is why I chose the current solution. Besides, I also tested on different CUDA versions (12.6, 12.8, 13.0), and they all show the same performance issue.
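For reference, a minimal benchmark sketch along these lines (not the actual script used by either commenter) could look like the following, assuming the configuration from the PR description: a 5M × 128 table, 2048 bags, ~150 indices per bag, "random id" distribution.

```python
import torch

# Hypothetical benchmark configuration, taken from the PR description.
num_rows, dim, num_bags, avg_bag = 5_000_000, 128, 2048, 150

bag = torch.nn.EmbeddingBag(num_rows, dim, mode="sum").cuda()
# "random id" distribution: indices sampled uniformly from the full vocabulary.
indices = torch.randint(0, num_rows, (num_bags * avg_bag,), device="cuda")
offsets = torch.arange(0, indices.numel(), avg_bag, device="cuda")

# Warm up, then time the forward pass with CUDA events.
for _ in range(10):
    bag(indices, offsets)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    bag(indices, offsets)
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / iters * 1e3:.1f} us per forward")
```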
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[ATEN][CUDA] Reduce register pressure introduced by CUDA_KERNEL_ASSERT to improve torch.EmbeddingBag performance (pytorch#167834)

# Summary

This PR optimizes the CUDA kernels for `torch.nn.EmbeddingBag` by reducing GPU register pressure introduced by `CUDA_KERNEL_ASSERT`, which improves kernel occupancy and overall performance. The optimization separates input validation into a dedicated loop before the main processing loop, allowing the compiler to better optimize register allocation. Extensive testing on various GPUs and CUDA versions shows that `torch.nn.EmbeddingBag` performance improves by 29% to 111% with this PR.

# Performance Results

The following table shows the performance improvements for various input distributions and GPUs. All benchmarks use PyTorch 2.9.0 compiled with CUDA 12.8.

**Input Distribution Types (simulating recommendation system ID patterns):**

- **random id**: Randomly sampled embedding indices from the full vocabulary (uniform distribution)
- **one-hot**: One ID appears with very high frequency across all bags, simulating a popular item in recommendation systems
- **multi-hot**: Multiple IDs appear with high frequency across all bags, simulating multiple popular items in recommendation systems

**Test Configuration:**

- Embedding shape: `(5000000, 128)` (5M vocabulary size, 128-dimensional embeddings)
- Batch size: 2048 bags
- Average bag size: 150 indices per bag

| GPU | Input Distribution | Before (µs) | After (µs) | Speedup |
| ---- | ------------------ | ----------- | ---------- | ------- |
| H100 | random id | 162.4 | 105.9 | 1.53× |
| H100 | one-hot | 120.4 | 88.6 | 1.36× |
| H100 | multi-hot | 113.1 | 87.8 | 1.29× |
| H20 | random id | 278.6 | 132.2 | 2.11× |
| H20 | one-hot | 189.7 | 110.3 | 1.72× |
| H20 | multi-hot | 172.4 | 107.4 | 1.61× |

# Motivation

The original implementation performed bounds checking with `CUDA_KERNEL_ASSERT` inline within the main processing loop, which increased register pressure and limited GPU occupancy. NSight Compute analysis on H20 (PyTorch 2.9 compiled with CUDA 12.8) shows that removing the `CUDA_KERNEL_ASSERT` from the main loop with this PR increases overall occupancy from 50% to 75% (registers per thread drop from 52 to 40).

By separating validation into a dedicated loop, we:

1. **Reduce register pressure in the main loop**: The validation loop uses minimal registers, allowing the compiler to optimize the main processing loop independently with better register allocation.
2. **Maintain correctness**: All input validation is still performed, but in a more register-efficient manner.

# Changes

## Modified Kernels

1. **`EmbeddingBag_updateOutputKernel_max`**: Added separate validation loop before main processing
2. **`EmbeddingBag_updateOutputKernel_sum_mean`**: Added separate validation loop before main processing

## Key Implementation Details

- **Separate validation loop**: Input indices are validated in a dedicated loop that checks all indices before processing begins
- **No early exit**: The validation loop intentionally avoids using `break` for early exit, as benchmarking showed that early exit degrades performance, possibly due to increased branch divergence and reduced instruction-level parallelism
- **Consistent error messages**: Improved error message clarity for invalid input indices
- **Design choice: validation loop vs. separate kernel**: We considered removing `CUDA_KERNEL_ASSERT` entirely and performing bounds checking in a separate GPU kernel, which would achieve even better performance (e.g., on H20 with the random id distribution: 132.2 µs → 124.6 µs). However, this approach is harder to maintain, as it requires coordinating two separate kernel launches and managing additional kernel launch overhead. Instead, we chose the current approach of a separate validation loop within the same kernel, which provides a good balance between performance improvement and code maintainability.

## Code Changes

```cpp
// Separate validation loop reduces register pressure in the main loop below.
// No early exit (break) on invalid input, as benchmarking shows it degrades performance.
bool has_invalid_index = false;
for (int64_t emb = begin; emb < end; emb++) {
  index_t input_idx = input[emb];
  has_invalid_index = has_invalid_index || (input_idx < 0 || input_idx >= numRows);
}
CUDA_KERNEL_ASSERT(!has_invalid_index && "Invalid input index in EmbeddingBag: index out of range [0, numRows)");

// Main processing loop (now with reduced register pressure)
for (int64_t emb = begin; emb < end; emb++) {
  // ... processing logic ...
}
```
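As a user-visible illustration (not part of this PR's diff), the `CUDA_KERNEL_ASSERT` above is what surfaces in Python as a device-side assert when an index is out of range. A minimal sketch:

```python
import torch

# Small hypothetical table; any index >= num_embeddings (or < 0) is invalid.
bag = torch.nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="sum").cuda()
indices = torch.tensor([1, 2, 99], device="cuda")  # 99 is out of range
offsets = torch.tensor([0], device="cuda")

try:
    out = bag(indices, offsets)
    torch.cuda.synchronize()  # kernel asserts are asynchronous; sync to observe
except RuntimeError as e:
    # Expect "CUDA error: device-side assert triggered". The CUDA context is
    # unusable after a device-side assert, so run this in a throwaway process.
    print(e)
```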
# Testing & Compatibility

## Performance Testing

I conducted extensive performance testing across multiple configurations. All tests show significant performance improvements:

**Tested CUDA Versions:**

- CUDA 12.6, 12.8, 13.0

**Tested GPU Architectures:**

- A100, H20, H100

**Tested Input Configurations:**

- **Embedding shapes**: Various sizes including `[5000000, 128]` and `[128000, 4096]`
- **Embedding dtypes**: `torch.float32`, `torch.float16`
- **Input distributions**: Random indices, one-hot (high-frequency single ID), and multi-hot (high-frequency multiple IDs) patterns, simulating recommendation system workloads
- **Input sizes**: Average bag sizes of 150, 20, and 10 indices per bag

## Correctness Testing

- ✅ Correctness tests pass for various embedding dtypes (bfloat16, float32), shapes, and input distributions (a sketch of this style of check follows below)
- ✅ Register usage reduction verified with NSight Compute
- ✅ Linter passes
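A minimal sketch of this kind of correctness check (not the PR's actual test), comparing the CUDA kernels against the CPU reference across modes and dtypes, with hypothetical sizes:

```python
import torch

torch.manual_seed(0)
# Hypothetical sizes, smaller than the benchmark configuration.
num_rows, dim, num_bags, avg_bag = 128_000, 128, 64, 20

indices = torch.randint(0, num_rows, (num_bags * avg_bag,))
offsets = torch.arange(0, indices.numel(), avg_bag)

for mode in ("sum", "mean", "max"):
    for dtype in (torch.float32, torch.bfloat16):
        weight = torch.randn(num_rows, dim, dtype=dtype)
        ref = torch.nn.functional.embedding_bag(indices, weight, offsets, mode=mode)
        out = torch.nn.functional.embedding_bag(
            indices.cuda(), weight.cuda(), offsets.cuda(), mode=mode
        )
        # Loose tolerances to accommodate bfloat16 accumulation differences.
        torch.testing.assert_close(out.cpu(), ref, rtol=2e-2, atol=2e-2)
print("CPU and CUDA EmbeddingBag outputs match")
```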
## Compatibility

- ✅ No API/ABI changes

Pull Request resolved: pytorch#167834
Approved by: https://github.com/ngimel, https://github.com/eqy