[ATEN][CUDA] Reduce register pressure introduced by CUDA_KERNEL_ASSERT to improve torch.EmbeddingBag performance #167834

Closed
YyWangCS wants to merge 4 commits into pytorch:main from YyWangCS:YyWangCS/embedding_occupancy

Conversation

@YyWangCS (Contributor) commented Nov 14, 2025

Summary

This PR optimizes the CUDA kernels for torch.nn.EmbeddingBag by reducing the GPU register pressure introduced by CUDA_KERNEL_ASSERT, which improves kernel occupancy and overall performance. The optimization separates input validation into a dedicated loop before the main processing loop, allowing the compiler to allocate registers for the main loop more effectively. In extensive testing across GPUs and CUDA versions, this PR improves torch.nn.EmbeddingBag performance by 29% to 111%.

Performance Results

The following table shows the performance improvements on various input distributions and GPUs. All benchmarks use PyTorch 2.9.0 compiled with CUDA 12.8.

Input Distribution Types (simulating recommendation system ID patterns):

  • random id: Randomly sampled embedding indices from the full vocabulary (uniform distribution)
  • one-hot: One ID appears with very high frequency across all bags, simulating a popular item in recommendation systems
  • multi-hot: Multiple IDs appear with high frequency across all bags, simulating multiple popular items in recommendation systems

Test Configuration:

  • Embedding shape: (5000000, 128) (5M vocabulary size, 128-dimensional embeddings)
  • Batch size: 2048 bags
  • Average bag size: 150 indices per bag
| GPU  | Input Distribution | Before (µs) | After (µs) | Speedup |
| ---- | ------------------ | ----------- | ---------- | ------- |
| H100 | random id          | 162.4       | 105.9      | 1.53×   |
| H100 | one-hot            | 120.4       | 88.6       | 1.36×   |
| H100 | multi-hot          | 113.1       | 87.8       | 1.29×   |
| H20  | random id          | 278.6       | 132.2      | 2.11×   |
| H20  | one-hot            | 189.7       | 110.3      | 1.72×   |
| H20  | multi-hot          | 172.4       | 107.4      | 1.61×   |
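
For reference, the following is a minimal libtorch C++ sketch of a benchmark along these lines. It assumes the configuration above with fixed-size bags; the 50% hot-id fraction, the choice of popular ids, and the iteration counts are illustrative assumptions, not taken from the PR.

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

// Build one of the three index distributions. The 50% hot-id fraction and
// the set of popular ids are illustrative assumptions.
torch::Tensor make_indices(const std::string& kind, int64_t n, int64_t num_rows) {
  auto opts = torch::TensorOptions().dtype(torch::kLong).device(torch::kCUDA);
  auto ids = torch::randint(num_rows, {n}, opts);          // "random id"
  if (kind != "random") {
    auto mask = torch::rand({n}, opts.dtype(torch::kFloat)) < 0.5;
    auto hot = (kind == "one-hot")
        ? torch::zeros({n}, opts)                          // one popular id
        : torch::randint(8, {n}, opts);                    // a few popular ids
    ids = torch::where(mask, hot, ids);
  }
  return ids;
}

int main() {
  torch::NoGradGuard no_grad;
  const int64_t num_rows = 5'000'000, dim = 128;
  const int64_t num_bags = 2048, bag_size = 150;           // fixed-size bags

  auto bag = torch::nn::EmbeddingBag(
      torch::nn::EmbeddingBagOptions(num_rows, dim).mode(torch::kSum));
  bag->to(torch::kCUDA);

  auto input = make_indices("random", num_bags * bag_size, num_rows);
  auto offsets = torch::arange(0, num_bags * bag_size, bag_size,
      torch::TensorOptions().dtype(torch::kLong).device(torch::kCUDA));

  for (int i = 0; i < 10; i++) bag->forward(input, offsets);  // warmup
  torch::cuda::synchronize();

  const int iters = 100;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; i++) bag->forward(input, offsets);
  torch::cuda::synchronize();
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "EmbeddingBag forward: "
            << std::chrono::duration<double, std::micro>(t1 - t0).count() / iters
            << " us/iter\n";
}
```

The same harness covers the one-hot and multi-hot rows by changing the make_indices argument.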

Motivation

The original implementation performed bounds checking with CUDA_KERNEL_ASSERT inline in the main processing loop, which increased register pressure and limited GPU occupancy. NSight Compute analysis on an H20 (PyTorch 2.9 compiled with CUDA 12.8) shows that removing the CUDA_KERNEL_ASSERT from the main loop, as this PR does, raises overall occupancy from 50% to 75% (registers per thread drop from 52 to 40).

By separating validation into a dedicated loop (the original inline pattern is sketched after this list for contrast), we:

  1. Reduce register pressure in the main loop: The validation loop uses minimal registers, allowing the compiler to optimize the main processing loop independently with better register allocation.
  2. Maintain correctness: All input validation is still performed, but in a more register-efficient manner.
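
For contrast, this is a minimal sketch of the pre-PR structure, using the same names (input, begin, end, numRows) as the diff under Code Changes; it illustrates the pattern and is not the exact upstream code.

```cpp
// Pre-PR pattern (sketch): the assert executes on every iteration of the
// main loop, so the registers backing its error-reporting path stay live
// for the whole loop body, inflating per-thread register usage.
for (int64_t emb = begin; emb < end; emb++) {
  index_t input_idx = input[emb];
  CUDA_KERNEL_ASSERT(input_idx >= 0 && input_idx < numRows);
  // ... accumulate the embedding row for input_idx into the bag ...
}
```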

Changes

Modified Kernels

  1. EmbeddingBag_updateOutputKernel_max: Added separate validation loop before main processing
  2. EmbeddingBag_updateOutputKernel_sum_mean: Added separate validation loop before main processing

Key Implementation Details

  • Separate validation loop: Input indices are validated in a dedicated loop that checks all indices before processing begins
  • No early exit: The validation loop intentionally avoids using break for early exit, as benchmarking showed that early exit degrades performance, possibly due to increased branch divergence and reduced instruction-level parallelism
  • Consistent error messages: Improved error message clarity for invalid input indices
  • Design choice: validation loop vs. separate kernel: We considered removing CUDA_KERNEL_ASSERT entirely and performing bounds checking in a separate GPU kernel, which would be even faster (e.g., on H20 with the random id distribution: 132.2 µs → 124.6 µs). However, that approach is harder to maintain: it requires coordinating two kernel launches and paying additional launch overhead. We instead chose a separate validation loop within the same kernel, which balances performance against maintainability; a sketch of the rejected alternative follows this list.
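
For reference, a minimal sketch of the rejected standalone-kernel alternative; the kernel name, the error_flag plumbing, and the launch geometry are illustrative assumptions, not code from the PR.

```cpp
#include <cstdint>

// Hypothetical standalone bounds-checking kernel, launched before the main
// EmbeddingBag kernel. A grid-stride loop records any out-of-range index in
// a device-side flag, which the caller can assert on or copy back and check.
// This frees the main kernel of all assert overhead, at the cost of a second
// kernel launch and extra host-side coordination.
template <typename index_t>
__global__ void check_indices_kernel(const index_t* input, int64_t num_indices,
                                     int64_t num_rows, int* error_flag) {
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < num_indices;
       i += static_cast<int64_t>(gridDim.x) * blockDim.x) {
    const index_t idx = input[i];
    if (idx < 0 || idx >= num_rows) {
      atomicExch(error_flag, 1);  // record failure; no in-kernel assert needed
    }
  }
}
```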

Code Changes

```cpp
// Separate validation loop reduces register pressure in the main loop below.
// No early exit (break) on invalid input as benchmarking shows it degrades performance.
bool has_invalid_index = false;
for (int64_t emb = begin; emb < end; emb++) {
  index_t input_idx = input[emb];
  has_invalid_index = has_invalid_index || (input_idx < 0 || input_idx >= numRows);
}
CUDA_KERNEL_ASSERT(!has_invalid_index && "Invalid input index in EmbeddingBag: index out of range [0, numRows)");

// Main processing loop (now with reduced register pressure)
for (int64_t emb = begin; emb < end; emb++) {
  // ... processing logic ...
}
```

Testing & Compatibility

Performance Testing

I conducted extensive performance testing across multiple configurations. All tests show significant performance improvements:

Tested CUDA Versions:

  • CUDA 12.6, 12.8, 13.0

Tested GPU Architectures:

  • A100, H20, H100

Tested Input Configurations:

  • Embedding shapes: Various sizes including [5000000, 128] and [128000, 4096]
  • Embedding dtypes: torch.float32, torch.float16
  • Input distributions: Random indices, one-hot (high-frequency single ID), and multi-hot (high-frequency multiple IDs) patterns, simulating recommendation system workloads
  • Input sizes: Average bag sizes of 150, 20, and 10 indices per bag

Correctness Testing

  • ✅ Correctness tests pass for various embedding types (bfloat16, float32), shapes, and input distributions
  • ✅ Register usage reduction verified with NSight Compute
  • ✅ Linter passes

Compatibility

  • ✅ No API/ABI changes

pytorch-bot bot commented Nov 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167834

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 3680b61 with merge base 226850c:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the release notes: cuda label Nov 14, 2025

eqy added the ciflow/h100 and ciflow/trunk labels Nov 14, 2025
pytorch-bot bot commented Nov 14, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot removed the ciflow/trunk label Nov 14, 2025
@eqy (Collaborator) left a comment

Did you also explore, e.g., moving the CUDA_KERNEL_ASSERT out of the main loop but keeping the has_invalid_index check inside it, since these appear to be orthogonal changes?

@ngimel (Collaborator) commented Nov 14, 2025

@eqy you can't move CUDA_KERNEL_ASSERT out of the main loop, because you'd hit an IMA (illegal memory access) on the read if the CUDA_KERNEL_ASSERT would have fired

@eqy (Collaborator) commented Nov 14, 2025

@YyWangCS do you have the benchmark script you used for this? I'm interested in playing around with it

@eqy (Collaborator) commented Nov 14, 2025

nvm, I vibecoded a benchmark script and tried out evaluating the check in the existing loop (with a branch to avoid IMA) and it made things worse lol

@ngimel (Collaborator) commented Nov 14, 2025

Apparently branching itself is the problem
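
For concreteness, the in-loop branch variant being discussed presumably looks something like the following; this is a reconstruction for illustration, not eqy's script or the PR's code.

```cpp
// In-loop branch variant (sketch): guarding the read avoids the illegal
// memory access, but the benchmarking above suggests the per-iteration
// branch itself costs more than a separate validation loop.
bool has_invalid_index = false;
for (int64_t emb = begin; emb < end; emb++) {
  index_t input_idx = input[emb];
  if (input_idx < 0 || input_idx >= numRows) {
    has_invalid_index = true;  // defer the assert until after the loop
    continue;                  // skip the out-of-bounds read
  }
  // ... processing logic ...
}
CUDA_KERNEL_ASSERT(!has_invalid_index && "Invalid input index in EmbeddingBag");
```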

@YyWangCS (Contributor, Author) replied:

> nvm, I vibecoded a benchmark script and tried out evaluating the check in the existing loop (with a branch to avoid IMA) and it made things worse lol

I have tested the following solutions, with the baseline time of 278.6 µs shown in the fourth row of the table:

  1. Our solution, which uses a separate validation loop: latency is 132.2 µs
  2. Removing CUDA_KERNEL_ASSERT completely, with no validation: latency is 124.6 µs
  3. A separate validation loop, but with CUDA_KERNEL_ASSERT inside the loop: latency is 148 µs
  4. Using break for early exit in the validation loop: I do not remember the exact latency, but it was much worse than 132.2 µs

This is why I chose the current solution. I also tested different CUDA versions (12.6, 12.8, 13.0), and all of them show this performance issue.
cc @ngimel

@YyWangCS (Contributor, Author) commented:

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Nov 15, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

This PR (#167834) was merged in d7782dd but it is still open, likely due to a GitHub bug, so mergebot is closing it manually. If you think this is a mistake, please feel free to reopen it and contact Dev Infra.

AshiteshSingh pushed a commit to AshiteshSingh/pytorch that referenced this pull request Nov 15, 2025
…T to improve torch.EmbeddingBag performance (pytorch#167834)

Pull Request resolved: pytorch#167834
Approved by: https://github.com/ngimel, https://github.com/eqy
jsuarez5341 pushed a commit to PufferAI/pytorch that referenced this pull request Nov 15, 2025
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025