[ROCm] roll kernel as grid stride loop #169474
PaulMullowney wants to merge 7 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169474
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 8ed9462 with merge base 7c593b9:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cherry-pick of pytorch#169474
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#169474
Approved by: https://github.com/jerrymannil, https://github.com/jeffdaily
@pytorchbot merge -f "unrelated rocm failure, all other CI passing on trunk"
stale browser window, sorry
Can't merge closed PR #169474
Reimplement the roll kernel as a grid stride loop. On AMD devices, we see launch failures in the original version when gridDim.x * blockDim.x exceeds 4294967295 (the 32-bit unsigned maximum). This implementation should work and be performant on both AMD and NVIDIA devices; a sketch of the grid-stride pattern is included after the repro below. The issue can be seen on AMD devices with the following small repro:
import torch
N = 21913096
input_tensor_torch = torch.randn(1, 2, N, 98, device='cuda')
output = input_tensor_torch.roll(-1, dims=1)
input_tensor_torch_cpu = input_tensor_torch.cpu()
output_cpu = input_tensor_torch_cpu.roll(-1, dims=1)
assert torch.equal(output.cpu(), output_cpu)
Gives:
torch.AcceleratorError: HIP error: invalid configuration argument
If you set N=21913095, the original version of the kernel runs successfully.
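For context, here is a minimal, self-contained sketch of the grid-stride pattern, not the kernel actually landed in this PR: the kernel name roll1d_grid_stride, the 1-D shift handling, the managed-memory setup, and the 65535 block cap are all illustrative assumptions. The key point is that each thread strides through the index space by gridDim.x * blockDim.x, so the launch can cap the number of blocks instead of needing one thread per element.

// Sketch only: a hypothetical 1-D roll, out[i] = in[(i + shift) % n],
// written as a grid-stride loop. Each thread covers indices
// i, i + stride, i + 2*stride, ... with stride = gridDim.x * blockDim.x.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void roll1d_grid_stride(const float* in, float* out,
                                   int64_t n, int64_t shift) {
  const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;
  for (int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    out[i] = in[(i + shift) % n];
  }
}

int main() {
  const int64_t n = 1 << 20;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int64_t i = 0; i < n; ++i) in[i] = static_cast<float>(i);

  const int threads = 256;
  // Cap the grid: the loop inside the kernel covers whatever the grid does
  // not, so gridDim.x * blockDim.x never needs to reach n.
  const int blocks =
      static_cast<int>(std::min<int64_t>((n + threads - 1) / threads, 65535));
  roll1d_grid_stride<<<blocks, threads>>>(in, out, n, /*shift=*/1);
  cudaDeviceSynchronize();

  printf("out[0] = %f (expected 1.0)\n", out[0]);
  cudaFree(in);
  cudaFree(out);
  return 0;
}

This is only an illustration of the pattern; the actual PyTorch kernel handles arbitrary dims, strides, and dtypes. The same pattern should apply on ROCm once hipified, which is why the per-thread loop avoids the launch-configuration limit described above.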
Performance (averaged across 20 invocations) on an MI325X:

N           Original (us)   Grid Stride (us)
21913       12.5            12.4
219130      128.9           99.4
2191309     1286            1068
21913095    12381           10168
Fixes ROCm#2631
cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang