[ROCm] roll kernel as grid stride loop #169474
PaulMullowney wants to merge 7 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169474
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 8ed9462 with merge base 7c593b9:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cherry-pick of pytorch#169474
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#169474
Approved by: https://github.com/jerrymannil, https://github.com/jeffdaily
@pytorchbot merge -f "unrelated rocm failure, all other CI passing on trunk"
stale browser window, sorry
Can't merge closed PR #169474
Reimplement the roll kernel as a grid stride loop. On AMD devices, we see launch failures in the original version when gridDim.x * blockDim.x exceeds 4294967295 (the 32-bit unsigned maximum). This implementation should work and be performant on both AMD and NVIDIA devices; a sketch of the grid-stride pattern is included after the repro below. The issue can be seen on AMD devices with the following small repro:
import torch
N = 21913096
input_tensor_torch = torch.randn(1, 2, N, 98, device='cuda')
output = input_tensor_torch.roll(-1, dims=1)
input_tensor_torch_cpu = input_tensor_torch.cpu()
output_cpu = input_tensor_torch_cpu.roll(-1, dims=1)
assert torch.equal(output.cpu(), output_cpu)
Gives:
torch.AcceleratorError: HIP error: invalid configuration argument
If you set N=21913095, the original version of the kernel runs successfully.
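For context, here is a minimal, self-contained sketch of the grid-stride pattern, not the kernel actually landed in this PR: the kernel name roll1d_grid_stride, the 1-D shift handling, the managed-memory setup, and the 65535 block cap are all illustrative assumptions. The key point is that each thread strides through the index space by gridDim.x * blockDim.x, so the launch can cap the number of blocks instead of needing one thread per element.

// Sketch only: a hypothetical 1-D roll, out[i] = in[(i + shift) % n],
// written as a grid-stride loop. Each thread covers indices
// i, i + stride, i + 2*stride, ... with stride = gridDim.x * blockDim.x.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void roll1d_grid_stride(const float* in, float* out,
                                   int64_t n, int64_t shift) {
  const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;
  for (int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    out[i] = in[(i + shift) % n];
  }
}

int main() {
  const int64_t n = 1 << 20;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int64_t i = 0; i < n; ++i) in[i] = static_cast<float>(i);

  const int threads = 256;
  // Cap the grid: the loop inside the kernel covers whatever the grid does
  // not, so gridDim.x * blockDim.x never needs to reach n.
  const int blocks =
      static_cast<int>(std::min<int64_t>((n + threads - 1) / threads, 65535));
  roll1d_grid_stride<<<blocks, threads>>>(in, out, n, /*shift=*/1);
  cudaDeviceSynchronize();

  printf("out[0] = %f (expected 1.0)\n", out[0]);
  cudaFree(in);
  cudaFree(out);
  return 0;
}

This is only an illustration of the pattern; the actual PyTorch kernel handles arbitrary dims, strides, and dtypes. The same pattern should apply on ROCm once hipified, which is why the per-thread loop avoids the launch-configuration limit described above.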
Performance (averaged across 20 invocations) on an MI325X:

N           Original (us)   Grid Stride (us)
21913       12.5            12.4
219130      128.9           99.4
2191309     1286            1068
21913095    12381           10168
Fixes ROCm#2631
cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang