[ROCm] new implementation of upsample_bilinear2d_backward #164572
glen-amd wants to merge 5 commits into pytorch:main
Conversation
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/164572
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit c9eb5ce with merge base 60ac039.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@jeffdaily / @jerrymannil / @amd-hhashemi - please review. Thanks.
@jeffdaily - can you please review and add CI tags? Thanks.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Can't this zeroing be removed now that you're not using atomics?
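(Context for the question above: with an atomic scatter kernel the gradient buffer must be zero-initialized before launch because every thread accumulates into it, whereas a gather kernel that writes each element's complete value makes the pre-zeroing dead work. A minimal standalone sketch of that distinction, illustrative CUDA rather than the PR's actual code:)

```cuda
#include <cuda_runtime.h>

// Scatter/accumulate: several threads "+=" into the same slot, so dst
// must start at zero for the sums to come out right. Here 4 source
// elements fold into each destination slot (dst has n/4 elements).
__global__ void scatter_accumulate(float* dst, const float* src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd(&dst[i / 4], src[i]);
}

// Gather/overwrite: each thread computes its element's complete value
// and stores it exactly once, so pre-zeroing dst buys nothing (dst has
// n elements; src has 4*n).
__global__ void gather_overwrite(float* dst, const float* src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float sum = 0.0f;
  for (int j = 0; j < 4; ++j) sum += src[4 * i + j];
  dst[i] = sum;  // unconditional full write: no atomics, no pre-zero needed
}
```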
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
…based one to remove `atomicAdd` operations, and it appears to deliver at least a 20× speedup
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Successfully rebased. Force-pushed 3503eac to c9eb5ce.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Changed the implementation from an output-based approach to an input-based one to remove `atomicAdd` operations, and it appears to deliver at least a 20× speedup. The changes are from Yu-Yun <YuYun.Chang@amd.com>.
Summary: Refactor of the implementation of the `upsample_bilinear2d_backward` operation on MI300X/MI325X: eliminate `atomicAdd` by switching to an input-based approach (centered on a new `compute_output_range` device function).

Breakdown of the code changes
- `upsample_bilinear2d_backward_out_frame` (the kernel):
  - The thread mapping switched from one thread per output-gradient element (`const size_t o_numel = nc * width2 * height2;`) to one thread per input-gradient element (`const size_t i_numel = nc * height1 * width1;`).
  - The kernel launch in `upsample_bilinear2d_backward_out_cuda_template` was adjusted for the new thread mapping.
  - Old output-based (scatter) scheme: each thread read one output gradient, recomputed the four input positions (`input_pos`) it touched during the forward pass interpolation, and called `fastAtomicAdd` 4 times to add these values to 4 different (and potentially highly contended) memory locations in the input gradient tensor.
  - New input-based (gather) scheme: each thread owns one input-gradient element and calls `compute_output_range` to determine the small rectangular region of output pixels that influence the input's final gradient value. Contributions are accumulated in a register (`accscalar_t grad_sum = 0;`) and written back with a single plain store (`idata[index] = static_cast<scalar_t>(grad_sum);`). A simplified sketch of this kernel follows the list below.
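To make the new data flow concrete, here is a minimal sketch of an input-based (gather) kernel, assuming align_corners=false, contiguous NCHW tensors, and scale factors equal to input_size/output_size. The names (`compute_output_range`, `grad_sum`, `idata`, `odata`) mirror the fragments quoted above, but the bodies are illustrative rather than the PR's actual code:

```cuda
#include <cuda_runtime.h>

// For input pixel i, find the inclusive range of output pixels whose
// forward source coordinate src(o) = (o + 0.5) * scale - 0.5 falls in
// (i - 1, i + 1); only those outputs read input pixel i.
__device__ __forceinline__ void compute_output_range(
    int i, float scale, int out_size, int& lo, int& hi) {
  lo = max(0, (int)ceilf((i - 0.5f) / scale - 0.5f));
  hi = min(out_size - 1, (int)floorf((i + 1.5f) / scale - 0.5f));
}

// Bilinear weight that output pixel o gave to input pixel i in the
// forward pass (borders clamped the same way the forward kernel does).
__device__ __forceinline__ float source_weight(
    int i, int o, float scale, int in_size) {
  float src = fmaxf((o + 0.5f) * scale - 0.5f, 0.0f);
  int lo = min((int)src, in_size - 1);
  int hi = min(lo + 1, in_size - 1);
  float frac = src - (float)lo;
  if (i == lo) return (lo == hi) ? 1.0f : 1.0f - frac;
  return (i == hi) ? frac : 0.0f;
}

template <typename scalar_t, typename accscalar_t>
__global__ void upsample_bilinear2d_backward_input_based(
    int nc, int height1, int width1,      // grad_input dims (N*C, H_in, W_in)
    int height2, int width2,              // grad_output dims
    float rheight, float rwidth,          // H_in/H_out, W_in/W_out
    const scalar_t* __restrict__ odata,   // grad_output
    scalar_t* __restrict__ idata) {       // grad_input
  const size_t i_numel = (size_t)nc * height1 * width1;
  for (size_t index = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
       index < i_numel; index += (size_t)blockDim.x * gridDim.x) {
    const int w1 = (int)(index % width1);
    const int h1 = (int)((index / width1) % height1);
    const size_t n = index / ((size_t)width1 * height1);

    int h2lo, h2hi, w2lo, w2hi;
    compute_output_range(h1, rheight, height2, h2lo, h2hi);
    compute_output_range(w1, rwidth, width2, w2lo, w2hi);

    accscalar_t grad_sum = 0;  // private accumulator: no atomics needed
    for (int h2 = h2lo; h2 <= h2hi; ++h2) {
      const float wh = source_weight(h1, h2, rheight, height1);
      for (int w2 = w2lo; w2 <= w2hi; ++w2) {
        const float ww = source_weight(w1, w2, rwidth, width1);
        const size_t output_idx = (n * height2 + h2) * width2 + w2;
        grad_sum += (accscalar_t)(wh * ww) *
                    static_cast<accscalar_t>(odata[output_idx]);
      }
    }
    // One plain, coalescable store per input element.
    idata[index] = static_cast<scalar_t>(grad_sum);
  }
}
```

Because every thread writes a disjoint `idata[index]`, correctness needs neither atomics nor a zero-initialized buffer, which is exactly what the review question above is getting at.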
Why performance gets boosted
- No atomic contention: previously, many threads would call `atomicAdd`, frequently attempting to update the exact same memory location in the input gradient tensor at the same time (see the contrast sketch after this list); the new kernel performs no `atomicAdd` at all.
- Coalesced writes: the single store per thread (`idata[index] = static_cast<scalar_t>(grad_sum);`) can be perfectly coalesced by GPUs.
- Cache-friendly reads: each thread reads a small window of output gradients (`static_cast<accscalar_t>(odata[output_idx])`), and these reads are highly cache-friendly, meaning the data for one thread is likely to be in the L1 or L2 cache already due to an access from a neighboring thread.
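For contrast, a simplified sketch of the old output-based (scatter) scheme under the same assumptions (float-only here; the real kernel used `fastAtomicAdd` so reduced-precision types work too). Adjacent threads compute overlapping footprints, so the four `atomicAdd` calls from neighboring threads frequently hit the same `grad_input` locations:

```cuda
#include <cuda_runtime.h>

__global__ void upsample_bilinear2d_backward_output_based(
    int nc, int height1, int width1, int height2, int width2,
    float rheight, float rwidth,
    const float* __restrict__ odata, float* __restrict__ idata) {
  const size_t o_numel = (size_t)nc * height2 * width2;
  for (size_t index = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
       index < o_numel; index += (size_t)blockDim.x * gridDim.x) {
    const int w2 = (int)(index % width2);
    const int h2 = (int)((index / width2) % height2);
    const size_t n = index / ((size_t)width2 * height2);

    // Recompute the forward pass's source position and bilinear weights.
    const float h1r = fmaxf(rheight * (h2 + 0.5f) - 0.5f, 0.0f);
    const int h1 = min((int)h1r, height1 - 1);
    const int h1p = (h1 < height1 - 1) ? 1 : 0;   // border clamp
    const float h1w = h1r - h1, h0w = 1.0f - h1w;
    const float w1r = fmaxf(rwidth * (w2 + 0.5f) - 0.5f, 0.0f);
    const int w1 = min((int)w1r, width1 - 1);
    const int w1p = (w1 < width1 - 1) ? 1 : 0;
    const float w1w = w1r - w1, w0w = 1.0f - w1w;

    const float g = odata[index];
    const size_t base = (n * height1 + h1) * width1 + w1;
    // Four scattered read-modify-writes; nearby threads contend here.
    atomicAdd(&idata[base],                      h0w * w0w * g);
    atomicAdd(&idata[base + w1p],                h0w * w1w * g);
    atomicAdd(&idata[base + h1p * width1],       h1w * w0w * g);
    atomicAdd(&idata[base + h1p * width1 + w1p], h1w * w1w * g);
  }
}
```

With 2× upsampling, for example, each input pixel receives contributions from roughly a 3×3 window of output pixels, so up to nine threads can serialize on a single address in the scatter scheme.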
Optimizations of `grid_sampler_2d_backward` will be addressed in a separate PR.
Doc for reference: (internal only) https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1162750701/PyTorch__grid_sampler_2d_backward
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd