[ROCm] new implementation of upsample_bilinear2d_backward #164572

Closed
glen-amd wants to merge 5 commits into pytorch:main from glen-amd:fix_to_up_and_grid_sample_backward

Conversation

@glen-amd
Contributor

@glen-amd glen-amd commented Oct 3, 2025

Changed the implementation from an output-based approach to an input-based one to remove atomicAdd operations, and it appears to deliver at least a 20× speedup.

The changes are from Yu-Yun (YuYun.Chang@amd.com).

Summary: Refactor of the implementation of the upsample_bilinear2d_backward operation on MI300X/MI325X

  • The original "scatter-add" approach
    • Each thread, representing an output pixel, scattered gradient contributions to four input pixels, using costly atomic operations on MI300X/MI325X GPUs.
  • The new "gather-sum" approach
    • Each thread is responsible for a single input pixel and gathers all relevant gradient contributions from a small, calculated region of the output tensor (done by the compute_output_range device function).
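
For orientation, here is a minimal sketch of what such an output-range helper could look like, assuming align_corners=false and the usual src = (dst + 0.5) * scale - 0.5 source-index mapping with scale = input_size / output_size. The name mirrors the PR's compute_output_range, but the signature and boundary handling shown here are illustrative, not the merged code:

```cuda
// Sketch (not the exact PR code): for one input index, compute the inclusive
// range of output indices whose bilinear support in the forward pass touches it.
// Assumes align_corners == false and scale = input_size / output_size, i.e. the
// forward mapping src = (dst + 0.5f) * scale - 0.5f.
__device__ __forceinline__ void compute_output_range(
    int input_pos, float scale, int output_size,
    int& out_begin, int& out_end) {
  // input_pos receives a contribution whenever src falls in (input_pos - 1, input_pos + 1).
  const float lo = (input_pos - 1 + 0.5f) / scale - 0.5f;
  const float hi = (input_pos + 1 + 0.5f) / scale - 0.5f;
  out_begin = max(0, (int)ceilf(lo));
  out_end = min(output_size - 1, (int)floorf(hi));
}
```

Because this is just the mathematical inverse of the forward index mapping, each thread only has to scan a window of roughly 2 / scale + 1 output positions per dimension instead of the whole output tensor.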

Breakdown of the code changes

  • Inversion of the parallelization strategy of the kernel function upsample_bilinear2d_backward_out_frame
    • Originally, the main kernel loop was parallelized over the number of elements in the output gradient tensor (const size_t o_numel = nc * width2 * height2;).
      • Each thread processed one output pixel.
    • The new loop is parallelized over the number of elements in the input gradient tensor (const size_t i_numel = nc * height1 * width1;).
      • Each thread is responsible for calculating the final gradient for a single input pixel.
    • The kernel launch changes accordingly in the function upsample_bilinear2d_backward_out_cuda_template.
  • Added a device function for calculating the range of output pixels that could possibly have used a given input pixel (input_pos) during the forward-pass interpolation (a possible formulation is sketched above)
    • This is essentially the mathematical inverse of the forward pass.
    • This function tries to prune a thread's search space so that it only needs to inspect a small, local window of the output tensor.
  • Gradient calculation approach switching from "scatter-add" to "gather-sum"
    • Scatter-add
      • For each output pixel, the thread calculated 4 gradient contributions and used fastAtomicAdd 4 times to add these values to 4 different (and potentially highly contended) memory locations in the input gradient tensor.
    • Gather-sum
      • A thread responsible for one input pixel calls compute_output_range to determine the small rectangular region of output pixels that influence the input's final gradient value.
      • The thread iterates through this region, and for each output pixel in the region, it re-calculates the interpolation weights to determine the exact contribution to its specific input pixel.
      • All these contributions are accumulated into a private, per-thread register variable (accscalar_t grad_sum = 0;).
        • Without any global memory access, this accumulation is extremely fast.
      • When the loops are done, the thread performs a single, direct write (non-atomic) of the final summed gradient to its designated location in global memory (idata[index] = static_cast<scalar_t>(grad_sum);).
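
Putting these pieces together, a simplified, single-channel sketch of the gather-sum kernel is shown below. It reuses the compute_output_range helper sketched earlier; batch/channel handling and the exact source-index clamping of the real upsample_bilinear2d_backward_out_frame kernel are simplified, so treat this as an illustration of the structure rather than the merged code:

```cuda
// Illustrative gather-sum kernel: one thread per input-gradient element,
// accumulation in a register, and a single non-atomic write at the end.
template <typename scalar_t, typename accscalar_t>
__global__ void upsample_bilinear2d_backward_gather(
    const scalar_t* __restrict__ odata,  // output gradient, height2 x width2
    scalar_t* __restrict__ idata,        // input gradient,  height1 x width1
    int height1, int width1, int height2, int width2,
    float rheight, float rwidth) {       // scales: input_size / output_size
  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= height1 * width1) return;
  const int h1 = index / width1;
  const int w1 = index % width1;

  // Small window of output pixels that can contribute to input pixel (h1, w1).
  int h2_begin, h2_end, w2_begin, w2_end;
  compute_output_range(h1, rheight, height2, h2_begin, h2_end);
  compute_output_range(w1, rwidth, width2, w2_begin, w2_end);

  accscalar_t grad_sum = 0;  // per-thread register accumulation, no global traffic
  for (int h2 = h2_begin; h2 <= h2_end; ++h2) {
    // Re-derive the forward interpolation weights for this output row.
    const float h1r = fmaxf(0.f, (h2 + 0.5f) * rheight - 0.5f);
    const int h1p = (int)h1r;
    const float h1lambda = h1r - h1p;
    const float wh = (h1p == h1)     ? (1.f - h1lambda)
                   : (h1p + 1 == h1) ? h1lambda
                                     : 0.f;  // h1 outside this row's support
    for (int w2 = w2_begin; w2 <= w2_end; ++w2) {
      const float w1r = fmaxf(0.f, (w2 + 0.5f) * rwidth - 0.5f);
      const int w1p = (int)w1r;
      const float w1lambda = w1r - w1p;
      const float ww = (w1p == w1)     ? (1.f - w1lambda)
                     : (w1p + 1 == w1) ? w1lambda
                                       : 0.f;
      grad_sum += wh * ww * static_cast<accscalar_t>(odata[h2 * width2 + w2]);
    }
  }
  // Single, coalesced, non-atomic write of the final summed gradient.
  idata[index] = static_cast<scalar_t>(grad_sum);
}
```

A launch with one thread per input element (e.g. 256-thread blocks covering height1 * width1 elements) then replaces the per-output-element launch in upsample_bilinear2d_backward_out_cuda_template, which is exactly the inversion of the parallelization strategy described above.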

Why performance gets boosted

  • Analysis of the root cause of the performance drop
  • First and foremost, elimination of the contention caused by atomic operations
    • Many parallel threads frequently called atomicAdd, attempting to update the exact same memory location in the input gradient tensor at the same time.
      • The GPU's memory controller has to serialize these operations, effectively nullifying the benefit of parallel execution at those contention points.
    • The chiplet-based CDNA 3 architecture of MI300X/MI325X amplified the issue.
      • When contending threads reside on different XCDs, resolving the atomic operation requires high-latency coherence traffic across the Infinity Fabric interconnect.
    • The implementation change eliminates the hardware-level serialization and cross-chiplet coherence traffic caused by the many atomicAdd operations.
  • Improved memory access pattern and locality
    • Write coalescing
      • The plain writes of the accumulated sum (idata[index] = static_cast<scalar_t>(grad_sum);) can be perfectly coalesced by the GPU.
    • Read locality
      • Even though there are many (potentially repeated) reads from the output tensor (static_cast<accscalar_t>(odata[output_idx])), these are highly cache-friendly, meaning the data for one thread is likely to be in the L1 or L2 cache already due to an access from a neighboring thread.
  • Trade-off: computation for memory synchronization
    • The recalculation of interpolation weights maps well onto modern, high-compute-throughput GPUs like the MI300X/MI325X.
    • Removal of atomic operations avoids expensive memory synchronization.

Optimizations of grid_sampler_2d_backward will be addressed in a separate PR.
Doc for reference: (internal only) https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1162750701/PyTorch__grid_sampler_2d_backward

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot

pytorch-bot bot commented Oct 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164572

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c9eb5ce with merge base 60ac039:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: cuda release notes category label Oct 3, 2025
@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 3, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@pruthvistony pruthvistony added rocm This tag is for PRs from ROCm team ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/rocm-mi355 Trigger "default" config CI on ROCm MI355 runners and removed release notes: cuda release notes category labels Oct 3, 2025
@pytorch-bot

pytorch-bot bot commented Oct 3, 2025

Unknown label ciflow/rocm-mi355.
Currently recognized labels are

  • ciflow/b200
  • ciflow/b200-symm-mem
  • ciflow/binaries
  • ciflow/binaries_libtorch
  • ciflow/binaries_wheel
  • ciflow/h100
  • ciflow/h100-cutlass-backend
  • ciflow/h100-distributed
  • ciflow/h100-symm-mem
  • ciflow/inductor
  • ciflow/inductor-cu126
  • ciflow/inductor-micro-benchmark
  • ciflow/inductor-micro-benchmark-cpu-x86
  • ciflow/inductor-perf-compare
  • ciflow/inductor-perf-test-nightly-rocm
  • ciflow/inductor-perf-test-nightly-x86-zen
  • ciflow/inductor-periodic
  • ciflow/inductor-rocm
  • ciflow/linux-aarch64
  • ciflow/mps
  • ciflow/nightly
  • ciflow/op-benchmark
  • ciflow/periodic
  • ciflow/periodic-rocm-mi300
  • ciflow/pull
  • ciflow/quantization-periodic
  • ciflow/riscv64
  • ciflow/rocm
  • ciflow/rocm-mi300
  • ciflow/s390
  • ciflow/slow
  • ciflow/torchbench
  • ciflow/triton_binaries
  • ciflow/trunk
  • ciflow/unstable
  • ciflow/vllm
  • ciflow/win-arm64
  • ciflow/xpu

@pytorch-bot pytorch-bot bot added release notes: cuda release notes category and removed ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/rocm-mi355 Trigger "default" config CI on ROCm MI355 runners labels Oct 3, 2025
@glen-amd
Contributor Author

glen-amd commented Oct 8, 2025

@jeffdaily / @jerrymannil / @amd-hhashemi - please review. Thanks.

@jeffdaily jeffdaily changed the title Changed the implementation from an output-based approach to an input-… [ROCm] new implementation of upsample_bilinear2d_backward Oct 10, 2025
@pytorch-bot pytorch-bot bot added ciflow/rocm Trigger "default" config CI on ROCm module: rocm AMD GPU support for Pytorch labels Oct 10, 2025
@jeffdaily jeffdaily added release notes: rocm mandatory label ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 and removed release notes: cuda release notes category labels Oct 10, 2025
@pytorch-bot pytorch-bot bot removed ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Oct 10, 2025
@glen-amd
Contributor Author

@jeffdaily - can you please review and add CI tags? Thanks.

@jeffdaily jeffdaily added ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 keep-going Don't stop on first failure, keep running tests until the end labels Oct 13, 2025
@jeffdaily
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 24, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

Contributor

@amd-hhashemi amd-hhashemi left a comment


Can't this zeroing be removed now that you're not using atomics?

@jeffdaily
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix_to_up_and_grid_sample_backward onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_to_up_and_grid_sample_backward && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the fix_to_up_and_grid_sample_backward branch from 3503eac to c9eb5ce on October 24, 2025 at 21:55
@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Oct 24, 2025
@glen-amd
Contributor Author

Can't this zeroing be removed now that you're not using atomics?

Good call.
To avoid pushing more changes here, I will address this in the grid_sampler optimization PR (#165337).

@jeffdaily
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 24, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

rocm-repo-management-api bot pushed a commit to ROCm/pytorch that referenced this pull request Oct 28, 2025
rocm-repo-management-api bot pushed a commit to ROCm/pytorch that referenced this pull request Oct 28, 2025
rocm-repo-management-api bot pushed a commit to ROCm/pytorch that referenced this pull request Oct 28, 2025
jeffdaily pushed a commit to ROCm/pytorch that referenced this pull request Nov 17, 2025

Labels

  • ci-no-td (Do not run TD on this PR)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • keep-going (Don't stop on first failure, keep running tests until the end)
  • Merged
  • module: rocm (AMD GPU support for Pytorch)
  • open source
  • release notes: rocm (mandatory label)
  • Reverted
  • rocm (This tag is for PRs from ROCm team)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
