[ROCm][inductor] autotune support for persistent reduction kernels #163908

Merged
naromero77amd wants to merge 9 commits into pytorch:main from ROCm:rocm_persistent_reduction_autotune
Conversation

@naromero77amd (Collaborator) commented Sep 25, 2025

After the removal of `want_no_x_dim` for persistent reduction kernels, we can improve the autotuning setup for these kernels.

Currently, even with tuning enabled, config filtering leaves only a single candidate to try in many cases. This PR avoids that filtering in autotune mode and overrides the MAX_BLOCK limit. It also always includes tiny_config when autotuning is enabled.
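
A minimal sketch of the config-selection change described above. All names here (`KernelConfig`, `persistent_reduction_configs`, the block-limit constants) are illustrative stand-ins, not the actual Inductor internals in `torch/_inductor/runtime/triton_heuristics.py`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KernelConfig:
    x_block: int
    num_warps: int

DEFAULT_MAX_BLOCK = 2048   # placeholder for the usual MAX_BLOCK cap
AUTOTUNE_MAX_BLOCK = 8192  # placeholder for the relaxed cap in autotune mode

def tiny_config() -> KernelConfig:
    # Smallest launch shape; per this PR it is always part of the search space.
    return KernelConfig(x_block=1, num_warps=1)

def persistent_reduction_configs(
    configs: list[KernelConfig], max_autotune: bool
) -> list[KernelConfig]:
    if not max_autotune:
        # Default path: heuristic filtering often leaves a single candidate.
        kept = [c for c in configs if c.x_block <= DEFAULT_MAX_BLOCK]
        return kept[:1]
    # Autotune path: skip the filtering, raise the block cap, and always
    # seed the candidate set with the tiny config.
    candidates = [c for c in configs if c.x_block <= AUTOTUNE_MAX_BLOCK]
    if tiny_config() not in candidates:
        candidates.append(tiny_config())
    return candidates

if __name__ == "__main__":
    pool = [KernelConfig(2**i, 4) for i in range(8, 14)]      # 256..8192
    print(persistent_reduction_configs(pool, max_autotune=False))  # one config
    print(persistent_reduction_configs(pool, max_autotune=True))   # full set + tiny
```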

Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023

cc @mlazos @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot (bot) commented Sep 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163908

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 90f2fe2 with merge base 382b015:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the module: inductor and module: rocm labels Sep 25, 2025
@naromero77amd marked this pull request as draft September 25, 2025 23:43
@naromero77amd (Collaborator, Author) commented Sep 25, 2025

This is a redo of PR #162056, but this time the branch was created in the ROCm repo.

@jeffdaily added the ciflow/inductor and ciflow/inductor-rocm labels Sep 26, 2025
@naromero77amd force-pushed the rocm_persistent_reduction_autotune branch from f894ae7 to e549d1c September 26, 2025 23:00
@pytorch-bot removed the ciflow/inductor and ciflow/inductor-rocm labels Sep 26, 2025
@naromero77amd changed the title from [ROCm][inductor] Autotune support for persistent reduction kernels to [ROCm][inductor] autotune support for persistent reduction kernels Oct 2, 2025
@naromero77amd marked this pull request as ready for review October 10, 2025 01:49
@naromero77amd marked this pull request as draft October 10, 2025 01:50
@naromero77amd (Collaborator, Author) commented:

We demonstrate that the updated heuristics are generally applicable to other models such as the HuggingFace (HF) Inductor Dashboard benchmark suite.

Here is a high-level outline of the steps that were taken. We ran the HF benchmark suite from Inductor Dashboard using the command below:

python benchmarks/dynamo/runner.py --dtypes amp --suites huggingface --training --compilers inductor_no_cudagraphs --no-gh-comment --output-dir $outdir

Then we extracted the relevant kernels and benchmarked them with and without the PR (the baseline also used TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1). The result is a geomean speed-up of 1.08x over more than 400 kernels.

kernel_comp_per_pr.csv
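
For reference, a small sketch of how such a geomean speed-up could be computed from per-kernel timings. The column names (`baseline_us`, `pr_us`) are assumptions; the actual layout of kernel_comp_per_pr.csv is not shown here:

```python
import csv
import math

def geomean_speedup(path: str) -> float:
    # Geometric mean of per-kernel speed-ups: exp(mean(log(baseline / pr))).
    log_sum, n = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            log_sum += math.log(float(row["baseline_us"]) / float(row["pr_us"]))
            n += 1
    return math.exp(log_sum / n)

print(f"geomean speed-up: {geomean_speedup('kernel_comp_per_pr.csv'):.2f}x")
```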

@naromero77amd naromero77amd marked this pull request as ready for review October 16, 2025 00:58
@jeffdaily added the release notes: rocm, mandatorylabel, ciflow/inductor, ciflow/rocm, and ciflow/inductor-rocm labels Oct 17, 2025
@pytorch-bot removed the ciflow/inductor, ciflow/rocm, and ciflow/inductor-rocm labels Oct 17, 2025
@jeffdaily added the ciflow/inductor, ciflow/rocm, and ciflow/inductor-rocm labels Oct 17, 2025
@naromero77amd (Collaborator, Author) commented:

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label Oct 18, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@naromero77amd deleted the rocm_persistent_reduction_autotune branch October 18, 2025 16:20
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
[ROCm][inductor] autotune support for persistent reduction kernels (pytorch#163908)

After the removal of `want_no_x_dim` for persistent reduction kernels, we can improve the autotuning setup for these kernels.

Currently, even with tuning enabled, config filtering leaves only a single candidate to try in many cases. This PR avoids that filtering in autotune mode and overrides the MAX_BLOCK limit. It also always includes tiny_config when autotuning is enabled.

Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023

Pull Request resolved: pytorch#163908
Approved by: https://github.com/jansel, https://github.com/PaulZhang12
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
[ROCm][inductor] autotune support for persistent reduction kernels (pytorch#163908)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Nov 26, 2025
…ports (#2807)

These are backports of the following upstream PRs; cherry-picks were performed where possible.

pytorch#163908 (persistent reduction autotune)
pytorch#161280 (reduction)
pytorch#162053 (foreach)
pytorch#163197 (pointwise)
pytorch#166470 (pointwise config for atomic add)

Also included are some additional customer-specific configs which were not upstreamed but are part of this backport to 2.9: #2723

Did not backport filter functions such as `_maybe_filter_configs_for_tma_restrictions`:

https://github.com/ROCm/pytorch/blob/release/2.9/torch/_inductor/runtime/triton_heuristics.py#L2614

---------

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Sampsa Riikonen <sriikone@amd.com>
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com>

Labels

ciflow/inductor, ciflow/inductor-rocm, ciflow/rocm, ciflow/trunk, Merged, module: inductor, module: rocm, open source, release notes: inductor, release notes: rocm, mandatorylabel
