[ROCm][inductor] autotune support for persistent reduction kernels #163908

Merged
naromero77amd wants to merge 9 commits into pytorch:main from ROCm:rocm_persistent_reduction_autotune
Conversation

@naromero77amd (Collaborator) commented Sep 25, 2025

After the removal of `want_no_x_dim` for persistent reduction kernels, we can improve the autotuning setup for these kernels.

Currently, even with tuning enabled, config filtering leaves only a single candidate to try in many cases. This PR avoids that filtering in autotune mode and overrides the MAX_BLOCK limit. It also always includes tiny_config when autotuning is enabled.
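
A minimal sketch of the config-selection change described above. All names here (`KernelConfig`, `persistent_reduction_configs`, the block-limit constants) are illustrative stand-ins, not the actual Inductor internals in `torch/_inductor/runtime/triton_heuristics.py`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KernelConfig:
    x_block: int
    num_warps: int

DEFAULT_MAX_BLOCK = 2048   # placeholder for the usual MAX_BLOCK cap
AUTOTUNE_MAX_BLOCK = 8192  # placeholder for the relaxed cap in autotune mode

def tiny_config() -> KernelConfig:
    # Smallest launch shape; per this PR it is always part of the search space.
    return KernelConfig(x_block=1, num_warps=1)

def persistent_reduction_configs(
    configs: list[KernelConfig], max_autotune: bool
) -> list[KernelConfig]:
    if not max_autotune:
        # Default path: heuristic filtering often leaves a single candidate.
        kept = [c for c in configs if c.x_block <= DEFAULT_MAX_BLOCK]
        return kept[:1]
    # Autotune path: skip the filtering, raise the block cap, and always
    # seed the candidate set with the tiny config.
    candidates = [c for c in configs if c.x_block <= AUTOTUNE_MAX_BLOCK]
    if tiny_config() not in candidates:
        candidates.append(tiny_config())
    return candidates

if __name__ == "__main__":
    pool = [KernelConfig(2**i, 4) for i in range(8, 14)]      # 256..8192
    print(persistent_reduction_configs(pool, max_autotune=False))  # one config
    print(persistent_reduction_configs(pool, max_autotune=True))   # full set + tiny
```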

Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023

cc @mlazos @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot (bot) commented Sep 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163908

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 90f2fe2 with merge base 382b015:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the module: inductor and module: rocm labels Sep 25, 2025
@naromero77amd marked this pull request as draft September 25, 2025 23:43
@naromero77amd (Collaborator, Author) commented Sep 25, 2025

This is a redo of PR #162056, but this time the branch was created in the ROCm repo.

@jeffdaily added the ciflow/inductor and ciflow/inductor-rocm labels Sep 26, 2025
@naromero77amd force-pushed the rocm_persistent_reduction_autotune branch from f894ae7 to e549d1c September 26, 2025 23:00
@pytorch-bot removed the ciflow/inductor and ciflow/inductor-rocm labels Sep 26, 2025
@naromero77amd changed the title from [ROCm][inductor] Autotune support for persistent reduction kernels to [ROCm][inductor] autotune support for persistent reduction kernels Oct 2, 2025
@naromero77amd marked this pull request as ready for review October 10, 2025 01:49
@naromero77amd marked this pull request as draft October 10, 2025 01:50
@naromero77amd (Collaborator, Author) commented:

We demonstrate that the updated heuristics are generally applicable to other models such as the HuggingFace (HF) Inductor Dashboard benchmark suite.

Here is a high-level outline of the steps that were taken. We ran the HF benchmark suite from Inductor Dashboard using the command below:

python benchmarks/dynamo/runner.py --dtypes amp --suites huggingface --training --compilers inductor_no_cudagraphs --no-gh-comment --output-dir $outdir

Then we extracted the relevant kernels and benchmarked them with and without the PR (the baseline also used TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1). The result is a geomean speed-up of 1.08x over more than 400 kernels.

kernel_comp_per_pr.csv
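
For reference, a small sketch of how such a geomean speed-up could be computed from per-kernel timings. The column names (`baseline_us`, `pr_us`) are assumptions; the actual layout of kernel_comp_per_pr.csv is not shown here:

```python
import csv
import math

def geomean_speedup(path: str) -> float:
    # Geometric mean of per-kernel speed-ups: exp(mean(log(baseline / pr))).
    log_sum, n = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            log_sum += math.log(float(row["baseline_us"]) / float(row["pr_us"]))
            n += 1
    return math.exp(log_sum / n)

print(f"geomean speed-up: {geomean_speedup('kernel_comp_per_pr.csv'):.2f}x")
```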

@naromero77amd naromero77amd marked this pull request as ready for review October 16, 2025 00:58
@jeffdaily added the release notes: rocm, mandatorylabel, ciflow/inductor, ciflow/rocm, and ciflow/inductor-rocm labels Oct 17, 2025
@pytorch-bot removed the ciflow/inductor, ciflow/rocm, and ciflow/inductor-rocm labels Oct 17, 2025
@jeffdaily added the ciflow/inductor, ciflow/rocm, and ciflow/inductor-rocm labels Oct 17, 2025
@naromero77amd (Collaborator, Author) commented:

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label Oct 18, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@naromero77amd deleted the rocm_persistent_reduction_autotune branch October 18, 2025 16:20
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
[ROCm][inductor] autotune support for persistent reduction kernels (pytorch#163908)

After the removal of `want_no_x_dim` for persistent reduction kernels, we can improve the autotuning setup for these kernels.

Currently, even with tuning enabled, config filtering leaves only a single candidate to try in many cases. This PR avoids that filtering in autotune mode and overrides the MAX_BLOCK limit. It also always includes tiny_config when autotuning is enabled.

Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023

Pull Request resolved: pytorch#163908
Approved by: https://github.com/jansel, https://github.com/PaulZhang12
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
[ROCm][inductor] autotune support for persistent reduction kernels (pytorch#163908)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Nov 26, 2025
…ports (#2807)

These are backports of the following upstream PRs; cherry-picks were performed where possible.

pytorch#163908 (persistent reduction autotune)
pytorch#161280 (reduction)
pytorch#162053 (foreach)
pytorch#163197 (pointwise)
pytorch#166470 (pointwise config for atomic add)

Also included are some additional customer-specific configs which were not upstreamed but are part of this backport to 2.9: #2723

Did not backport filter functions such as `_maybe_filter_configs_for_tma_restrictions`:

https://github.com/ROCm/pytorch/blob/release/2.9/torch/_inductor/runtime/triton_heuristics.py#L2614

---------

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Sampsa Riikonen <sriikone@amd.com>
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com>

Labels

ciflow/inductor, ciflow/inductor-rocm, ciflow/rocm, ciflow/trunk, Merged, module: inductor, module: rocm, open source, release notes: inductor, release notes: rocm, mandatorylabel
