[inductor] Make mix-order-reduction split size not depends on split-reduction heuristics by shunting314 · Pull Request #166461 · pytorch/pytorch

shunting314 · 2025-10-28T22:44:56Z

Stack from ghstack (oldest at bottom):

Split size is critical for mix-order-reduction perf, but the one picked by the split-reduction heuristics can be very bad for mix-order reduction.

For the first shape in the chart, split reduction picks a split size of around 2000, which results in poor perf. It is important to let mix-order reduction decide the split size itself. (ss_8 in the chart means split-size == 8.)
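
To make the role of the split size concrete, here is a minimal sketch of a generic two-stage (split) column sum in plain Python. It is not Inductor's implementation. It assumes that "split size" means the number of rows folded into each partial result, which the PR does not spell out, and the function name `split_column_sum` and the 4096 x 4 input are hypothetical.

```python
def split_column_sum(rows, split_size):
    """Two-stage column sum over a non-empty list of equal-length rows.

    Stage 1 folds every `split_size` consecutive rows into one partial sum.
    Stage 2 sums the partials. Assumption: split_size counts rows per chunk.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    num_cols = len(rows[0])
    partials = []
    for start in range(0, len(rows), split_size):
        acc = [0.0] * num_cols
        for row in rows[start:start + split_size]:
            for j, value in enumerate(row):
                acc[j] += value
        partials.append(acc)
    # Stage 2: reduce the partial results into the final column sums.
    total = [sum(p[j] for p in partials) for j in range(num_cols)]
    return partials, total


# Hypothetical input: 4096 rows, 4 columns.
rows = [[float(i + j) for j in range(4)] for i in range(4096)]
for split_size in (8, 2048):
    partials, total = split_column_sum(rows, split_size)
    print(f"split_size={split_size}: {len(partials)} partial results")
```

With this input, a split size of 8 yields 512 partial results and a split size of 2048 yields 2, while the final sums are identical. The sketch only shows how the split size changes the shape of the intermediate work; it does not model GPU performance.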

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

…lit-reduction heuristics [ghstack-poisoned]

pytorch-bot · 2025-10-28T22:45:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166461

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[ROCm][CI] Machines under the label linux.rocm.gpu.2, label linux.rocm.gpu.4, linux.rocm.gpu.gfx1100 are undergoing maintenance.

✅ No Failures

As of commit 349c998 with merge base b2a0f90:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…lit-reduction heuristics ghstack-source-id: e469fc5 Pull Request resolved: #166461

…pends on split-reduction heuristics" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]

…eduction heuristics ghstack-source-id: a1765bd Pull Request resolved: #166461

… on split-reduction heuristics"

Split size is critical for mix-order-reduction perf, but the one picked by the split-reduction heuristics can be very bad for mix-order reduction.

<img width="1197" height="596" alt="Screenshot 2025-10-27 at 11 17 16 PM" src="https://github.com/user-attachments/assets/7faa11ad-3a7a-4b29-90ed-e85fc01077ea" />

For the first shape in the chart, split reduction picks a split size of around 2000, which results in poor perf. It is important to let mix-order reduction decide the split size itself. (ss_8 in the chart means split-size == 8.)

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]

…eduction heuristics ghstack-source-id: b10f51c Pull Request resolved: #166461

…eduction heuristics ghstack-source-id: e921e67 Pull Request resolved: #166461

pytorchmergebot · 2025-11-01T06:04:36Z

Starting merge as part of PR stack under #166697

pytorchmergebot · 2025-11-01T22:03:39Z

Starting merge as part of PR stack under #166697

Pull Request resolved: #166585 Approved by: https://github.com/jansel, https://github.com/PaulZhang12 ghstack dependencies: #166053, #166382, #166461

Pull Request resolved: #166675 Approved by: https://github.com/BoyuanFeng ghstack dependencies: #166053, #166382, #166461, #166585

…ipts (#166697)

It's nice to add a curve with customized compilation options so that we can compare the perf improvement of new features side by side. E.g. for mix-order-reduction, by running the following command

```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```

I get the following output:

```
Geomean speedup for benchmark RMSNormBackward
eager 11 data points
compiled 11 data points, 15.82x speedup
quack 11 data points, 15.45x speedup
liger 11 data points, 14.06x speedup
compiled-no-fusion 11 data points, 10.26x speedup
```

The output shows that the feature improves perf by `15.82 / 10.26 = 1.54x` on average across the shapes tested. (I removed the shape (32768, 32768), whose rnumel is too large and not representative.)

The new curve also shows up in the figure:

<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492" />

Pull Request resolved: #166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675
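
The `15.82 / 10.26` arithmetic can be checked with a short sketch. It assumes the reported speedups are geometric means over the same set of shapes, as the "Geomean speedup" header suggests; in that case the ratio of two geomeans equals the geomean of the per-shape ratios. The `geomean` helper and the per-shape numbers are hypothetical and are not taken from the benchmark script; only the four reported speedups come from the output above.

```python
import math


def geomean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Geomean speedups over eager, copied from the benchmark output above.
reported = {
    "compiled": 15.82,
    "quack": 15.45,
    "liger": 14.06,
    "compiled-no-fusion": 10.26,
}
gain = reported["compiled"] / reported["compiled-no-fusion"]
print(f"mix-order-reduction gain: {gain:.2f}x")

# Hypothetical per-shape speedups, used only to illustrate the identity
# geomean(a) / geomean(b) == geomean(a_i / b_i) for matching shapes.
fused = [12.0, 18.0, 20.0]
unfused = [8.0, 12.0, 10.0]
per_shape_ratio = [a / b for a, b in zip(fused, unfused)]
print(math.isclose(geomean(per_shape_ratio), geomean(fused) / geomean(unfused)))
```

Because of that identity, dividing the two reported geomeans gives the average per-shape gain directly, provided both curves cover the same 11 data points.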

…eduction heuristics (pytorch#166461)

Split size is critical for mix-order-reduction perf, but the one picked by the split-reduction heuristics can be very bad for mix-order reduction.

<img width="1197" height="596" alt="Screenshot 2025-10-27 at 11 17 16 PM" src="https://github.com/user-attachments/assets/7faa11ad-3a7a-4b29-90ed-e85fc01077ea" />

For the first shape in the chart, split reduction picks a split size of around 2000, which results in poor perf. It is important to let mix-order reduction decide the split size itself. (ss_8 in the chart means split-size == 8.)

Pull Request resolved: pytorch#166461
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: pytorch#166053, pytorch#166382

…eduction heuristics ghstack-source-id: 57d7f0e Pull Request resolved: pytorch/pytorch#166461

[wip][inductor] Make mix-order-reduction split size not depends on sp…

0a437d4

…lit-reduction heuristics [ghstack-poisoned]

This was referenced Oct 28, 2025

[inductor] track reduction before splitting #166053

Closed

[inductor] more aggressive mix order reduction #166382

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels Oct 28, 2025

shunting314 added a commit that referenced this pull request Oct 28, 2025

[wip][inductor] Make mix-order-reduction split size not depends on sp…

ba8b7a9

…lit-reduction heuristics ghstack-source-id: e469fc5 Pull Request resolved: #166461

shunting314 added a commit that referenced this pull request Oct 29, 2025

[inductor] Make mix-order-reduction split size not depends on split-r…

be88092

…eduction heuristics ghstack-source-id: a1765bd Pull Request resolved: #166461

shunting314 changed the title ~~[wip][inductor] Make mix-order-reduction split size not depends on split-reduction heuristics~~ [inductor] Make mix-order-reduction split size not depends on split-reduction heuristics Oct 29, 2025

shunting314 added a commit that referenced this pull request Oct 29, 2025

[inductor] Make mix-order-reduction split size not depends on split-r…

de9cd6f

…eduction heuristics ghstack-source-id: b10f51c Pull Request resolved: #166461

shunting314 added the topic: not user facing topic category label Oct 29, 2025

shunting314 added a commit that referenced this pull request Oct 29, 2025

[inductor] Make mix-order-reduction split size not depends on split-r…

f1865e0

…eduction heuristics ghstack-source-id: e921e67 Pull Request resolved: #166461

shunting314 mentioned this pull request Oct 29, 2025

[Inductor] mix order reduction heuristics and tuning #166585

Closed

shunting314 added 2 commits October 29, 2025 18:09

shunting314 requested review from eellison, jansel and v0i0 October 30, 2025 17:29

jansel approved these changes Oct 30, 2025

This was referenced Oct 30, 2025

[inductor] coordesc not tune XBLOCK for mix-order-reduction #166669

Closed

report geomean for norm bwd benchmarking #166675

Closed

v0i0 approved these changes Oct 31, 2025

shunting314 mentioned this pull request Oct 31, 2025

add a curve for customized compilation in the kernel benchmarking scripts #166697

Closed

pytorchmergebot closed this in 04d6a6f Nov 1, 2025

pytorchmergebot added the Merged label Nov 1, 2025

pytorchmergebot pushed a commit that referenced this pull request Nov 1, 2025

report geomean for norm bwd benchmarking (#166675)

a19e92d

Pull Request resolved: #166675 Approved by: https://github.com/BoyuanFeng ghstack dependencies: #166053, #166382, #166461, #166585

Khanaksahu pushed a commit to Khanaksahu/pytorch that referenced this pull request Nov 17, 2025

[inductor] Make mix-order-reduction split size not depends on split-r…

9c617c1

…eduction heuristics ghstack-source-id: 57d7f0e Pull Request resolved: pytorch/pytorch#166461

github-actions bot deleted the gh/shunting314/251/head branch December 2, 2025 02:17

shunting314 wants to merge 7 commits into gh/shunting314/251/base from gh/shunting314/251/head


4 participants
