[inductor] more aggressive mix order reduction #166382
shunting314 wants to merge 7 commits into gh/shunting314/250/base from
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/166382
✅ No failures as of commit 2e83789 with merge base b2a0f90.
Checked the models failing the accuracy test. For deit_base_distilled_patch16_224, I can repro on my H100, but the fused kernel looks correct: https://gist.github.com/shunting314/428f36ad11c7da9731113159f24c3bb2 (unfused kernels for reference: https://gist.github.com/shunting314/5566f845acf85676bf6606cec94cb4f8, https://gist.github.com/shunting314/3618b566b2b76a08845565f5e2d43157). beit_base_patch16_224 is trickier: I cannot repro on the H100 dev server, while I can repro on the A10G used to run CI jobs. The issue is reproed even with … These small differences should be acceptable; maybe the small differences get amplified by the later operators.
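For reference, a minimal way to sanity-check a fused kernel's numerics against eager looks roughly like the sketch below. The module, shapes, and tolerances are illustrative (the tolerance mirrors the `--tolerance=1e-2` used by the benchmark script later in this stack), not the actual CI accuracy harness.

```python
# Hedged sketch: compare compiled vs. eager gradients for a norm-like layer.
# The module, shapes, and tolerances here are illustrative, not the CI harness.
import torch

torch.manual_seed(0)
layer = torch.nn.LayerNorm(768).cuda()
x_eager = torch.randn(256, 768, device="cuda", requires_grad=True)
x_comp = x_eager.detach().clone().requires_grad_(True)

layer(x_eager).sum().backward()            # eager reference

compiled_layer = torch.compile(layer)
compiled_layer(x_comp).sum().backward()    # inductor-generated kernels

# Small elementwise differences are expected from the different reduction
# order in a fused kernel; they should stay within a loose tolerance.
torch.testing.assert_close(x_eager.grad, x_comp.grad, rtol=1e-2, atol=1e-2)
```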
Starting merge as part of PR stack under #166697
…eduction heuristics (#166461)
Split size is critical for mix-order reduction perf, while the split size picked by the split-reduction heuristics can be very bad for mix-order reduction.
[Chart: mix-order reduction perf vs. split size — https://github.com/user-attachments/assets/7faa11ad-3a7a-4b29-90ed-e85fc01077ea]
For the first shape in the chart, split reduction picks a split size around 2000 and gets poor perf. It is important to let mix-order reduction decide the split size itself (ss_8 in the chart means split-size == 8); see the sketch after this commit message for what the split size controls.
Pull Request resolved: #166461
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #166053, #166382
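As context (this is not code from the PR), a two-stage split reduction over the reduction dimension looks roughly like the sketch below; the split size controls how many partial results the first stage produces, which is the knob the chart above is sweeping.

```python
# Hedged sketch of a two-stage split reduction; shapes and names are illustrative.
import torch

def split_sum(x: torch.Tensor, split_size: int) -> torch.Tensor:
    """Sum over the last (reduction) dimension in two stages.

    Stage 1 reduces each chunk of `split_size` elements to a partial sum;
    stage 2 reduces the r // split_size partial sums. A larger split_size
    means fewer, larger partials (less parallelism over the reduction dim);
    a smaller split_size means more partials to combine in stage 2.
    """
    m, r = x.shape
    assert r % split_size == 0, "illustration only; real code handles the tail"
    partial = x.view(m, r // split_size, split_size).sum(dim=-1)  # stage 1
    return partial.sum(dim=-1)                                    # stage 2

x = torch.randn(8, 4096)
assert torch.allclose(split_sum(x, 8), x.sum(dim=-1), atol=1e-3)
```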
Pull Request resolved: #166585
Approved by: https://github.com/jansel, https://github.com/PaulZhang12
ghstack dependencies: #166053, #166382, #166461
…ipts (#166697)
It's nice to add a curve with customized compilation options so that we can compare the perf improvement of new features side by side. E.g. for mix-order reduction, by running the following command

```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```

I get the following output:

```
Geomean speedup for benchmark RMSNormBackward
eager 11 data points
compiled 11 data points, 15.82x speedup
quack 11 data points, 15.45x speedup
liger 11 data points, 14.06x speedup
compiled-no-fusion 11 data points, 10.26x speedup
```

The output shows that the feature improves perf by `15.82 / 10.26 = 1.54x` on average across the shapes tested. (I removed the (32768, 32768) shape, whose rnumel is too large to be representative.) The new curve also shows up in the figure:
[Figure: RMSNormBackward_bench — https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492]
Pull Request resolved: #166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675
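For a quick A/B outside the benchmark script, the same toggle can presumably be passed as an inductor option to `torch.compile`. Only the option name `triton.mix_order_reduction` comes from the benchmark command above; the rest of the sketch (the RMSNorm definition, shapes) is illustrative.

```python
# Hedged sketch: A/B the fusion on an RMSNorm-style backward. Only the option
# name comes from the benchmark command above; everything else is illustrative.
import torch

def rmsnorm(x, w, eps: float = 1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * w

x = torch.randn(4096, 2048, device="cuda", requires_grad=True)
w = torch.randn(2048, device="cuda", requires_grad=True)

# Default compile (mix-order reduction enabled where applicable).
torch.compile(rmsnorm)(x, w).sum().backward()

# Recompile with the fusion disabled for comparison; reset dynamo first so
# cached artifacts from the previous compile are not reused.
torch._dynamo.reset()
no_fusion = torch.compile(rmsnorm, options={"triton.mix_order_reduction": False})
no_fusion(x, w).sum().backward()
```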
Stack from ghstack (oldest at bottom):
More aggressive mix-order reductions, so that when rnumel is larger than 1024 we can still generate the fused kernel. Also use more warps in that case.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela
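For context, the workload this targets (as exercised by the RMSNormBackward benchmark in this stack) combines two reductions with different loop orders in one backward: dweight reduces over the batch dimension while dx reduces over the hidden (rnumel) dimension. A plain-eager sketch of those two reductions is below; the shapes are illustrative and the gradient formulas are just the standard RMSNorm backward, not inductor's generated code.

```python
# Hedged sketch of the two differently-ordered reductions in an RMSNorm backward.
# Shapes are illustrative; inductor generates the actual fused Triton kernel.
import torch

B, H = 2048, 4096          # rnumel here is H, the per-row reduction size
x = torch.randn(B, H)
w = torch.randn(H)
dy = torch.randn(B, H)

rstd = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)
xhat = x * rstd

# Reduction #1: over the batch dimension (one value per hidden column).
dw = (dy * xhat).sum(dim=0)

# Reduction #2: over the hidden dimension (one value per row).
c = (dy * w * xhat).mean(dim=-1, keepdim=True)
dx = (dy * w - xhat * c) * rstd

# Without the fusion these two reductions land in separate kernels because they
# reduce along different axes; per the description above, this PR keeps
# generating the fused kernel even when the per-row reduction size (rnumel)
# exceeds 1024, using more warps in that case.
```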