
[inductor] more aggressive mix order reduction #166382

Closed

shunting314 wants to merge 7 commits into gh/shunting314/250/base from gh/shunting314/250/head

Conversation

@pytorch-bot

pytorch-bot bot commented Oct 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166382

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 2e83789 with merge base b2a0f90:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

More aggressive mix order reduction so that we can still generate the fused kernel when rnumel is larger than 1024. Also use more warps in that case.
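
Conceptually, the change amounts to something like the sketch below. Names such as `can_fuse_mix_order` and the exact warp counts are illustrative assumptions, not the actual Inductor code:

```
# Hypothetical sketch of the relaxed heuristic described above; the real
# logic lives in Inductor's mix order reduction support and differs in detail.
MAX_RBLOCK = 1024  # previous cutoff for generating the fused kernel

def can_fuse_mix_order(rnumel: int) -> bool:
    # Before: return rnumel <= MAX_RBLOCK
    # After: still generate the fused kernel for larger reduction sizes.
    return True

def pick_num_warps(rnumel: int) -> int:
    # Use more warps when the reduction dimension exceeds the old cutoff.
    return 16 if rnumel > MAX_RBLOCK else 8
```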


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

@shunting314
Contributor Author

shunting314 commented Oct 29, 2025

Checked the models failing the accuracy test.

For deit_base_distilled_patch16_224, I can repro on my H100, but the fused kernel looks correct: https://gist.github.com/shunting314/428f36ad11c7da9731113159f24c3bb2 (unfused kernel for reference: https://gist.github.com/shunting314/5566f845acf85676bf6606cec94cb4f8 , https://gist.github.com/shunting314/3618b566b2b76a08845565f5e2d43157 ). Also, using --float rather than --amp solves the issue. I'll just raise the tolerance to fix it. The same applies to vit_base_patch16_siglip_256.
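
Raising the tolerance amounts to loosening the accuracy comparison along these lines (a sketch; the rtol/atol values and the `check_accuracy` harness are placeholders, not the benchmark suite's actual ones):

```
import torch

def check_accuracy(actual: torch.Tensor, expected: torch.Tensor) -> None:
    # Hypothetical relaxed check: amp runs accumulate slightly differently
    # in the fused kernel, so allow a larger tolerance than the default.
    torch.testing.assert_close(actual, expected, rtol=1e-2, atol=1e-2)
```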

beit_base_patch16_224 is quite tricky. I cannot repro it on my H100 dev server, but I can repro it on the A10G used to run CI jobs. The issue reproduces even with --float. But if I change the wrapper, run both the fused and non-fused kernels, and compare the results, they are very close:

```
# first iteration
(out[0] - side[0]).abs().max()=tensor(2.7940e-09, device='cuda:0')
(out[1] - side[1]).abs().max()=tensor(4.6566e-10, device='cuda:0')
(out[0] - side[0]).abs().max()=tensor(8.1491e-10, device='cuda:0')
(out[1] - side[1]).abs().max()=tensor(1.1642e-10, device='cuda:0')
# second iteration
(out[0] - side[0]).abs().max()=tensor(2.3283e-09, device='cuda:0')
(out[1] - side[1]).abs().max()=tensor(3.4925e-10, device='cuda:0')
(out[0] - side[0]).abs().max()=tensor(2.3283e-09, device='cuda:0')
(out[1] - side[1]).abs().max()=tensor(8.1491e-10, device='cuda:0')
```

These small differences should be acceptable. Maybe the small differences get amplified by later operators.
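
The numbers above come from a wrapper change along these lines (a sketch; `run_fused` and `run_unfused` stand in for the generated kernels):

```
import torch

def compare_kernels(args, run_fused, run_unfused):
    # Run both the fused mix order reduction kernel and the unfused
    # fallback on cloned inputs, then print the max absolute difference
    # for each output, as in the iterations shown above.
    out = run_fused(*[a.clone() for a in args])
    side = run_unfused(*[a.clone() for a in args])
    for i, (o, s) in enumerate(zip(out, side)):
        print(f"(out[{i}] - side[{i}]).abs().max()={(o - s).abs().max()}")
```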

shunting314 added a commit that referenced this pull request Oct 29, 2025

ghstack-source-id: 798af2d
Pull Request resolved: #166382

shunting314 added a commit that referenced this pull request Oct 29, 2025

ghstack-source-id: 43ad97e
Pull Request resolved: #166382
@pytorchmergebot

Starting merge as part of PR stack under #166697

1 similar comment
pytorchmergebot pushed a commit that referenced this pull request Nov 1, 2025
…eduction heuristics (#166461)

Split size is critical for mix order reduction perf, while the one picked by the split reduction heuristics can be very bad for mix order reduction.

<img width="1197" height="596" alt="Screenshot 2025-10-27 at 11 17 16 PM" src="https://github.com/user-attachments/assets/7faa11ad-3a7a-4b29-90ed-e85fc01077ea" />

For the first shape in the chart, split reduction picks a split size around 2000, which results in poor perf. It is important to let mix-order reduction decide the split size itself (ss_8 in the chart means split-size == 8), as sketched below.
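
A rough sketch of what "deciding the split size itself" could look like, benchmarking candidate splits instead of inheriting the split-reduction choice (the candidate list and the `run_with_split` hook are assumptions, not the actual heuristic):

```
from triton.testing import do_bench  # Triton's timing helper

def pick_split_size(run_with_split, candidates=(4, 8, 16, 32, 64)):
    # Hypothetical sweep: time the mix order reduction kernel at each
    # candidate split size and keep the fastest, rather than reusing the
    # ~2000 split picked by the split reduction heuristics.
    best_split, best_ms = None, float("inf")
    for split in candidates:
        ms = do_bench(lambda: run_with_split(split))
        if ms < best_ms:
            best_split, best_ms = split, ms
    return best_split
```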

Pull Request resolved: #166461
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #166053, #166382
pytorchmergebot pushed a commit that referenced this pull request Nov 1, 2025
…ipts (#166697)

It's nice to add a curve with customized compilation options so that we can compare the perf improvement of new features side by side.

E.g., for mix-order reduction, by running the following command
```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```

I get the following output:
```
Geomean speedup for benchmark RMSNormBackward
  eager 11 data points
  compiled 11 data points, 15.82x speedup
  quack 11 data points, 15.45x speedup
  liger 11 data points, 14.06x speedup
  compiled-no-fusion 11 data points, 10.26x speedup
```

The output shows that the feature improves perf by `15.82 / 10.26 = 1.54x` on average across the shapes tested. (I removed one shape, (32768, 32768), whose rnumel is too large to be representative.)
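
For reference, the same knob should also be reachable directly through torch.compile's inductor options, mirroring --custom-compile-options above (a sketch assuming the `triton.mix_order_reduction` config key from this stack; `rmsnorm_backward` is a placeholder workload):

```
import torch

def rmsnorm_backward(x):  # placeholder for the benchmarked workload
    return x

# Compile a variant with the mix order reduction fusion disabled,
# to A/B against the default compiled curve.
compiled_no_fusion = torch.compile(
    rmsnorm_backward,
    options={"triton.mix_order_reduction": False},
)
```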

The new curve also shows up in the figure:
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492" />

Pull Request resolved: #166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675
meta-codesync bot pushed a commit to pytorch/benchmark that referenced this pull request Nov 3, 2025
Summary:
More aggressive mix order reduction so that we can still generate the fused kernel when rnumel is larger than 1024. Also use more warps in that case.

X-link: pytorch/pytorch#166382
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #166053

Reviewed By: donigian

Differential Revision: D86056478

fbshipit-source-id: 6a1561fb54d450b69b08d41c1836c0361577e8f4
etaf pushed commits to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025
Khanaksahu pushed a commit to Khanaksahu/pytorch that referenced this pull request Nov 17, 2025
@github-actions github-actions bot deleted the gh/shunting314/250/head branch December 2, 2025 02:17
