[inductor] Make mix-order-reduction split size not depends on split-reduction heuristics#166461

Closed
shunting314 wants to merge 7 commits into gh/shunting314/251/base from gh/shunting314/251/head

@shunting314 shunting314 commented Oct 28, 2025

pytorch-bot bot commented Oct 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166461

✅ No Failures

As of commit 349c998 with merge base b2a0f90 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

shunting314 added a commit that referenced this pull request Oct 28, 2025
…lit-reduction heuristics

ghstack-source-id: e469fc5
Pull Request resolved: #166461
…pends on split-reduction heuristics"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Oct 29, 2025
…eduction heuristics

ghstack-source-id: a1765bd
Pull Request resolved: #166461
@shunting314 shunting314 changed the title [wip][inductor] Make mix-order-reduction split size not depends on split-reduction heuristics [inductor] Make mix-order-reduction split size not depends on split-reduction heuristics Oct 29, 2025
… on split-reduction heuristics"


Split size is critical for mix-order reduction perf, but the one picked by the split-reduction heuristics can be very bad for mix-order reduction.

<img width="1197" height="596" alt="Screenshot 2025-10-27 at 11 17 16 PM" src="https://github.com/user-attachments/assets/7faa11ad-3a7a-4b29-90ed-e85fc01077ea" />

For the first shape in the chart, split reduction picks a split size of around 2000, which results in poor perf. It is important to let mix-order reduction decide the split size itself. (ss_8 in the chart means split-size == 8.)
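
To make the role of the split size concrete, here is a minimal pure-Python sketch. It is not Inductor's implementation: it assumes "split size" means the number of rows each partial reduction handles, and the helper `column_sum_split` and the toy input are hypothetical.

```python
# Illustrative sketch only, NOT Inductor's implementation.
# Assumption: "split size" = number of rows handled by each partial reduction.

def column_sum_split(rows, split_size):
    """Sum a 2D list over its rows in two stages (hypothetical helper)."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    ncols = len(rows[0]) if rows else 0
    # Stage 1: one partial column sum per chunk of `split_size` rows.
    partials = []
    for start in range(0, len(rows), split_size):
        chunk = rows[start:start + split_size]
        partials.append([sum(col) for col in zip(*chunk)])
    # Stage 2: reduce the partial sums into the final result.
    total = [sum(col) for col in zip(*partials)] if partials else [0] * ncols
    return total, len(partials)


# Toy 32x4 input, made up for illustration.
rows = [[r * 4 + c for c in range(4)] for r in range(32)]
for ss in (8, 2000):
    total, nsplits = column_sum_split(rows, ss)
    print(f"split_size={ss}: {nsplits} partial reductions, result={total}")
```

In this sketch the split size changes only how many partial results exist (4 with split size 8, 1 with split size 2000), not the final sum. Under the stated assumption, that is the knob that trades parallelism against second-stage work.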

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Oct 29, 2025
…eduction heuristics

ghstack-source-id: b10f51c
Pull Request resolved: #166461
@shunting314 shunting314 added the topic: not user facing topic category label Oct 29, 2025
shunting314 added a commit that referenced this pull request Oct 29, 2025
…eduction heuristics

ghstack-source-id: e921e67
Pull Request resolved: #166461
@pytorchmergebot

Starting merge as part of PR stack under #166697

pytorchmergebot pushed a commit that referenced this pull request Nov 1, 2025
…ipts (#166697)

It's nice to add a curve with customized compilation options so that we can compare the perf improvement of new features side by side.

E.g., for mix-order reduction, I run the following command:
```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```

I get the following output:
```
Geomean speedup for benchmark RMSNormBackward
  eager 11 data points
  compiled 11 data points, 15.82x speedup
  quack 11 data points, 15.45x speedup
  liger 11 data points, 14.06x speedup
  compiled-no-fusion 11 data points, 10.26x speedup
```

The output shows that the feature improves perf by `15.82 / 10.26 = 1.54x` on average (geomean) across all the shapes tested. (I removed the shape (32768, 32768), whose rnumel is too large and not representative.)
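
As a quick check of the arithmetic, the sketch below recomputes the ratio from the two geomean speedups in the output above. It also shows why dividing two geomeans is meaningful: the result equals the geomean of the per-shape ratios. The per-shape latencies are hypothetical numbers made up for illustration, not benchmark data.

```python
import math
from statistics import geometric_mean

# Geomean speedups over eager, copied from the output above.
compiled = 15.82
compiled_no_fusion = 10.26
print(f"{compiled / compiled_no_fusion:.2f}x")

# Hypothetical per-shape latencies (ms), made up for illustration only.
eager = [8.0, 12.0, 20.0]
fused = [0.5, 0.8, 1.25]
unfused = [0.8, 1.2, 2.0]

# Ratio of the two geomean speedups over eager ...
ratio_of_geomeans = (
    geometric_mean([e / f for e, f in zip(eager, fused)])
    / geometric_mean([e / u for e, u in zip(eager, unfused)])
)
# ... equals the geomean of the per-shape fused-vs-unfused speedups.
geomean_of_ratios = geometric_mean([u / f for u, f in zip(unfused, fused)])
assert math.isclose(ratio_of_geomeans, geomean_of_ratios)
```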

The new curve also shows up in the figure:
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492" />
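
For reference, the value given to `--custom-compile-options` in the command above is a JSON object. The sketch below only shows how that string maps to a Python dict. How the benchmark script consumes it is not shown in this thread, so forwarding the dict to `torch.compile(..., options=...)` is an assumption and appears only as a comment.

```python
import json

# The value passed to --custom-compile-options in the command above.
raw = '{"triton.mix_order_reduction":false}'
options = json.loads(raw)  # JSON `false` becomes Python `False`
print(options)

# Assumption (not run here, needs a PyTorch build with this config key):
#   compiled_fn = torch.compile(fn, options=options)
```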

Pull Request resolved: #166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675
etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025
…eduction heuristics (pytorch#166461)

Pull Request resolved: pytorch#166461
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: pytorch#166053, pytorch#166382
etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025
…ipts (pytorch#166697)

Pull Request resolved: pytorch#166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: pytorch#166053, pytorch#166382, pytorch#166461, pytorch#166585, pytorch#166675
Khanaksahu pushed a commit to Khanaksahu/pytorch that referenced this pull request Nov 17, 2025
…eduction heuristics

ghstack-source-id: 57d7f0e
Pull Request resolved: pytorch/pytorch#166461
@github-actions github-actions bot deleted the gh/shunting314/251/head branch December 2, 2025 02:17