[inductor] more aggressive mix order reduction #166382
shunting314 wants to merge 7 commits into gh/shunting314/250/base from
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/166382
✅ No failures as of commit 2e83789 with merge base b2a0f90.
Checked the models failing the accuracy test. For deit_base_distilled_patch16_224, I can repro on my H100, but the fused kernel looks correct: https://gist.github.com/shunting314/428f36ad11c7da9731113159f24c3bb2 (unfused kernels for reference: https://gist.github.com/shunting314/5566f845acf85676bf6606cec94cb4f8, https://gist.github.com/shunting314/3618b566b2b76a08845565f5e2d43157). beit_base_patch16_224 is trickier: I cannot repro on the H100 dev server, while I can repro on the A10G used to run CI jobs. The issue is reproed even with … These small differences should be acceptable; maybe the small differences get amplified by the later operators.
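For reference, a minimal way to sanity-check a fused kernel's numerics against eager looks roughly like the sketch below. The module, shapes, and tolerances are illustrative (the tolerance mirrors the `--tolerance=1e-2` used by the benchmark script later in this stack), not the actual CI accuracy harness.

```python
# Hedged sketch: compare compiled vs. eager gradients for a norm-like layer.
# The module, shapes, and tolerances here are illustrative, not the CI harness.
import torch

torch.manual_seed(0)
layer = torch.nn.LayerNorm(768).cuda()
x_eager = torch.randn(256, 768, device="cuda", requires_grad=True)
x_comp = x_eager.detach().clone().requires_grad_(True)

layer(x_eager).sum().backward()            # eager reference

compiled_layer = torch.compile(layer)
compiled_layer(x_comp).sum().backward()    # inductor-generated kernels

# Small elementwise differences are expected from the different reduction
# order in a fused kernel; they should stay within a loose tolerance.
torch.testing.assert_close(x_eager.grad, x_comp.grad, rtol=1e-2, atol=1e-2)
```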
Starting merge as part of PR stack under #166697
…eduction heuristics (#166461)
Split size is critical for mix-order reduction perf, while the split size picked by the split-reduction heuristics can be very bad for mix-order reduction.
[Chart: mix-order reduction perf vs. split size — https://github.com/user-attachments/assets/7faa11ad-3a7a-4b29-90ed-e85fc01077ea]
For the first shape in the chart, split reduction picks a split size around 2000 and gets poor perf. It is important to let mix-order reduction decide the split size itself (ss_8 in the chart means split-size == 8); see the sketch after this commit message for what the split size controls.
Pull Request resolved: #166461
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #166053, #166382
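As context (this is not code from the PR), a two-stage split reduction over the reduction dimension looks roughly like the sketch below; the split size controls how many partial results the first stage produces, which is the knob the chart above is sweeping.

```python
# Hedged sketch of a two-stage split reduction; shapes and names are illustrative.
import torch

def split_sum(x: torch.Tensor, split_size: int) -> torch.Tensor:
    """Sum over the last (reduction) dimension in two stages.

    Stage 1 reduces each chunk of `split_size` elements to a partial sum;
    stage 2 reduces the r // split_size partial sums. A larger split_size
    means fewer, larger partials (less parallelism over the reduction dim);
    a smaller split_size means more partials to combine in stage 2.
    """
    m, r = x.shape
    assert r % split_size == 0, "illustration only; real code handles the tail"
    partial = x.view(m, r // split_size, split_size).sum(dim=-1)  # stage 1
    return partial.sum(dim=-1)                                    # stage 2

x = torch.randn(8, 4096)
assert torch.allclose(split_sum(x, 8), x.sum(dim=-1), atol=1e-3)
```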
Pull Request resolved: #166585
Approved by: https://github.com/jansel, https://github.com/PaulZhang12
ghstack dependencies: #166053, #166382, #166461
…ipts (#166697)
It's nice to add a curve with customized compilation options so that we can compare the perf improvement of new features side by side. E.g. for mix-order reduction, by running the following command

```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```

I get the following output:

```
Geomean speedup for benchmark RMSNormBackward
eager 11 data points
compiled 11 data points, 15.82x speedup
quack 11 data points, 15.45x speedup
liger 11 data points, 14.06x speedup
compiled-no-fusion 11 data points, 10.26x speedup
```

The output shows that the feature improves perf by `15.82 / 10.26 = 1.54x` on average across the shapes tested. (I removed the (32768, 32768) shape, whose rnumel is too large to be representative.) The new curve also shows up in the figure:
[Figure: RMSNormBackward_bench — https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492]
Pull Request resolved: #166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675
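For a quick A/B outside the benchmark script, the same toggle can presumably be passed as an inductor option to `torch.compile`. Only the option name `triton.mix_order_reduction` comes from the benchmark command above; the rest of the sketch (the RMSNorm definition, shapes) is illustrative.

```python
# Hedged sketch: A/B the fusion on an RMSNorm-style backward. Only the option
# name comes from the benchmark command above; everything else is illustrative.
import torch

def rmsnorm(x, w, eps: float = 1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * w

x = torch.randn(4096, 2048, device="cuda", requires_grad=True)
w = torch.randn(2048, device="cuda", requires_grad=True)

# Default compile (mix-order reduction enabled where applicable).
torch.compile(rmsnorm)(x, w).sum().backward()

# Recompile with the fusion disabled for comparison; reset dynamo first so
# cached artifacts from the previous compile are not reused.
torch._dynamo.reset()
no_fusion = torch.compile(rmsnorm, options={"triton.mix_order_reduction": False})
no_fusion(x, w).sum().backward()
```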
Stack from ghstack (oldest at bottom):
More aggressive mix-order reductions, so that when rnumel is larger than 1024 we can still generate the fused kernel. Also use more warps in that case.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela
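For context, the workload this targets (as exercised by the RMSNormBackward benchmark in this stack) combines two reductions with different loop orders in one backward: dweight reduces over the batch dimension while dx reduces over the hidden (rnumel) dimension. A plain-eager sketch of those two reductions is below; the shapes are illustrative and the gradient formulas are just the standard RMSNorm backward, not inductor's generated code.

```python
# Hedged sketch of the two differently-ordered reductions in an RMSNorm backward.
# Shapes are illustrative; inductor generates the actual fused Triton kernel.
import torch

B, H = 2048, 4096          # rnumel here is H, the per-row reduction size
x = torch.randn(B, H)
w = torch.randn(H)
dy = torch.randn(B, H)

rstd = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)
xhat = x * rstd

# Reduction #1: over the batch dimension (one value per hidden column).
dw = (dy * xhat).sum(dim=0)

# Reduction #2: over the hidden dimension (one value per row).
c = (dy * w * xhat).mean(dim=-1, keepdim=True)
dx = (dy * w - xhat * c) * rstd

# Without the fusion these two reductions land in separate kernels because they
# reduce along different axes; per the description above, this PR keeps
# generating the fused kernel even when the per-row reduction size (rnumel)
# exceeds 1024, using more warps in that case.
```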