[Triton] [Inductor] Add a Blackwell specific Template for persistent matmul#162916
Closed
njriasan wants to merge 9 commits into pytorch:main
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/162916
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 84da1da with merge base 6d64bc3. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Contributor (Author):
@pytorchbot merge

(1 similar comment)

Collaborator:
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Commits referencing this pull request (each commit message repeats the PR summary reproduced below and notes "Approved by: https://github.com/NikhilAPatel"):

markc-614 pushed a commit to markc-614/pytorch on Sep 17, 2025.
mansiag05 pushed a commit to mansiag05/pytorch on Sep 22, 2025.
cleonard530 pushed a commit to cleonard530/pytorch on Sep 22, 2025.
dsashidh pushed a commit to dsashidh/pytorch on Sep 26, 2025.
Summary:
This adds the persistent matmul kernel with device-side TMA from the Triton tutorials and registers it as a template option for Blackwell. The kernel uses newer Triton features such as automatic warp specialization and loop flattening, which, while still imperfect, can improve performance on Blackwell. The epilogue subtiling section of the tutorial is not included; that will be a follow-up PR.
This PR doesn't include any tuning. I am doing a larger benchmarking run to determine the best initial configs for tuning and will open a follow-up PR with better defaults soon.
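For readers unfamiliar with the tutorial kernel, here is a minimal sketch of the structure being templated: a persistent outer loop over output tiles with device-side TMA descriptor loads. It assumes a recent Triton build that exposes tl.make_tensor_descriptor and the warp_specialize/flatten options on tl.range, fp16 inputs with B pre-transposed to (N, K) as in the tutorial, and illustrative names throughout; it is not the exact template this PR adds to Inductor.

```python
import torch
import triton
import triton.language as tl

# Device-side TMA descriptors need a host-side allocator for their backing memory.
def _alloc_fn(size: int, alignment: int, stream):
    return torch.empty(size, device="cuda", dtype=torch.int8)

triton.set_allocator(_alloc_fn)

@triton.jit
def persistent_tma_matmul(a_ptr, b_ptr, c_ptr, M, N, K,
                          BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                          BLOCK_K: tl.constexpr, NUM_SMS: tl.constexpr):
    # Build TMA descriptors on the device (device-side TMA).
    a_desc = tl.make_tensor_descriptor(a_ptr, shape=[M, K], strides=[K, 1],
                                       block_shape=[BLOCK_M, BLOCK_K])
    b_desc = tl.make_tensor_descriptor(b_ptr, shape=[N, K], strides=[K, 1],
                                       block_shape=[BLOCK_N, BLOCK_K])
    c_desc = tl.make_tensor_descriptor(c_ptr, shape=[M, N], strides=[N, 1],
                                       block_shape=[BLOCK_M, BLOCK_N])

    start_pid = tl.program_id(0)
    num_tiles_n = tl.cdiv(N, BLOCK_N)
    num_tiles = tl.cdiv(M, BLOCK_M) * num_tiles_n
    k_tiles = tl.cdiv(K, BLOCK_K)

    # Persistent loop: each program walks over several output tiles.
    # flatten=True and warp_specialize=True are the newer Triton features
    # the summary refers to (the tutorial also uses grouped tile
    # scheduling, omitted here for brevity).
    for tile_id in tl.range(start_pid, num_tiles, NUM_SMS,
                            flatten=True, warp_specialize=True):
        off_m = (tile_id // num_tiles_n) * BLOCK_M
        off_n = (tile_id % num_tiles_n) * BLOCK_N
        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for k in range(k_tiles):
            a = a_desc.load([off_m, k * BLOCK_K])  # (BLOCK_M, BLOCK_K)
            b = b_desc.load([off_n, k * BLOCK_K])  # (BLOCK_N, BLOCK_K)
            acc = tl.dot(a, b.T, acc)
        c_desc.store([off_m, off_n], acc.to(tl.float16))

# Launch with a grid capped at the SM count, e.g.:
#   NUM_SMS = torch.cuda.get_device_properties("cuda").multi_processor_count
#   grid = (min(NUM_SMS, triton.cdiv(M, BLOCK_M) * triton.cdiv(N, BLOCK_N)),)
```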
Test Plan:
Ran test_max_autotune.py on a Blackwell machine and confirmed the new tests pass.
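One way to exercise the template end to end (a hypothetical repro, not taken from this PR) is to compile a matmul with max-autotune so Inductor autotunes over its Triton templates on a Blackwell GPU:

```python
import torch

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b

# On a Blackwell machine, autotuning can now consider the persistent
# device-side-TMA template alongside the existing matmul choices.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
out = mm(a, b)
```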
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos