[Triton] [Inductor] Add a Blackwell specific Template for persistent matmul #162916

Closed
njriasan wants to merge 9 commits into pytorch:main from njriasan:njriasan/persistent_tma_blackwell_template

njriasan (Contributor) commented Sep 14, 2025

Summary:
This adds the persistent matmul kernel with device-side TMA from the Triton matmul tutorial and registers it as a template option for Blackwell. It uses newer Triton features such as automatic warp specialization and loop flattening, which still have flaws but can improve performance on Blackwell. The epilogue subtiling section of the tutorial is not included; that will come in a follow-up PR.
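For readers unfamiliar with the term, the persistent scheduling idea behind the kernel described above can be sketched in plain Python. This is a CPU-only illustration under stated assumptions, not the template added in this PR: a fixed number of "programs" (standing in for one per SM) each stride over the output tiles, and tiles are visited in a grouped order in the style of the Triton tutorial. The names `cdiv`, `compute_pid`, `persistent_matmul`, and all block sizes are hypothetical, and TMA, warp specialization, and loop flattening are not modeled.

```python
def cdiv(a, b):
    """Ceiling division, used to count tiles along one dimension."""
    return (a + b - 1) // b


def compute_pid(tile_id, num_pid_m, num_pid_n, group_size_m):
    """Map a linear tile id to (pid_m, pid_n) using a grouped ordering.

    Tiles are walked in groups of `group_size_m` tile-rows so that
    consecutive tiles reuse the same columns of B (hypothetical helper).
    """
    num_pid_in_group = group_size_m * num_pid_n
    group_id = tile_id // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last group may contain fewer than group_size_m tile-rows.
    rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
    local = tile_id % num_pid_in_group
    pid_m = first_pid_m + local % rows_in_group
    pid_n = local // rows_in_group
    return pid_m, pid_n


def persistent_matmul(a, b, block_m=2, block_n=2, block_k=2,
                      group_size_m=2, num_sms=3):
    """Compute a @ b for list-of-lists matrices with a persistent tile loop."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0] * n for _ in range(m)]
    num_pid_m, num_pid_n = cdiv(m, block_m), cdiv(n, block_n)
    num_tiles = num_pid_m * num_pid_n

    # On a GPU these "programs" would run concurrently, one per SM.
    for start_pid in range(num_sms):
        # Persistent loop: each program keeps picking up tiles with stride num_sms.
        for tile_id in range(start_pid, num_tiles, num_sms):
            pid_m, pid_n = compute_pid(tile_id, num_pid_m, num_pid_n, group_size_m)
            rows = range(pid_m * block_m, min((pid_m + 1) * block_m, m))
            cols = range(pid_n * block_n, min((pid_n + 1) * block_n, n))
            acc = {(i, j): 0 for i in rows for j in cols}
            # Accumulate over K in blocks, as a tiled kernel would.
            for k0 in range(0, k, block_k):
                for kk in range(k0, min(k0 + block_k, k)):
                    for i in rows:
                        for j in cols:
                            acc[(i, j)] += a[i][kk] * b[kk][j]
            # Epilogue: write the finished tile back to C.
            for (i, j), value in acc.items():
                c[i][j] = value
    return c
```

Calling `persistent_matmul(a, b)` on two small nested lists returns the same product as a naive triple loop. The point of the structure is that the number of launched programs stays fixed at `num_sms` regardless of problem size, so each program amortizes its setup over several tiles.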

This PR doesn't include any tuning. I am doing a larger benchmarking run to determine the best initial configs for tuning and will open a followup PR with better defaults soon.

Test Plan:
Tested on a Blackwell machine with test_max_autotune.py and confirmed the new tests pass.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos

pytorch-bot bot commented Sep 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162916

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 84da1da with merge base 6d64bc3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

NikhilAPatel (Contributor) left a comment

Looks good to me!

njriasan (Contributor, Author) commented

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 15, 2025
pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…matmul (pytorch#162916)

Pull Request resolved: pytorch#162916
Approved by: https://github.com/NikhilAPatel
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025

Labels

Blackwell (Specific failures or issues related to sm100 + Cuda arches), ciflow/h100, ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: inductor, release notes: inductor

3 participants