
Native matmul #157743

Closed
nullplay wants to merge 94 commits into pytorch:main from nullplay:fuse_matmul

Conversation

nullplay (Collaborator) commented Jul 7, 2025

Implementation of #151705

This PR introduces the initial implementation of native tl.dot support in Inductor, with the goal of generating Triton matmul kernels directly—without relying on predefined templates.

To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705:

  1. Basic support (this PR)
  2. Lazy broadcasting for optimal performance (future PR)

Summary of This PR

This PR implements the basic functionality. It does not include lazy broadcasting, so the generated kernels may involve explicit tl.reshape and tl.trans operations before calling tl.dot, which introduces some overhead.

Notable Changes

  1. Adds a new config flag: config.triton.enable_native_matmul (see the usage sketch after this list)
  2. Introduces a new ops.dot IR node in Inductor and lowers aten.mm and aten.bmm to it when native matmul is enabled
  3. Enforces tiling suitable for matmul when the native matmul flag is enabled
  4. Implements code generation for ops.dot
  5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal—it currently takes a long time to tune, and I think there must be a better way to tackle this.
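
To make item 1 concrete, here is a minimal usage sketch (not part of the PR diff itself; it only assumes the config.triton.enable_native_matmul flag added here plus standard torch.compile usage on a CUDA device):

```python
import torch
import torch._inductor.config as inductor_config

# Flag added in this PR: lower aten.mm / aten.bmm to the new ops.dot IR node
# and generate tl.dot directly instead of going through the matmul templates.
inductor_config.triton.enable_native_matmul = True

@torch.compile
def mm(a, b):
    return a @ b  # dispatches to aten.mm

a = torch.randn(1024, 512, device="cuda")
b = torch.randn(512, 256, device="cuda")
out = mm(a, b)

# Running with TORCH_LOGS="output_code" prints the generated Triton kernel,
# which should now contain a tl.dot call rather than a template instantiation.
```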

@eellison @jansel @PaulZhang12 @shunting314

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela @mlazos

pytorch-bot bot commented Jul 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157743

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8d48a11 with merge base b8be796:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

tile_reductions: bool = False

# Codegen matmul natively with tl.dot without calling template.
enable_native_matmul: bool = False
Contributor commented:

Can you update the PR to turn this on so we can do a full CI run with it enabled to check for bugs?

(After CI is passing we can turn it off again)

nullplay (Collaborator, Author) replied:

I’ve just enabled it, but I’m not fully confident about the performance due to the potential overhead from the reshape and transpose operations. Back in March, the Triton compiler didn’t handle these operations efficiently, which resulted in slower performance. To work around this, I had to modify Inductor to emit alternative code, which I had originally planned to include in a follow-up PR.

# Each program loads a [YBLOCK, R0_BLOCK] tile of the first operand and an
# [XBLOCK, R0_BLOCK] tile of the second, then reshapes/transposes them into
# the layout tl.dot expects; these extra ops are the overhead mentioned above.
tmp0 = tl.load(in_ptr0 + (r0_2 + 128 * y0), r0_mask & ymask, eviction_policy='evict_last', other=0.0)
tmp1 = tl.load(in_ptr1 + (x1 + 128 * r0_2), r0_mask & xmask, eviction_policy='evict_last', other=0.0)
tmp2 = tl.dot(tl.reshape(tmp0, [YBLOCK, R0_BLOCK]), tl.trans(tl.reshape(tmp1, [XBLOCK, R0_BLOCK])), allow_tf32=False)

jansel (Contributor) commented Jul 8, 2025

I haven't looked at this super carefully yet, but I kicked off a benchmark run with it enabled here:
https://github.com/pytorch/pytorch/actions/runs/16134785066

It should show up in the dropdown (nullplay_fuse_matmul) here once the job finishes:
https://hud.pytorch.org/benchmark/compilers

jerryzh168 added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label on Jul 8, 2025
nullplay (Collaborator, Author) commented:

I noticed that when doing torch.float16 matmuls, it was automatically upcasting to float32. Disabling config.triton.codegen_upcast_to_fp32 made things faster. I'm not sure what effect this might have on other parts of the code, but I’ve set config.triton.codegen_upcast_to_fp32 = False for now.
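
For reference, a rough sketch of the comparison described above (assumes a CUDA device; flag names as introduced in this PR, timings purely illustrative):

```python
import torch
import torch._dynamo
import torch._inductor.config as inductor_config
from torch.utils import benchmark

inductor_config.triton.enable_native_matmul = True

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

for upcast in (True, False):
    # With upcast enabled, loads are promoted to fp32 and tl.dot runs in fp32;
    # with it disabled, fp16 tiles are fed to tl.dot directly.
    inductor_config.triton.codegen_upcast_to_fp32 = upcast
    torch._dynamo.reset()  # clear caches so the new config takes effect
    fn = torch.compile(lambda x, y: x @ y)
    t = benchmark.Timer(stmt="fn(a, b)", globals={"fn": fn, "a": a, "b": b})
    print(f"codegen_upcast_to_fp32={upcast}: {t.timeit(50)}")
```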

I fixed a few bugs and pushed the changes again. Could you re-run the CI and performance benchmarks?

Just to confirm—there’s no way for me to trigger the CI myself, right? Or is there a way to run the tests locally on my end?

pytorch-bot bot commented Jul 12, 2025

To add the ciflow label ciflow/inductor please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

jansel (Contributor) commented Jul 12, 2025

I fixed a few bugs and pushed the changes again. Could you re-run the CI and performance benchmarks?

Something is odd with CI (in this PR and a few others). I don't see any jobs to approve.

There is also a merge conflict. Can you rebase? That will hopefully fix the CI issue.

I noticed that when doing torch.float16 matmuls, it was automatically upcasting to float32. Disabling config.triton.codegen_upcast_to_fp32 made things faster. I'm not sure what effect this might have on other parts of the code, but I’ve set config.triton.codegen_upcast_to_fp32 = False for now.

This is to match what eager pytorch does for pointwise ops. Most of those ops are memory bound so the upcast to fp32 doesn't matter for performance. For matmuls that won't work. We should modify the upcast logic to not apply to matmuls.
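
For context, a back-of-the-envelope arithmetic-intensity sketch of the memory-bound vs. compute-bound distinction above (sizes are illustrative):

```python
# Square fp16 matmul vs. a pointwise add of the same size.
M = N = K = 4096
bytes_per_elem = 2  # fp16

matmul_flops = 2 * M * N * K                              # one multiply-accumulate per (m, n, k)
matmul_bytes = bytes_per_elem * (M * K + K * N + M * N)   # read A, read B, write C
print("matmul flops/byte:", matmul_flops / matmul_bytes)  # ~1365: compute bound

add_flops = M * N
add_bytes = bytes_per_elem * 3 * M * N                    # read a, read b, write out
print("add flops/byte:", add_flops / add_bytes)           # ~0.17: memory bound
```

At roughly 0.17 flops/byte the add is limited by bandwidth, so upcasting the arithmetic inside the kernel is essentially free, whereas at roughly 1365 flops/byte the matmul is limited by compute, so running tl.dot in fp32 instead of fp16 costs real time.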

Just to confirm—there’s no way for me to trigger the CI myself, right? Or is there a way to run the tests locally on my end?

I just asked to add permissions for you to trigger CI yourself.

You should be able to run tests locally. Failing tests should print out the repro command, and the benchmarks are all in the pytorch/benchmarks folder.

jansel (Contributor) commented Jul 12, 2025

You should have access to start CI now. I kicked off another benchmark run here: https://github.com/pytorch/pytorch/actions/runs/16242184585

nullplay force-pushed the fuse_matmul branch 2 times, most recently from cfed28d to 1793aec on August 1, 2025
pytorch-bot bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) label on Aug 7, 2025
nullplay force-pushed the fuse_matmul branch 3 times, most recently from a491f9a to 93ec802 on August 29, 2025
nullplay force-pushed the fuse_matmul branch 2 times, most recently from ec2e039 to 5086148 on September 1, 2025
jansel added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Oct 10, 2025
nullplay (Collaborator, Author) commented:

I added a few fixes and passes to remove unnecessary changes.
It looks like all the regular and trunk CI tests are passing, except for one unrelated failure.

jansel (Contributor) commented Oct 14, 2025

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025