Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157743
Note: links to docs will display an error until the docs builds have completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view them below:
✅ You can merge normally! (1 Unrelated Failure.) As of commit 8d48a11 with merge base b8be796, the following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/_inductor/config.py (outdated)

```python
tile_reductions: bool = False

# Codegen matmul natively with tl.dot without calling template.
enable_native_matmul: bool = False
```
Can you update the PR to turn this on so we can do a full CI run with it enabled to check for bugs?
(After CI is passing we can turn it off again)
I've just enabled it, but I'm not fully confident about the performance because of the potential overhead from the reshape and transpose operations. Back in March, the Triton compiler didn't handle these operations efficiently, which resulted in slower performance. To work around this, I had to modify Inductor to emit alternative code, which I had originally planned to include in a follow-up PR. The kernels this PR currently generates look like this:
```python
tmp0 = tl.load(in_ptr0 + (r0_2 + 128 * y0), r0_mask & ymask, eviction_policy='evict_last', other=0.0)
tmp1 = tl.load(in_ptr1 + (x1 + 128 * r0_2), r0_mask & xmask, eviction_policy='evict_last', other=0.0)
tmp2 = tl.dot(tl.reshape(tmp0, [YBLOCK, R0_BLOCK]), tl.trans(tl.reshape(tmp1, [XBLOCK, R0_BLOCK])), allow_tf32=False)
```
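For contrast, here is a rough, self-contained sketch of a Triton matmul kernel that loads 2D tiles directly, so `tl.dot` needs no `tl.reshape`/`tl.trans`. This is my illustration of the idea, not code from this PR or its planned follow-up; the names, block sizes, and the divisible-shape assumption are made up to keep the example short.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def mm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        rk = k + tl.arange(0, BLOCK_K)
        # 2D loads: the tiles already have the shapes tl.dot expects.
        a = tl.load(a_ptr + rm[:, None] * K + rk[None, :])  # [BLOCK_M, BLOCK_K]
        b = tl.load(b_ptr + rk[:, None] * N + rn[None, :])  # [BLOCK_K, BLOCK_N]
        acc += tl.dot(a, b)
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], acc.to(tl.float16))


def mm(a, b):
    # Assumes contiguous fp16 inputs with sizes divisible by the block sizes.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (M // 64, N // 64)
    mm_kernel[grid](a, b, c, M, N, K, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

Presumably the lazy-broadcasting follow-up would move Inductor's codegen closer to this shape, where the loads themselves produce tiles in the layout `tl.dot` consumes.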
I haven't looked at this super carefully yet, but I kicked off a benchmark run with it enabled here: It should show up in the dropdown (nullplay_fuse_matmul) here once the job finishes:
I approved CI. The benchmark run is done; it looks like a number of models are failing: The other benchmark suites can be viewed by selecting "nullplay_fuse_matmul" in the branch dropdown (+ training/inference).
I fixed a few bugs and pushed the changes again. Could you re-run the CI and performance benchmarks? Just to confirm: there's no way for me to trigger the CI myself, right? Or is there a way to run the tests locally on my end?
To add the ciflow label, the workflows first need to be approved. This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
Something is odd with CI (in this PR and a few others). I don't see any jobs to approve. There is also a merge conflict. Can you rebase? That will hopefully fix the CI issue.
This is to match what eager PyTorch does for pointwise ops. Most of those ops are memory bound, so the upcast to fp32 doesn't matter for performance. For matmuls that won't work; we should modify the upcast logic to not apply to matmuls.
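A rough illustration of the matmul side of that argument (my own timing sketch, not code from this PR; the shapes, iteration count, and the CUDA-device assumption are arbitrary). A pointwise op like relu moves the same bytes regardless of compute dtype, so doing the math in fp32 costs little, whereas upcasting matmul operands to fp32 gives up the fp16 tensor-core path:

```python
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

def time_ms(fn, iters=20):
    # Minimal CUDA-event timing helper for the comparison below.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm-up
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print("fp16 matmul:", time_ms(lambda: a @ b), "ms")
print("matmul with inputs upcast to fp32:", time_ms(lambda: a.float() @ b.float()), "ms")
```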
I just asked to add permissions for you to trigger CI yourself. You should be able to run tests locally. Failing tests should print out the repro command, and the benchmarks are all in the pytorch/benchmarks folder.
You should have access to start CI now. I kicked off another benchmark run here: https://github.com/pytorch/pytorch/actions/runs/16242184585
Force-pushed: cfed28d → 1793aec
Force-pushed: a491f9a → 93ec802
Force-pushed: ec2e039 → 5086148
I added a few fixes and passes to remove unnecessary changes.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
### Implementation of pytorch#151705

This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly, without relying on predefined templates. To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in pytorch#151705:

1. **Basic support** (this PR)
2. **Lazy broadcasting** for optimal performance (future PR)

### Summary of This PR

This PR implements the basic functionality. It does **not** include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead.

### Notable Changes

1. Adds a new config flag: `config.triton.enable_native_matmul` (a minimal usage sketch follows after this description)
2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled
3. Enforces tiling suitable for matmul when the native matmul flag is enabled
4. Implements code generation for `ops.dot`
5. Adds Triton autotuning heuristics: for now, I've copied the configuration from the existing matmul templates. However, this may not be optimal; it currently takes a long time to tune, and I think there must be a better way to tackle this.

@eellison @jansel @PaulZhang12 @shunting314

Pull Request resolved: pytorch#157743
Approved by: https://github.com/jansel
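A minimal usage sketch for the new flag (the flag name comes from this PR; the rest is ordinary `torch.compile` usage, and the shapes, dtypes, and device are just for illustration):

```python
import torch
import torch._inductor.config as inductor_config

# Flag added by this PR; it defaults to False, which keeps the existing
# template-based matmul lowering.
inductor_config.triton.enable_native_matmul = True

@torch.compile
def mm(a, b):
    return a @ b

a = torch.randn(512, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 512, device="cuda", dtype=torch.float16)
out = mm(a, b)  # aten.mm is lowered to ops.dot, which emits tl.dot directly
```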
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela @mlazos