
Add scaled_grouped_mm_v2 and python API#165154

Closed
slayton58 wants to merge 9 commits into gh/slayton58/27/base from gh/slayton58/27/head

Conversation

@slayton58
Contributor

@slayton58 slayton58 commented Oct 10, 2025

Stack from ghstack (oldest at bottom):


cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki

Summary:

* Add `torch._scaled_grouped_mm_v2` with more functionality and
  extensibility for future formats
* Add `torch.nn.functional.scaled_grouped_mm` as public entrypoint
* Test both original and v2 functionality

Test Plan:

```
pytest -svv -k grouped test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
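For readers landing here from the release notes, the semantics of a scaled, grouped matmul can be sketched in plain Python. This is illustrative only: it uses tensor-wise scales and nested lists instead of tensors, while the real op also supports row-wise and MX block-wise scales and runs fused kernels.

```python
# Pure-Python reference for a scaled, grouped matmul: one independent GEMM
# per group, with (here) a single scale per operand, folded in after the
# accumulation -- i.e. out[g] = (sa[g] * sb[g]) * (A[g] @ B[g]).
def scaled_grouped_mm_ref(groups_a, groups_b, scales_a, scales_b):
    results = []
    for a, b, sa, sb in zip(groups_a, groups_b, scales_a, scales_b):
        rows, inner, cols = len(a), len(b), len(b[0])
        out = [[sa * sb * sum(a[i][t] * b[t][j] for t in range(inner))
                for j in range(cols)] for i in range(rows)]
        results.append(out)
    return results

# Two groups of 2x2 matrices:
res = scaled_grouped_mm_ref(
    [[[1, 2], [3, 4]], [[1, 0], [0, 1]]],   # A groups
    [[[1, 0], [0, 1]], [[5, 6], [7, 8]]],   # B groups
    [2.0, 1.0],                             # per-group scales for A
    [1.0, 1.0],                             # per-group scales for B
)
assert res[0] == [[2.0, 4.0], [6.0, 8.0]]   # identity B, scale 2
assert res[1] == [[5.0, 6.0], [7.0, 8.0]]   # identity A
```

The grouping is what makes this useful for MoE-style workloads: each expert gets its own GEMM and its own quantization scales, without padding all groups to one shape.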

@pytorch-bot

pytorch-bot bot commented Oct 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165154

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 319da7a with merge base 3a110c9:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

slayton58 added a commit that referenced this pull request Oct 10, 2025
Summary:

* Add `torch._scaled_grouped_mm_v2` with more functionality and
  extensibility for future formats
* Add `torch.nn.functional.scaled_grouped_mm` as public entrypoint
* Test both original and v2 functionality

Test Plan:

```
pytest -svv -k grouped test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>

ghstack-source-id: 7c9c6b3
Pull Request resolved: #165154
@slayton58
Contributor Author

@pytorchbot label "release notes: quantization"

@pytorch-bot pytorch-bot bot added the release notes: quantization release notes category label Oct 10, 2025
@github-actions
Contributor

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


slayton58 added a commit that referenced this pull request Oct 10, 2025

ghstack-source-id: 92775cb
Pull Request resolved: #165154
slayton58 added a commit that referenced this pull request Oct 10, 2025

ghstack-source-id: 92775cb
Pull Request resolved: #165154
slayton58 added a commit that referenced this pull request Oct 10, 2025

ghstack-source-id: 70b4b2d
Pull Request resolved: #165154
@drisspg drisspg added module: nn Related to torch.nn topic: new features topic category labels Oct 10, 2025

// NOTE(slayton): For sub-1B formats want contraction_dim argument?
if (!a_is_2d || !b_is_2d) {
TORCH_CHECK_VALUE(mat_a.size(-1) == mat_b.size(-2), "contraction dimension of mat_a and mat_b must match");
Contributor Author
@danielvegamyhre In real-world use-cases, do you end up needing to pass any transposed inputs? If so, I think we want a contraction_dim argument, as .t() becomes non-"free" for sub-1B formats (like e2m1x2)

Contributor
(sorry missed this comment somehow) - discussed this offline but for posterity's sake:

  • fbgemm mx8mx8bf16 grouped mm API requires the B tensor be non-transposed (e.g., E,N,K)
  • torch._scaled_grouped_mm requires the B tensor be pre-transposed (e.g., E,K,N) so before dispatching to fbgemm we do a B.transpose(-2, -1)

> If so, I think we want a contraction_dim argument, as .t() becomes non-"free" for sub-1B formats (like e2m1x2)

Could we just update fbgemm to have a consistent API with torch._scaled_grouped_mm, accepting B as pre-transposed? Would need to take a look at how this affects perf, but I believe it should theoretically be fine.

Contributor Author
In this case that'd do it, but in general we do want to be able to support the full matrix of potential contractions (just like regular gemm does with its T/NT modes), so both cases should ideally work.
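The transpose question above can be made concrete with a tiny sketch (plain nested lists so it runs standalone; in the real code path this is just `B.transpose(-2, -1)` on a tensor, which is a free view for 1-byte-and-larger dtypes but implies an actual data shuffle for packed sub-byte formats like e2m1x2):

```python
def transpose_last_two(b):
    # (E, N, K) -> (E, K, N): fbgemm's mx8mx8bf16 grouped mm takes B as
    # (E, N, K), while torch._scaled_grouped_mm expects it pre-transposed
    # as (E, K, N), hence the transpose inserted before dispatch.
    return [[list(col) for col in zip(*mat)] for mat in b]

E, N, K = 2, 3, 4
b_fbgemm = [[[0.0] * K for _ in range(N)] for _ in range(E)]  # (E, N, K)
b_pre_t = transpose_last_two(b_fbgemm)                        # (E, K, N)
assert len(b_pre_t) == E
assert len(b_pre_t[0]) == K and len(b_pre_t[0][0]) == N
```

A `contraction_dim` argument would let the kernel consume either layout directly, avoiding the shuffle entirely for packed formats.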

// MXFP8 expects float8_e8m0fnu scales.
TORCH_CHECK_VALUE(scale_a[0].scalar_type() == at::kFloat8_e8m0fnu && scale_b[0].scalar_type() == at::kFloat8_e8m0fnu,
"For MXFP8 grouped gemm, both scales must be float8_e8m0fnu tensors.");
TORCH_CHECK_VALUE(swizzle_a_enum[0] == SwizzleType::SWIZZLE_32_4_4 && swizzle_b_enum[0] == SwizzleType::SWIZZLE_32_4_4,
Contributor Author
Note to self: ROCM doesn't need swizzle afaik
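For context on the check above: float8_e8m0fnu (the OCP MX E8M0 scale encoding) carries only an 8-bit biased exponent, so every representable scale is a power of two; there is no sign, zero, or mantissa, and 0xFF is the single NaN. A minimal stdlib decoder, as a sketch of that encoding:

```python
def e8m0_to_float(b: int) -> float:
    # E8M0: 8 exponent bits, bias 127, no sign or mantissa bits.
    # value = 2**(b - 127); the single NaN encoding is 0xFF.
    assert 0 <= b <= 255
    if b == 255:
        return float("nan")
    return 2.0 ** (b - 127)

assert e8m0_to_float(127) == 1.0    # 2**0
assert e8m0_to_float(128) == 2.0    # 2**1
assert e8m0_to_float(0) == 2.0 ** -127
```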

Contributor
@drisspg drisspg Oct 13, 2025
Contributor Author
My reading is the same as yours @drisspg, and I'll make the appropriate change in the code, but a confirmation would be very welcome!

slayton58 added a commit that referenced this pull request Oct 13, 2025

ghstack-source-id: 1e34dca
Pull Request resolved: #165154
@albanD albanD removed their request for review October 13, 2025 15:24
slayton58 added a commit that referenced this pull request Oct 13, 2025

ghstack-source-id: 458ff5e
Pull Request resolved: #165154
@drisspg drisspg added the ciflow/rocm Trigger "default" config CI on ROCm label Oct 13, 2025
slayton58 added a commit that referenced this pull request Oct 13, 2025

ghstack-source-id: 498808e
Pull Request resolved: #165154
slayton58 added a commit that referenced this pull request Oct 13, 2025

ghstack-source-id: 4e65740
Pull Request resolved: #165154
slayton58 added a commit that referenced this pull request Oct 13, 2025

ghstack-source-id: 4e65740
Pull Request resolved: #165154
slayton58 added a commit that referenced this pull request Oct 14, 2025

ghstack-source-id: c39fb9a
Pull Request resolved: #165154
@slayton58
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot pushed a commit that referenced this pull request Oct 14, 2025

ghstack-source-id: 554cdd1
Pull Request resolved: #165154
@pytorchmergebot
Collaborator

Successfully rebased gh/slayton58/27/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/165154)

@slayton58
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 15, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025

Pull Request resolved: pytorch#165154
Approved by: https://github.com/drisspg, https://github.com/danielvegamyhre
@github-actions github-actions bot deleted the gh/slayton58/27/head branch November 15, 2025 02:15

Labels

ciflow/b200, ciflow/h100, ciflow/rocm (Trigger "default" config CI on ROCm), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: nn (Related to torch.nn), release notes: quantization (release notes category), topic: new features (topic category)
