[xpu][feature] [3/3] Register the scaled_mm and scaled_mm_v2 for xpu #166056
Stonepia wants to merge 20 commits into pytorch:main
Conversation

🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166056
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit bf9795f with merge base a7dc6da:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorchbot label "module: xpu"
Attention! native_functions.yaml was changed
If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info. Caused by:

Attention! One of the PyTorch C-stable API files was changed
You MUST NOT change existing function declarations in this file, as this header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version of the function. Caused by:

@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Successfully rebased; updated from 817358f to 26189e3.

@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Rebase failed due to Command. Raised by https://github.com/pytorch/pytorch/actions/runs/19121565673

Force-pushed from 9b9f12f to e940420.
This functionality is great! It's not a big difference in API; the main change is that scaling types are explicitly passed to the API, rather than inferred from the input & scale shapes. I'd be happy to talk through the necessary differences if you need.
Thanks for the suggestion! I will refactor the code to support the v2 version. Originally, I thought that v2 was not yet stable enough, so there would be ongoing syncing effort whenever the code changes.
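For context on the API difference described above, here is a minimal sketch (not code from this PR; the shapes, dtypes, and device selection are illustrative assumptions) of how the existing `torch._scaled_mm` call infers the scaling recipe from the scale shapes, which is exactly what the v2 API instead takes as explicit arguments:

```python
# Illustrative only: scalar scales imply tensorwise scaling, (M, 1)/(1, N) scales
# imply rowwise scaling. scaled_mm_v2 passes these scaling types explicitly.
import torch

device = "xpu" if torch.xpu.is_available() else "cuda"
M, K, N = 64, 128, 32

x = torch.randn(M, K, device=device).to(torch.float8_e4m3fn)
y = torch.randn(N, K, device=device).to(torch.float8_e4m3fn).t()  # mat2 must be column-major

# Tensorwise: a single fp32 scale per operand.
out_tensorwise = torch._scaled_mm(
    x, y,
    scale_a=torch.tensor(1.0, device=device),
    scale_b=torch.tensor(1.0, device=device),
    out_dtype=torch.bfloat16,
)

# Rowwise: per-row scales for x, per-column scales for y.
out_rowwise = torch._scaled_mm(
    x, y,
    scale_a=torch.ones(M, 1, device=device),
    scale_b=torch.ones(1, N, device=device),
    out_dtype=torch.bfloat16,
)
```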
test/test_scaled_matmul_cuda.py
Outdated
out_fp8.to(torch.float32), torch.full((M, N), K * (fill_value**2), device=device)
)

@skipXPU
Since this UT has been decorated with onlyCUDA, is skipXPU necessary?
Thanks for the suggestion! Removed.
test/test_scaled_matmul_cuda.py
Outdated
lambda: scaled_mm_wrap(x, y, scale_a, scale_b, out_dtype=torch.float32),
)

@skipXPU
SM100OrLater should have covered skipXPU, right?
Yes, I removed those skipXPU for smaller code changes.
test/test_scaled_matmul_cuda.py
Outdated
if not _device_supports_scaled_mm_fp8(device) or (not torch.xpu.is_available() and IS_WINDOWS):
    raise unittest.SkipTest(f8_msg)
if not torch.xpu.is_available() and not SM89OrLater:
    raise unittest.SkipTest("rowwise implementation is currently sm89-sm100 specific")
if torch.xpu.is_available() and use_fast_accum:
    raise unittest.SkipTest("XPU does not support fast accum yet")
These lines of code could be replaced by:
@unittest.skipIf(not PLATFORM_SUPPORTS_FP8, f8_msg)
@skipCUDAIf(IS_WINDOWS, f8_msg)
@skipCUDAIf(not SM89OrLater, "rowwise implementation is currently sm89-sm100 specific")
@skipXPUIf(use_fast_accum, "XPU does not support fast accum yet")
test/test_scaled_matmul_cuda.py
Outdated
if not _device_supports_scaled_mm_fp8(device) or (not torch.xpu.is_available() and IS_WINDOWS):
    raise unittest.SkipTest(f8_msg)
if not torch.xpu.is_available() and not SM89OrLater:
    raise unittest.SkipTest("rowwise implementation is currently sm89-sm100 specific")
ditto. Please refine the test a little bit.
test/test_scaled_matmul_cuda.py
Outdated
output_dtype
)

@skipXPU
@skipXPU
test/test_scaled_matmul_cuda.py
Outdated
def test_zero_dim_tensorwise(self, which_dim_zero, use_torch_compile, device) -> None:
    if not _device_supports_scaled_mm_fp8(device):
        raise unittest.SkipTest(f8_msg)
Are the code changes due to PLATFORM_SUPPORTS_FP8 not supporting XPU?
Yes, PLATFORM_SUPPORTS_FP8 only covers CUDA. So I wrapped all of these checks:
if device != "cpu" and torch.cuda.is_available() and not PLATFORM_SUPPORTS_FP8:
into a helper function, so that the change only affects this file:
def _device_supports_scaled_mm_fp8(device):
    # Only gate CUDA on PLATFORM_SUPPORTS_FP8; cpu and xpu are handled by their own checks.
    if device not in ['cpu', 'xpu'] and (torch.cuda.is_available() and not PLATFORM_SUPPORTS_FP8):
        return False
    return True
Because of this change, new tests needed to be added:
4b5b0d0
These tests mainly cover scaled_mm, but also other FP8 ops.

@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Successfully rebased; updated from 2d93be5 to 8a17cf4.

To add the ciflow label, please first approve the pending workflows. This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Current Failure:

Merge failed. Reason: 1 job has failed, first few of them are: xpu / linux-noble-xpu-n-py3.10 / test (default, 5, 12, linux.idc.xpu). Details for Dev Infra team: Raised by workflow job.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…xpu (#166056)

This PR registers the `scaled_mm` op for XPU support. It does the following:
1. Registered the `_scaled_mm` and `_scaled_mm_v2` op for XPU.
2. Enables XPU tests in `test_scaled_matmul_cuda.py`.
3. Update torch-xpu-ops pin to remove fallback `scaled_mm` to CPU implementation.

## PR Stack:
- #165978 : implementation of XPU scaled_mm and oneDNN kernel
- #167518 : implementation of XPU scaled_mm_v2
- -> #166056 : Op registration

## Task tracker:
We will track all the scaled_mm related tasks in: #167170

Pull Request resolved: #166056
Approved by: https://github.com/EikanWang, https://github.com/slayton58, https://github.com/drisspg
This PR registers the `scaled_mm` op for XPU support. It does the following:

1. Registered the `_scaled_mm` and `_scaled_mm_v2` op for XPU.
2. Enables XPU tests in `test_scaled_matmul_cuda.py`.
3. Update torch-xpu-ops pin to remove fallback `scaled_mm` to CPU implementation.

PR Stack:
- Register the scaled_mm and scaled_mm_v2 for xpu #166056 : Op registration

Task tracker:
We will track all the scaled_mm related tasks in: #167170
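As a hedged sanity-check sketch (not from the PR itself; it assumes an FP8-capable XPU build and uses illustrative shapes and unit scales), the registration means a call like the following should dispatch to the native XPU kernel instead of falling back to the CPU implementation:

```python
# Assumption-laden sketch: verify _scaled_mm runs natively on an XPU device.
import torch

if torch.xpu.is_available():
    a = torch.randn(32, 64, device="xpu").to(torch.float8_e4m3fn)
    b = torch.randn(16, 64, device="xpu").to(torch.float8_e4m3fn).t()  # column-major mat2
    scale_a = torch.tensor(1.0, device="xpu")  # tensorwise scales
    scale_b = torch.tensor(1.0, device="xpu")

    out = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
    assert out.device.type == "xpu"
    assert out.shape == (32, 16)
```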
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @gujinghui @fengyuan14 @guangyey @chenyang78