[ROCm] Enable group gemm through CK #166334
jagadish-amd wants to merge 17 commits into pytorch:main
Conversation
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
Also added comments.
python test/test_matmul_cuda.py -v -k test_grouped_gemm_2d_3d
Ran 24 tests in 5.566s OK
python test/test_matmul_cuda.py -v -k test_grouped_gemm_2d_2d
Ran 24 tests in 5.537s OK
All test cases are passing with forward and backward pass.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166334
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures
As of commit 5df10f1 with merge base b4e4ee8. NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
group_gemm tests are passing on gfx942.
Here are the errors:
error: missing field 'stride_Ds_' initializer [-Werror,-Wmissing-field-initializers]
Looking at the errors, it's due to
@atalman would you like to do a Meta import to verify our changes and prevent more reverts?
Fixes #161366
All four dimension combinations are supported: 2d-2d, 2d-3d, 3d-3d, 3d-2d. The corresponding test cases in test_matmul_cuda pass for both the forward and backward pass.
The CK path is enabled for gfx942 and gfx950.
ToDo: enable support on gfx90a, since the CK kernel used in this commit produces a GPU error there; it may require a different CK kernel config, based on the profiler result on gfx90a.
Pull Request resolved: #166334
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
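For readers unfamiliar with the op, the grouped-GEMM semantics under test can be sketched in plain Python. This is an illustrative reference for the 2d-3d case only, not the PyTorch API or the CK kernel: a single 2d `mat_a` is split into groups along its first dim by cumulative offsets `offs`, and each slice is multiplied with the corresponding 3d slice of `mat_b`.

```python
# Naive reference for grouped GEMM, 2d-3d case, using nested lists.
# Helper names here are hypothetical, chosen for illustration.

def matmul(a, b):
    """Plain O(n^3) matrix multiply on list-of-lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def grouped_mm_2d_3d(mat_a, mat_b, offs):
    """mat_a: (M, K); mat_b: (G, K, N); offs: cumulative row
    boundaries of the G groups along mat_a's first dim."""
    out, start = [], 0
    for g, end in enumerate(offs):
        out.extend(matmul(mat_a[start:end], mat_b[g]))
        start = end
    return out

# Two groups: rows 0-1 of mat_a use mat_b[0], row 2 uses mat_b[1].
A = [[1, 0], [0, 1], [2, 2]]
B = [[[1, 2], [3, 4]],   # group 0: 2x2
     [[5, 6], [7, 8]]]   # group 1: 2x2
print(grouped_mm_2d_3d(A, B, offs=[2, 3]))  # [[1, 2], [3, 4], [24, 28]]
```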
This reverts commit 1fa520e. Reverted #166334 on behalf of https://github.com/atalman due to Internal build failures ([comment](#166334 (comment)))
@atalman how did the internal import go?
lgtm. Internal change is clear. Please fix lint:
Lint / lintrunner-pyrefly-all / linux-job (gh)
Lint for torch/utils/tensorboard/writer.py:
@jeffdaily the lint errors are not related to my changes; can we rerun the CI and attempt to merge?
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/19087534422
@pytorchbot merge -f "lint good, rocm good, upstream broke mi355 tests, merging anyway, fixing separately in #167066"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
#ifndef USE_ROCM
  at::cuda::detail::bf16bf16_grouped_mm(mat_a, mat_b, offs, bias, out);
#else
  at::hip::detail::group_gemm_ck(mat_a, mat_b, offs, bias, out);
#endif
This breaks builds on Windows, where CK is currently not enabled: ROCm/TheRock#2054
2025-11-07T07:51:12.4977670Z [7076/7084] Linking CXX shared library bin\torch_hip.dll
2025-11-07T07:51:12.4977863Z FAILED: [code=4294967295] bin/torch_hip.dll lib/torch_hip.lib
2025-11-07T07:51:12.4979510Z C:\Windows\system32\cmd.exe /C "cd . && C:\home\runner\_work\_tool\Python\3.12.10\x64\Lib\site-packages\cmake\data\bin\cmake.exe -E vs_link_dll --msvc-ver=1944 --intdir=caffe2\CMakeFiles\torch_hip.dir --rc="C:\Program Files (x86)\Windows Kits\10\bin\10.0.26100.0\x64\rc.exe" --mt="C:\Program Files (x86)\Windows Kits\10\bin\10.0.26100.0\x64\mt.exe" --manifests -- C:\home\runner\_work\_tool\Python\3.12.10\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO && cd ."
2025-11-07T07:51:12.4980262Z LINK: command "C:\home\runner\_work\_tool\Python\3.12.10\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
2025-11-07T07:51:12.4980745Z lld-link: error: undefined symbol: void __cdecl at::hip::detail::group_gemm_ck(class at::Tensor const &, class at::Tensor const &, class std::optional<class at::Tensor> const &, class std::optional<class at::Tensor> const &, class at::Tensor &)
2025-11-07T07:51:12.4980868Z
2025-11-07T07:51:12.4981526Z >>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\native\hip\GroupedBlas.cpp.obj:(class at::Tensor __cdecl at::native::_grouped_mm_cuda(class at::Tensor const &, class at::Tensor const &, class std::optional<class at::Tensor> const &, class std::optional<class at::Tensor> const &, class std::optional<enum c10::ScalarType>))
2025-11-07T07:51:12.4981649Z
2025-11-07T07:51:12.4981729Z ninja: build stopped: subcommand failed.
Can this be conditioned on CK being enabled too?
Lines 252 to 253 in 724cd32
Perhaps this code style?
pytorch/aten/src/ATen/cuda/CUDABlas.cpp
Lines 871 to 877 in 724cd32
Sure @ScottTodd , I will raise a PR soon.
A fix-forward is in review at #167403. Can we revert first though? I'm not sure if I have permission to trigger the bot for that...
Confirmed that the fix-forward was successful and our nightly release builds are functional again. Latest status update at ROCm/TheRock#2054 (comment).
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd