[WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern #161680
bbeckca wants to merge 1 commit into pytorch:main
Conversation
This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @bbeckca, please follow step 2 of the internal wiki to get write access so you do not need CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161680
Note: links to docs will display an error until the doc builds have completed.
✅ No failures as of commit 5eb0556 with merge base d25c35d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D80882442
Force-pushed from 57df3f6 to 95bedc7.
[WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (pytorch#161680)

Summary:
What: Enables CUDA support for the int8_mm WOQ optimization pattern by:
- Fixing the dtype conversion in weight_int8pack_mm_kernel to match the CPU implementation
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend WOQ to more device types

Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```

Rollback Plan:

Reviewed By: jerryzh168

Differential Revision: D80882442
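For context, this pattern rewrite replaces a dequantize-matmul-scale sequence with the fused `_weight_int8pack_mm` op. Below is a minimal eager-mode sketch of that equivalence on CUDA; it is not code from this PR. The shapes are borrowed from the test name in the CI log quoted later in this thread, and the tolerances are illustrative assumptions.

```python
# Minimal sketch (not from this PR): eager-mode equivalence between the
# decomposed int8 weight-only-quant pattern and the fused op it rewrites to.
import torch

if torch.cuda.is_available():
    M, K, N = 32, 144, 65  # batch, in_features, out_features (from the test name)
    x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
    w_int8 = torch.randint(-128, 128, (N, K), device="cuda", dtype=torch.int8)
    scales = torch.rand(N, device="cuda", dtype=torch.bfloat16)

    # Decomposed pattern: dequantize the int8 weight, matmul, per-channel scale.
    ref = (x @ w_int8.to(x.dtype).t()) * scales

    # Fused weight-only-quantized matmul that the pattern is replaced with.
    out = torch._weight_int8pack_mm(x, w_int8, scales)

    torch.testing.assert_close(out, ref, rtol=1e-1, atol=1e-1)
```

The OSS counterpart of the buck2 target above appears to be test/inductor/test_cuda_select_algorithm.py, per the CI log quoted later in this thread.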
Force-pushed from 95bedc7 to 99c435e.
Force-pushed from 99c435e to 46dee89.
@pytorchbot label "topic: not user facing"
Hi @jerryzh168 @danielvegamyhre, this PR corresponds to the internal diff D80882442, which has already been accepted. Would you also be able to sign off here on the OSS side?
Force-pushed from 46dee89 to 774b70d.
Force-pushed from 774b70d to 6622d4d.
Force-pushed from eb8b3e3 to de22a20.
Force-pushed from 01640a5 to 86cb1f0.
Force-pushed from 86cb1f0 to 5d90033.
Force-pushed from 5d90033 to 5eb0556.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merged as: [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (pytorch#161680)
Pull Request resolved: pytorch#161680
Approved by: https://github.com/jerryzh168
Follow-up: the CUDA test added here was later disabled in #165147 because of a memory leak.

Fixes #ISSUE_NUMBER
Failing due to memory leak, e.g. https://github.com/pytorch/pytorch/actions/runs/18401518298/job/52434584458
```
_ TestSelectAlgorithmCudaCUDA.test_int8_woq_mm_cuda_batch_size_32_mid_dim_8_in_features_144_out_features_65_cuda_bfloat16 _
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3278, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3278, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3277, in wrapper
    with policy():
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2654, in __exit__
    raise RuntimeError(msg)
RuntimeError: CUDA driver API confirmed a leak in __main__.TestSelectAlgorithmCudaCUDA.test_int8_woq_mm_cuda_batch_size_32_mid_dim_8_in_features_144_out_features_65_cuda_bfloat16! Caching allocator allocated memory was 19456 and is now reported as 29184 on device 0. CUDA driver allocated memory was 356712448 and is now 358809600.

To execute this test, run the following from the base repo dir:
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_cuda_select_algorithm.py TestSelectAlgorithmCudaCUDA.test_int8_woq_mm_cuda_batch_size_32_mid_dim_8_in_features_144_out_features_65_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
Got added in #161680
Pull Request resolved: #165147
Approved by: https://github.com/bbeckca
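For anyone investigating the leak outside the CI harness: the harness compares allocator state before and after the test, roughly as in the sketch below. The function name and structure are assumptions for illustration, not the actual harness code.

```python
# Rough sketch of a manual CUDA memory-leak check, approximating what
# PYTORCH_TEST_CUDA_MEM_LEAK_CHECK automates. Not the actual harness code.
import torch

def assert_no_cuda_leak(fn, device: int = 0) -> None:
    torch.cuda.synchronize(device)
    torch.cuda.empty_cache()
    before = torch.cuda.memory_allocated(device)
    fn()
    torch.cuda.synchronize(device)
    torch.cuda.empty_cache()
    after = torch.cuda.memory_allocated(device)
    if after > before:
        raise RuntimeError(f"possible leak: {before} -> {after} bytes still allocated")
```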
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben