[WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern #161680

Closed

bbeckca wants to merge 1 commit into pytorch:main from bbeckca:export-D80882442

Contributor

bbeckca commented Aug 28, 2025 •

edited by pytorch-bot bot

Summary:
What: Enables CUDA support for the int8_mm WOQ optimization pattern by:

- Fixing dtype conversion in weight_int8pack_mm_kernel to match CPU
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend weight-only quantization (WOQ) to more device types
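For readers unfamiliar with the op, the arithmetic behind an int8 weight-only-quantized matmul can be sketched in plain Python. This is an illustration of the pattern only, not the kernel touched by this PR: the function names, the symmetric per-output-channel quantization scheme, and the `[N, K]` weight layout are assumptions made for the example. It shows where dtype conversion enters: the int8 weight is converted to the activation's floating-point type before the multiply-accumulate, and the scale is applied once per output channel.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_channel(weight: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-output-channel int8 quantization (illustrative only)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Guard against an all-zero row so the division below is safe.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def woq_int8_mm(x: Matrix, q_weight: List[List[int]], scales: List[float]) -> Matrix:
    """Compute y[m][n] = scales[n] * sum_k x[m][k] * q_weight[n][k]."""
    out: Matrix = []
    for x_row in x:
        out_row = []
        for q_row, scale in zip(q_weight, scales):
            # The int8 value is converted to float before the
            # multiply-accumulate; the scale is applied after the reduction.
            acc = sum(a * float(q) for a, q in zip(x_row, q_row))
            out_row.append(acc * scale)
        out.append(out_row)
    return out
```

A backend that converts the weight or applies the scale at a different point can produce slightly different low-precision results, which is one reason CPU and CUDA paths are usually kept aligned on the conversion order.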

Test Plan:

buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm

Rollback Plan:

Reviewed By: jerryzh168

Differential Revision: D80882442
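The second item in the summary, pattern validation that accepts CUDA devices, can be pictured as a predicate over tensor metadata. The sketch below is hypothetical: `TensorMeta`, `is_valid_woq_pattern`, and the specific dtype requirements are assumptions for illustration and are not taken from the Inductor source. Only the idea of widening the allowed device set from CPU to CPU plus CUDA comes from this PR.

```python
from dataclasses import dataclass

# Assumption: the device types accepted after this PR.
SUPPORTED_DEVICE_TYPES = ("cpu", "cuda")


@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical stand-in for the metadata a pattern check inspects."""

    device_type: str
    dtype: str


def is_valid_woq_pattern(x: TensorMeta, weight: TensorMeta, scales: TensorMeta) -> bool:
    """Return True if the matched subgraph may be rewritten to the int8 WOQ op."""
    same_device = x.device_type == weight.device_type == scales.device_type
    return (
        same_device
        and x.device_type in SUPPORTED_DEVICE_TYPES
        # Assumed dtype constraints: bfloat16 activations and scales, int8 weight.
        and x.dtype == "bfloat16"
        and weight.dtype == "int8"
        and scales.dtype == "bfloat16"
    )
```

Keeping the device list in one constant means extending the pattern to another backend is a one-line change, provided the backend's kernel matches the reference numerics.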

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

bbeckca requested review from eqy and syed-ahmed as code owners

August 28, 2025 01:55

pytorch-bot bot commented Aug 28, 2025

This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @bbeckca, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.

pytorch-bot bot added the module: inductor label

pytorch-bot bot commented Aug 28, 2025 •

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161680

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5eb0556 with merge base d25c35d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

facebook-github-bot commented Aug 28, 2025

This pull request was exported from Phabricator. Differential Revision: D80882442

facebook-github-bot added the fb-exported label

bbeckca force-pushed the export-D80882442 branch from 57df3f6 to 95bedc7 Compare

August 29, 2025 17:07

Contributor

facebook-github-bot commented Aug 29, 2025

This pull request was exported from Phabricator. Differential Revision: D80882442

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

95bedc7

…pytorch#161680)

Summary:

What: Enables CUDA support for int8_mm woq optimization pattern by:

- Fixing dtype conversion in weight_int8pack_mm_kernel to match CPU
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend WOQ to more device types

Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```

Rollback Plan:

Reviewed By: jerryzh168

Differential Revision: D80882442

bbeckca force-pushed the export-D80882442 branch from 95bedc7 to 99c435e Compare

August 29, 2025 17:10

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

99c435e

…pytorch#161680)

Contributor

facebook-github-bot commented Aug 29, 2025

This pull request was exported from Phabricator. Differential Revision: D80882442

bbeckca force-pushed the export-D80882442 branch from 99c435e to 46dee89 Compare

August 29, 2025 20:33

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

46dee89

…pytorch#161680)

Contributor

facebook-github-bot commented Aug 29, 2025

This pull request was exported from Phabricator. Differential Revision: D80882442

Contributor Author

bbeckca commented Aug 29, 2025

@pytorchbot label "topic: not user facing"

pytorch-bot bot added the topic: not user facing label

Contributor Author

bbeckca commented Aug 29, 2025

Hi @jerryzh168 @danielvegamyhre, this PR corresponds with the internal diff D80882442 which has already been accepted. Wondering if you're also able to sign off here on the OSS side?

bbeckca force-pushed the export-D80882442 branch from 46dee89 to 774b70d Compare

September 3, 2025 21:03

Contributor

facebook-github-bot commented Sep 3, 2025

This pull request was exported from Phabricator. Differential Revision: D80882442

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

774b70d

…pytorch#161680)

bbeckca force-pushed the export-D80882442 branch from 774b70d to 6622d4d Compare

September 7, 2025 21:36

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

8a39b47

…pytorch#161680)

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

6622d4d

…pytorch#161680)

Contributor

facebook-github-bot commented Sep 7, 2025

This pull request was exported from Phabricator. Differential Revision: D80882442

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

eb8b3e3

…pytorch#161680)

bbeckca force-pushed the export-D80882442 branch 2 times, most recently from eb8b3e3 to de22a20 Compare

September 8, 2025 21:37

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

de22a20

…pytorch#161680)

bbeckca force-pushed the export-D80882442 branch from 01640a5 to 86cb1f0 Compare

September 11, 2025 01:11

jerryzh168 approved these changes

View reviewed changes

pytorch-bot bot added the ciflow/trunk label

bbeckca force-pushed the export-D80882442 branch from 86cb1f0 to 5d90033 Compare

September 12, 2025 22:53

pytorch-bot bot removed the ciflow/trunk label

Contributor

facebook-github-bot commented Sep 12, 2025

@bbeckca has exported this pull request. If you are a Meta employee, you can view the originating diff in D80882442.

facebook-github-bot added the meta-exported label

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

5d90033

…pytorch#161680)

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

be54ed8

…pytorch#161680)


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

5eb0556

…pytorch#161680)

bbeckca force-pushed the export-D80882442 branch from 5d90033 to 5eb0556 Compare

September 12, 2025 23:30

Contributor

facebook-github-bot commented Sep 12, 2025

@bbeckca has exported this pull request. If you are a Meta employee, you can view the originating diff in D80882442.

bbeckca added a commit to bbeckca/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

850cfcc

…pytorch#161680)

Contributor

jerryzh168 commented Sep 17, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label

pytorchmergebot added the merging label

Collaborator

pytorchmergebot commented Sep 17, 2025

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Collaborator

pytorchmergebot commented Sep 17, 2025

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

Contributor

facebook-github-bot commented Sep 17, 2025

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

Collaborator

pytorchmergebot commented Sep 17, 2025

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot closed this in c52c405

pytorchmergebot added Merged and removed merging labels

mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

…pytorch#161680)

Summary:
What: Enables CUDA support for int8_mm woq optimization pattern by:

- Fixing dtype conversion in weight_int8pack_mm_kernel to match CPU
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend WOQ to more device types

Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```

Rollback Plan:

Reviewed By: jerryzh168

Differential Revision: D80882442

Pull Request resolved: pytorch#161680
Approved by: https://github.com/jerryzh168

cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

547d336

…pytorch#161680)

dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request


          [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (…

5ef236f

…pytorch#161680)

clee2000 mentioned this pull request

Disable failing test_int8_woq_mm_cuda on slow grad check #165147

Closed

pytorchmergebot pushed a commit that referenced this pull request


          Disable failing test_int8_woq_mm_cuda on slow grad check (#165147)

0055f07

Fixes #ISSUE_NUMBER
Failing due to a memory leak, e.g.
https://github.com/pytorch/pytorch/actions/runs/18401518298/job/52434584458

```
2025-10-10T11:07:42.9485277Z _ TestSelectAlgorithmCudaCUDA.test_int8_woq_mm_cuda_batch_size_32_mid_dim_8_in_features_144_out_features_65_cuda_bfloat16 _
2025-10-10T11:07:42.9485389Z Traceback (most recent call last):
2025-10-10T11:07:42.9485869Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3278, in wrapper
2025-10-10T11:07:42.9485966Z     method(*args, **kwargs)
2025-10-10T11:07:42.9486365Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3278, in wrapper
2025-10-10T11:07:42.9486454Z     method(*args, **kwargs)
2025-10-10T11:07:42.9486849Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3277, in wrapper
2025-10-10T11:07:42.9486933Z     with policy():
2025-10-10T11:07:42.9487380Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2654, in __exit__
2025-10-10T11:07:42.9487473Z     raise RuntimeError(msg)
2025-10-10T11:07:42.9488533Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestSelectAlgorithmCudaCUDA.test_int8_woq_mm_cuda_batch_size_32_mid_dim_8_in_features_144_out_features_65_cuda_bfloat16! Caching allocator allocated memory was 19456 and is now reported as 29184 on device 0. CUDA driver allocated memory was 356712448 and is now 358809600.
2025-10-10T11:07:42.9488543Z
2025-10-10T11:07:42.9488722Z To execute this test, run the following from the base repo dir:
2025-10-10T11:07:42.9489520Z     PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_cuda_select_algorithm.py TestSelectAlgorithmCudaCUDA.test_int8_woq_mm_cuda_batch_size_32_mid_dim_8_in_features_144_out_features_65_cuda_bfloat16
2025-10-10T11:07:42.9489525Z
2025-10-10T11:07:42.9489748Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
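The traceback above shows the failure being raised from a context manager's `__exit__` that wraps the test body. A minimal sketch of that style of check is given below, assuming a simple "bytes allocated" counter. `FakeAllocator` and `LeakCheck` are hypothetical names for illustration and this is not the implementation in `common_utils.py`.

```python
class FakeAllocator:
    """Hypothetical stand-in for an allocator's 'bytes currently allocated' counter."""

    def __init__(self, allocated: int = 0) -> None:
        self.allocated = allocated


class LeakCheck:
    """Raise RuntimeError on exit if allocated bytes grew inside the block."""

    def __init__(self, allocator: FakeAllocator, test_name: str) -> None:
        self.allocator = allocator
        self.test_name = test_name
        self.before = 0

    def __enter__(self) -> "LeakCheck":
        # Snapshot the counter before the test body runs.
        self.before = self.allocator.allocated
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is not None:
            # Let the original test failure propagate instead of masking it.
            return False
        after = self.allocator.allocated
        if after > self.before:
            raise RuntimeError(
                f"leak in {self.test_name}: allocated memory was "
                f"{self.before} and is now {after} (+{after - self.before} bytes)"
            )
        return False
```

With the figures from the log, the caching allocator grew by 29184 - 19456 = 9728 bytes and the driver-reported total grew by 358809600 - 356712448 = 2097152 bytes (2 MiB).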

The test was added in #161680.

Pull Request resolved: #165147
Approved by: https://github.com/bbeckca
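How the test was disabled is not shown in this thread. One common way to skip a test only in a particular CI configuration is a conditional `unittest` skip keyed on an environment variable. The sketch below assumes the slow-gradcheck job is identified by `PYTORCH_TEST_WITH_SLOW_GRADCHECK=1`, as the repro command in the log suggests. The helper names and the test body are hypothetical.

```python
import os
import unittest


def slow_gradcheck_enabled(environ=os.environ) -> bool:
    """Assumption: the slow-gradcheck CI job sets this variable to '1'."""
    return environ.get("PYTORCH_TEST_WITH_SLOW_GRADCHECK", "0") == "1"


def build_case(skip: bool):
    """Build a test case whose single test is skipped when `skip` is True."""

    class WoqExample(unittest.TestCase):
        @unittest.skipIf(skip, "disabled under slow gradcheck (hypothetical reason)")
        def test_int8_woq_mm(self):
            # Placeholder body; the real test compares compiled and eager results.
            self.assertEqual(1 + 1, 2)

    return WoqExample


def run_case(skip: bool) -> unittest.TestResult:
    """Run the example case and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(build_case(skip))
    result = unittest.TestResult()
    suite.run(result)
    return result
```

In a real test file the decorator argument would be `slow_gradcheck_enabled()` evaluated at import time, so the test still runs in every other CI configuration.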

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request


          Disable failing test_int8_woq_mm_cuda on slow grad check (pytorch#165147)

747a697

Labels

ciflow/trunk fb-exported Merged meta-exported module: inductor topic: not user facing