
Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests #163956

Closed

atalman wants to merge 5 commits into pytorch:main from atalman:workaround_cuda_driver_update

Conversation

@atalman
Contributor

@atalman atalman commented Sep 26, 2025

Workaround for #163658

It looks like the workflow passes on the 12.8 build, which uses linux.g4dn.4xlarge.nvidia.gpu, but fails on the 12.6 builds, which use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470
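
For diagnosing this class of failure, a quick check in the job can confirm which GPU and driver a given runner actually exposes. This is a minimal sketch; the nvidia-smi query flags are standard, but where such a step would run in the workflow is an assumption:

    import subprocess

    import torch

    # Ask the driver which GPU and driver version this runner exposes.
    # linux.4xlarge.nvidia.gpu machines have a Tesla M60;
    # linux.g4dn.4xlarge.nvidia.gpu machines have a T4.
    smi = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(smi.stdout.strip())

    # Cross-check against the CUDA runtime the installed PyTorch wheel was built for.
    print("torch CUDA runtime:", torch.version.cuda)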

@atalman atalman requested a review from a team as a code owner September 26, 2025 14:17
@pytorch-bot

pytorch-bot bot commented Sep 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163956

Note: Links to docs will display an error until the docs builds have been completed.

❌ 11 New Failures, 8 Unrelated Failures

As of commit 806008b with merge base d4e4f70:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Sep 26, 2025
@atalman atalman added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Sep 26, 2025
@atalman atalman added the ci-no-td Do not run TD on this PR label Sep 26, 2025
@tinglvv
Collaborator

tinglvv commented Sep 26, 2025

The test fails because the build targets sm_52 (Tesla M60), the architecture of the linux.4xlarge.nvidia.gpu runner used previously. We would need to change cuda_arch_list for the 12.4 build to sm_75, since linux.g4dn.4xlarge.nvidia.gpu has an sm_75 GPU (T4).
https://github.com/pytorch/pytorch/actions/runs/18040454536/job/51350093697

    Found GPU0 Tesla T4 which is of cuda capability 7.5.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (5.2) - (5.2)
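
The mismatch can be verified directly from Python. This is a minimal sketch of the comparison PyTorch performs at startup, using the standard torch.cuda APIs:

    import torch

    # Architectures the installed wheel was compiled for, e.g. ['sm_52'].
    compiled_archs = torch.cuda.get_arch_list()

    # Compute capability of the GPU on this runner, e.g. (7, 5) for a T4.
    major, minor = torch.cuda.get_device_capability(0)

    if f"sm_{major}{minor}" not in compiled_archs:
        print(f"GPU is sm_{major}{minor}, but the build only covers {compiled_archs}")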

@atalman atalman force-pushed the workaround_cuda_driver_update branch from e9f9d19 to 878d9cf on September 29, 2025 13:18
@atalman
Contributor Author

atalman commented Sep 29, 2025

@pytorchmergebot merge -f "periodic failures are pre-existing"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort, and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@atalman
Contributor Author

atalman commented Sep 29, 2025

@pytorchbot cherry-pick --onto release/2.9 --fixes "Critical CI fix" -c critical

pytorchbot pushed a commit that referenced this pull request Sep 29, 2025
Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956)

Workaround for #163658

It looks like the workflow passes on the 12.8 build, which uses linux.g4dn.4xlarge.nvidia.gpu, but fails on the 12.6 builds, which use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470

Pull Request resolved: #163956
Approved by: https://github.com/malfet

Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
(cherry picked from commit 349c960)
@pytorchbot
Collaborator

Cherry picking #163956

The cherry-pick PR is at #164172 and is linked with issue Critical CI fix. The following tracker issues are updated:

Details for Dev Infra team (raised by workflow job)

atalman added a commit that referenced this pull request Sep 29, 2025
Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956) (#164172)

Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956)

Workaround for #163658

It looks like the workflow passes on the 12.8 build, which uses linux.g4dn.4xlarge.nvidia.gpu, but fails on the 12.6 builds, which use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470

Pull Request resolved: #163956
Approved by: https://github.com/malfet


(cherry picked from commit 349c960)

Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>