
Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests #163956

Closed

atalman wants to merge 5 commits into pytorch:main from atalman:workaround_cuda_driver_update

Conversation

@atalman
Contributor

@atalman atalman commented Sep 26, 2025

Workaround for #163658

It looks like the workflow passes on the 12.8 build, which uses linux.g4dn.4xlarge.nvidia.gpu, but fails on the 12.6 builds, which use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470
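
For diagnosing this class of failure, a quick check in the job can confirm which GPU and driver a given runner actually exposes. This is a minimal sketch; the nvidia-smi query flags are standard, but where such a step would run in the workflow is an assumption:

    import subprocess

    import torch

    # Ask the driver which GPU and driver version this runner exposes.
    # linux.4xlarge.nvidia.gpu machines have a Tesla M60;
    # linux.g4dn.4xlarge.nvidia.gpu machines have a T4.
    smi = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(smi.stdout.strip())

    # Cross-check against the CUDA runtime the installed PyTorch wheel was built for.
    print("torch CUDA runtime:", torch.version.cuda)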

@atalman atalman requested a review from a team as a code owner September 26, 2025 14:17
@pytorch-bot

pytorch-bot bot commented Sep 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163956

Note: Links to docs will display an error until the docs builds have been completed.

❌ 11 New Failures, 8 Unrelated Failures

As of commit 806008b with merge base d4e4f70:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Sep 26, 2025
@atalman atalman added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Sep 26, 2025
@atalman atalman added the ci-no-td Do not run TD on this PR label Sep 26, 2025
@tinglvv
Collaborator

tinglvv commented Sep 26, 2025

The test fails because the build targets sm_52 (Tesla M60), the architecture of the linux.4xlarge.nvidia.gpu runner used previously. We would need to change cuda_arch_list for the 12.4 build to sm_75, since linux.g4dn.4xlarge.nvidia.gpu has an sm_75 GPU (T4).
https://github.com/pytorch/pytorch/actions/runs/18040454536/job/51350093697

    Found GPU0 Tesla T4 which is of cuda capability 7.5.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (5.2) - (5.2)
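
The mismatch can be verified directly from Python. This is a minimal sketch of the comparison PyTorch performs at startup, using the standard torch.cuda APIs:

    import torch

    # Architectures the installed wheel was compiled for, e.g. ['sm_52'].
    compiled_archs = torch.cuda.get_arch_list()

    # Compute capability of the GPU on this runner, e.g. (7, 5) for a T4.
    major, minor = torch.cuda.get_device_capability(0)

    if f"sm_{major}{minor}" not in compiled_archs:
        print(f"GPU is sm_{major}{minor}, but the build only covers {compiled_archs}")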

@atalman atalman force-pushed the workaround_cuda_driver_update branch from e9f9d19 to 878d9cf on September 29, 2025 13:18
@atalman
Contributor Author

atalman commented Sep 29, 2025

@pytorchmergebot merge -f "periodic failures are pre-existing"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort, and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@atalman
Contributor Author

atalman commented Sep 29, 2025

@pytorchbot cherry-pick --onto release/2.9 --fixes "Critical CI fix" -c critical

pytorchbot pushed a commit that referenced this pull request Sep 29, 2025
Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956)

Workaround for #163658

It looks like the workflow passes on the 12.8 build, which uses linux.g4dn.4xlarge.nvidia.gpu, but fails on the 12.6 builds, which use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470

Pull Request resolved: #163956
Approved by: https://github.com/malfet

Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
(cherry picked from commit 349c960)
@pytorchbot
Collaborator

Cherry picking #163956

The cherry-pick PR is at #164172 and is linked with issue Critical CI fix. The following tracker issues are updated:

Details for Dev Infra team (raised by workflow job)

atalman added a commit that referenced this pull request Sep 29, 2025
Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956) (#164172)

Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956)

Workaround for #163658

It looks like the workflow passes on the 12.8 build, which uses linux.g4dn.4xlarge.nvidia.gpu, but fails on the 12.6 builds, which use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470

Pull Request resolved: #163956
Approved by: https://github.com/malfet


(cherry picked from commit 349c960)

Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>