
Conversation

@atalman
Contributor

@atalman atalman commented Sep 9, 2025

Please see: https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/

Looks like our CUDA 13.0 CI tests are passing; however, the driver was not updated in CI:
https://github.com/pytorch/pytorch/actions/runs/17577200229/job/49928053389#step:13:577

Please see:

/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 12080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org/ to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.cpp:119.)
  return torch._C._cuda_getDeviceCount() > 0
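
For reference, a quick way to see the driver API version this warning reports (12080 here, i.e. a 12.8 driver, while CUDA 13.0 corresponds to 13000) is to query it directly. This is only a sketch and assumes libcuda.so.1 is on the loader path:

import ctypes
import torch

# Ask the driver itself (cuDriverGetVersion); this is the number the
# warning prints, e.g. 12080 == CUDA 12.8, 13000 == CUDA 13.0.
libcuda = ctypes.CDLL("libcuda.so.1")
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))
print("driver API version:", version.value)

# PyTorch's side of the check: is_available() ends up returning False
# (with the warning above) when the driver is older than the CUDA
# runtime the wheel was built against.
print("torch.version.cuda:", torch.version.cuda)
print("torch.cuda.is_available():", torch.cuda.is_available())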

@atalman atalman requested a review from a team as a code owner September 9, 2025 21:29
@atalman atalman added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/inductor labels Sep 9, 2025
@pytorch-bot

pytorch-bot bot commented Sep 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162531

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 4 Unrelated Failures

As of commit 46a7278 with merge base e900a27:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Sep 9, 2025
@huydhn huydhn added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 9, 2025
@huydhn
Contributor

huydhn commented Sep 10, 2025

cc @malfet You would need this to be able to start adding tests for CUDA 13.0 on CI

@atalman
Contributor Author

atalman commented Sep 12, 2025

@huydhn yes this is needed

@atalman atalman added the keep-going Don't stop on first failure, keep running tests until the end label Sep 12, 2025
@atalman
Contributor Author

atalman commented Sep 12, 2025

These failures look legit:
periodic / linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 2, 8, lf.linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck) (gh)
'test/test_numba_integration.py::TestNumbaIntegration::test_array_adaptor'
periodic / linux-jammy-cuda12.8-py3.10-gcc9-debug / test (default, 7, 7, lf.linux.4xlarge.nvidia.gpu, oncall:debug-build) (gh)
'test/test_numba_integration.py::TestNumbaIntegration::test_array_adaptor'
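
For context, test_array_adaptor hands a CUDA tensor to Numba via numba.cuda.as_cuda_array, so the failure sits in the torch/numba interop path. A minimal sketch of that interop, assuming a CUDA-capable GPU and a numba install:

import numba.cuda
import torch

# torch tensors expose __cuda_array_interface__, which numba consumes
# to build a zero-copy device-array view of the same memory.
t = torch.arange(4, device="cuda")
view = numba.cuda.as_cuda_array(t)
print(view.copy_to_host())  # round-trips through numba's driver bindings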

@atalman atalman added the ci-no-td Do not run TD on this PR label Sep 12, 2025
@atalman
Contributor Author

atalman commented Sep 12, 2025

@pytorchmergebot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased atalman-patch-2 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout atalman-patch-2 && git pull --rebase)

@atalman
Contributor Author

atalman commented Sep 12, 2025

Looks like this is an issue:

test_numba_integration.py::TestNumbaIntegration::test_active_device SKIPPED [0.0003s] (No multigpu) [ 12%]
test_numba_integration.py::TestNumbaIntegration::test_array_adaptor Fatal Python error: Segmentation fault

Current thread 0x00007f971bd99440 (most recent call first):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 318 in safe_cuda_api_call
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 497 in __enter__
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 121 in ensure_context
  File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135 in __enter__
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 231 in _require_cuda_context
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda/api.py", line 76 in as_cuda_array
  File "/var/lib/jenkins/workspace/test/test_numba_integration.py", line 140 in test_array_adaptor
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3223 in wrapper
  File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 549 in _callTestMethod
  File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 591 in run
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3375 in _run_custom
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3405 in run
  File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 650 in __call__
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/unittest.py", line 333 in runtest
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/runner.py", line 341 in from_call
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pytest_rerunfailures.py", line 549 in pytest_runtest_protocol
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/main.py", line 323 in _main
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/_pytest/config/__init__.py", line 166 in main
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1298 in run_tests
  File "/var/lib/jenkins/workspace/test/test_numba_integration.py", line 399 in <module>

@huydhn
Contributor

huydhn commented Sep 12, 2025

This happened to me in the past until I found a stable driver. Why don't we try 580.82.07, as mentioned in the CUDA 13.0 release notes (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)?

@atalman atalman requested a review from jeffdaily as a code owner September 12, 2025 19:16
@atalman atalman changed the title Update Nvidia driver to CUDA 13.0 compatible 580.65.06 Update Nvidia driver to CUDA 13.0 compatible 580.82.07 Sep 12, 2025
@atalman
Contributor Author

atalman commented Sep 13, 2025

malfet pushed a commit to pytorch/test-infra that referenced this pull request Sep 15, 2025
This updates the nvidia driver to `580.82.07` to add support for CUDA 13.0 runtime. This is similar to pytorch/pytorch#162531 but for our entire fleet.
@seemethere
Member

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased atalman-patch-2 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout atalman-patch-2 && git pull --rebase)

@malfet malfet added the no-runner-experiments Bypass Meta/LF runner determinator label Sep 16, 2025
@malfet
Contributor

malfet commented Sep 17, 2025

Closing in favor of #163111

@malfet malfet closed this Sep 17, 2025
@github-actions github-actions bot deleted the atalman-patch-2 branch October 18, 2025 02:07