[CI][CUDA][Distributed] Update NCCL to 2.28.9 for CUDA13 by nWEIdia · Pull Request #168091 · pytorch/pytorch

nWEIdia · 2025-11-18T17:58:45Z

This PR updates the NCCL version for CUDA13 from 2.27.7 to 2.28.9.

2.28.9 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.9-1
2.28.7 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.7-1
2.28.3 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.3-1

CUDA 12 remains at 2.27.5 and is untouched by this PR.

cc @ezyang @gchanan @kadeng @msaroufim @eqy @ptrblck @tinglvv @Aidyn-A @malfet @atalman @ngimel @kwen2501 @Skylion007

Reference PR: #166174

pytorch-bot · 2025-11-18T17:58:50Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168091


❌ 15 New Failures, 25 Pending, 1 Unrelated Failure

As of commit 4666910 with merge base 1ee32a8:

NEW FAILURES - The following jobs have failed:

linux-binary-manywheel / manywheel-py3_10-xpu-test (gh)
Process completed with exit code 1.
linux-binary-manywheel / manywheel-py3_11-xpu-test (gh)
Process completed with exit code 1.
linux-binary-manywheel / manywheel-py3_12-xpu-test (gh)
Process completed with exit code 1.
linux-binary-manywheel / manywheel-py3_13-xpu-test (gh)
Process completed with exit code 1.
linux-binary-manywheel / manywheel-py3_13t-xpu-test (gh)
Process completed with exit code 1.
linux-binary-manywheel / manywheel-py3_14-xpu-test (gh)
Process completed with exit code 1.
linux-binary-manywheel / manywheel-py3_14t-xpu-test (gh)
Process completed with exit code 1.
trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx942.1) (gh)
test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile
windows-binary-wheel / wheel-py3_10-xpu-test (gh)
Process completed with exit code 1.
windows-binary-wheel / wheel-py3_11-xpu-test (gh)
Process completed with exit code 1.
windows-binary-wheel / wheel-py3_12-xpu-test (gh)
Process completed with exit code 1.
windows-binary-wheel / wheel-py3_13-xpu-test (gh)
Process completed with exit code 1.
windows-binary-wheel / wheel-py3_13t-xpu-test (gh)
Process completed with exit code 1.
windows-binary-wheel / wheel-py3_14-xpu-test (gh)
Process completed with exit code 1.
windows-binary-wheel / wheel-py3_14t-xpu-test (gh)
Process completed with exit code 1.

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1) (gh) (similar failure)
test/inductor/test_native_matmul.py::TestTritonDotReduction::test_matmul_fp16

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Skylion007 · 2025-11-18T18:35:48Z

Any reason not to update CUDA12 too?

Skylion007 · 2025-11-18T18:36:34Z

You need to create a test-infra PR to upload the binaries and link it to the issue. See my PR for NVSHMEM as an example.

Skylion007 · 2025-11-18T18:56:55Z

Example test infra PR: pytorch/test-infra#7415

nWEIdia · 2025-11-18T20:56:08Z

> Example test infra PR: pytorch/test-infra#7415

After taking a look, I would like to ask for @atalman's help, as I am still not sure how to do this.

Skylion007 · 2025-11-19T17:56:02Z

Just change line 281 to update the NCCL version field for cu13 in the same JSON file my previous PR modified. The binary will be uploaded automatically once that change is merged.

Skylion007 · 2025-11-19T18:10:20Z

CU12 should be updated to support: #168129

nWEIdia · 2025-11-24T18:05:52Z

@pytorchbot rebase

pytorchmergebot · 2025-11-24T18:07:24Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Skylion007 · 2025-11-25T19:19:32Z

@nWEIdia made a PR here: pytorch/test-infra#7516. It also uploads the NCCL CUDA 12 wheels, because I recommend upgrading at least the 12.9 version as well.

Skylion007 · 2025-11-25T21:12:17Z

Actually, never mind: according to @atalman, it's already updated through #168997.

Skylion007 · 2025-11-27T20:43:12Z

@pytorchbot rebase

pytorchmergebot · 2025-11-27T20:44:47Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

….28.9-1" (This PR will be rebased on #166174) (Another PR updates the NCCL version: #168091)

We did the following:
1. Added exchange of the buffer ptr and signal pad ptr via the NCCL device API introduced in NCCL 2.28.
2. With step 1 in place, showed that symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (added a UT for it).
3. Added a simple put, put with signal, wait for signal, and get, so that symmem's one-sided API works.
4. Showed in a UT that symmem from the NCCL backend also works with traditional c10d collectives.
5. Stored DevComm inside symmetric memory so that users can access it for customized kernels.

Resolves #167682 [ghstack-poisoned]

This PR updates the NCCL version for CUDA13 from 2.27.7 to 2.28.9.

2.28.9 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.9-1
2.28.7 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.7-1
2.28.3 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.3-1

CUDA 12 remains at 2.27.5 and is untouched by this PR.

Reference PR: #166174
Pull Request resolved: #168091
Approved by: https://github.com/atalman


kwen2501 · 2025-12-09T00:24:06Z

Hi @atalman @nWEIdia
Because the pin file is hard-coded to nccl-cu12.txt here, the upgrade in this PR, which touches CUDA 13 only, is effectively a no-op when building from source.

pytorch/tools/optional_submodules.py

Lines 30 to 35 in e51bd3c

    def read_nccl_pin() -> str:
        nccl_file = "nccl-cu12.txt"
        if os.getenv("DESIRED_CUDA", os.getenv("CUDA_VERSION", "")).startswith("11"):
            nccl_file = "nccl-cu11.txt"
        nccl_pin_path = repo_root / ".ci" / "docker" / "ci_commit_pins" / nccl_file
        return _read_file(nccl_pin_path)

Should we

1. fix this hard-coded default?
2. upgrade NCCL for CUDA 12 too?

Btw, looks like .ci/docker/common/install_nccl.sh is a better script?

pytorch/.ci/docker/common/install_nccl.sh

Lines 1 to 15 in e51bd3c

    #!/bin/bash

    set -ex

    NCCL_VERSION=""
    if [[ ${CUDA_VERSION:0:2} == "11" ]]; then
      NCCL_VERSION=$(cat ci_commit_pins/nccl-cu11.txt)
    elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
      NCCL_VERSION=$(cat ci_commit_pins/nccl-cu12.txt)
    elif [[ ${CUDA_VERSION:0:2} == "13" ]]; then
      NCCL_VERSION=$(cat ci_commit_pins/nccl-cu13.txt)
    else
      echo "Unexpected CUDA_VERSION ${CUDA_VERSION}"
      exit 1
    fi

nWEIdia · 2025-12-09T02:40:55Z

I recommend fixing the "build from source" logic. We agreed with @malfet, @atalman, and @ptrblck that the NCCL update to 2.28.9 would be for CUDA 13 only.
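
For illustration only, here is a minimal sketch of what a CUDA-aware pin lookup for source builds might look like. It is not the change that landed in PyTorch. It assumes the same `DESIRED_CUDA` / `CUDA_VERSION` environment variables and the same `.ci/docker/ci_commit_pins` layout quoted above, and it mirrors the two-character prefix check from `install_nccl.sh`. The helper name `select_nccl_pin_file`, the `repo_root` parameter, and the use of `Path.read_text()` in place of `_read_file` are hypothetical choices made for this example.

```python
import os
from pathlib import Path

# Hypothetical mapping from CUDA major version to pin file,
# mirroring the branches in .ci/docker/common/install_nccl.sh.
_NCCL_PIN_FILES = {
    "11": "nccl-cu11.txt",
    "12": "nccl-cu12.txt",
    "13": "nccl-cu13.txt",
}

# Assumption: keep the current default when no CUDA version is set or recognized.
_DEFAULT_PIN_FILE = "nccl-cu12.txt"


def select_nccl_pin_file(cuda_version: str) -> str:
    """Pick the pin file from the first two characters of the CUDA version."""
    return _NCCL_PIN_FILES.get(cuda_version[:2], _DEFAULT_PIN_FILE)


def read_nccl_pin(repo_root: Path) -> str:
    """Read the NCCL pin matching DESIRED_CUDA, falling back to CUDA_VERSION."""
    cuda_version = os.getenv("DESIRED_CUDA", os.getenv("CUDA_VERSION", ""))
    pin_file = select_nccl_pin_file(cuda_version)
    pin_path = repo_root / ".ci" / "docker" / "ci_commit_pins" / pin_file
    # Stand-in for the repository's _read_file helper.
    return pin_path.read_text().strip()
```

Unlike the shell script, which exits on an unexpected version, this sketch falls back to the CUDA 12 pin so that builds with neither variable set behave as they do today. Whether a fallback or a hard error is preferable is a design decision for the maintainers.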

…68129) (This PR will be rebased on #166174) (Another PR updates the NCCL version: #168091)

We did the following:
1. Added exchange of the buffer ptr and signal pad ptr via the NCCL device API introduced in NCCL 2.28.
2. With step 1 in place, showed that symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (added a UT for it).
3. Added a simple put, put with signal, wait for signal, and get, so that symmem's one-sided API works.
4. Showed in a UT that symmem from the NCCL backend also works with traditional c10d collectives.
5. Stored DevComm inside symmetric memory so that users can access it for customized kernels.

Resolves #167682
Pull Request resolved: #168129
Approved by: https://github.com/kwen2501, https://github.com/ngimel, https://github.com/atalman
