[CI][CUDA][Distributed] Update NCCL to 2.28.9 for CUDA13 #168091
nWEIdia wants to merge 7 commits into pytorch:main
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168091
Note: Links to docs will display an error until the docs builds have been completed.
❌ 15 New Failures, 25 Pending, 1 Unrelated Failure
As of commit 4666910 with merge base 1ee32a8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.

Any reason not to update CUDA12 too?

You need to create a test-infra PR to upload the binaries and link it to the issue. See my PR for NVSHMEM as an example.

Example test-infra PR: pytorch/test-infra#7415

After taking a look, I would like to ask for @atalman's help, as I am still not sure how to do this.

Just change line 281 to update the NCCL version field for cu13, in the same JSON file my previous PR modified. The binary will be uploaded automatically once it's merged.

CU12 should be updated to support #168129.

@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@nWEIdia I made a PR here: pytorch/test-infra#7516. It also uploads the NCCL CUDA12 wheels, because I recommend upgrading at least the 12.9 version as well.

@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
….28.9-1"
(This PR will be rebased on #166174.)
(Another PR updates the NCCL version: #168091.)
We did the following:
1. Added the exchange of the buffer pointer and signal pad pointer via the NCCL device API introduced in NCCL 2.28.
2. Building on step 1, showed that symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (added a UT for it).
3. Added simple put, put with signal, wait for signal, and get operations, so that symmem's one-sided API works.
4. Showed in a UT that symmem from the NCCL backend also works with traditional c10d collectives.
5. Stored DevComm inside symmetric memory so that users can access it for customized kernels.
Resolves #167682
[ghstack-poisoned]
This PR updates the NCCL version for CUDA13 from 2.27.7 to 2.28.9.
2.28.9 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.9-1
2.28.7 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.7-1
2.28.3 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.3-1
CUDA 12 remains at 2.27.5 and is untouched by this PR.
Reference PR: #166174
Pull Request resolved: #168091
Approved by: https://github.com/atalman

Hi @atalman @nWEIdia
pytorch/tools/optional_submodules.py, lines 30 to 35 in e51bd3c
Should we
Btw, it looks like .ci/docker/common/install_nccl.sh is a better script?
pytorch/.ci/docker/common/install_nccl.sh, lines 1 to 15 in e51bd3c

…68129)
(This PR will be rebased on #166174.)
(Another PR updates the NCCL version: #168091.)
We did the following:
1. Added the exchange of the buffer pointer and signal pad pointer via the NCCL device API introduced in NCCL 2.28.
2. Building on step 1, showed that symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (added a UT for it).
3. Added simple put, put with signal, wait for signal, and get operations, so that symmem's one-sided API works.
4. Showed in a UT that symmem from the NCCL backend also works with traditional c10d collectives.
5. Stored DevComm inside symmetric memory so that users can access it for customized kernels.
Resolves #167682
Pull Request resolved: #168129
Approved by: https://github.com/kwen2501, https://github.com/ngimel, https://github.com/atalman
…68129) (cherry picked from commit c907c77)
This PR updates the NCCL version for CUDA13 from 2.27.7 to 2.28.9.
2.28.9 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.9-1
2.28.7 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.7-1
2.28.3 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.3-1
CUDA 12 remains at 2.27.5 and is untouched by this PR.
cc @ezyang @gchanan @kadeng @msaroufim @eqy @ptrblck @tinglvv @Aidyn-A @malfet @atalman @ngimel @kwen2501 @Skylion007
Reference PR: #166174
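
The version split described above (CUDA 13 moving to 2.28.9 while CUDA 12 stays at 2.27.5) matters for code that relies on the NCCL device API, which the cross-referenced commit messages say was introduced in NCCL 2.28. A minimal sketch of gating on the pinned version follows. The helper names (`parse_nccl_tag`, `has_device_api`) and the `PINNED` table are hypothetical and written for illustration only; they are not taken from the PyTorch repository, and the 2.28.0 threshold is an assumption based on the "introduced in nccl 2.28" wording.

```python
import re
from typing import Tuple

# Hypothetical helpers for illustration; not part of PyTorch or NCCL.
# Accepts release tags such as "v2.28.9-1" and plain versions such as "2.27.5".
_TAG_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-(\d+))?$")


def parse_nccl_tag(tag: str) -> Tuple[int, int, int]:
    """Parse an NCCL tag or version string into (major, minor, patch)."""
    match = _TAG_RE.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognized NCCL version tag: {tag!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    return major, minor, patch


# Assumption: the device API is available from 2.28.0 onward, per the
# "introduced in nccl 2.28" wording in the cross-referenced commit messages.
DEVICE_API_MIN = (2, 28, 0)


def has_device_api(tag: str) -> bool:
    """Return True if the given NCCL version is at least DEVICE_API_MIN."""
    return parse_nccl_tag(tag) >= DEVICE_API_MIN


# Versions quoted in the PR description; the keys are made-up labels.
PINNED = {
    "cuda13_before": "2.27.7",
    "cuda13_after": "v2.28.9-1",
    "cuda12": "2.27.5",
}

if __name__ == "__main__":
    for label, tag in PINNED.items():
        print(label, parse_nccl_tag(tag), has_device_api(tag))
```

Under these assumptions, only the CUDA 13 pin after this PR passes the check, which is consistent with the request in the discussion to update CU12 as well to support #168129.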