
[CI][CUDA][Distributed] Update NCCL to 2.28.9 for CUDA13 #168091

Closed
nWEIdia wants to merge 7 commits into pytorch:main from nWEIdia:main-bump-NCCL-to-2.28.9

Conversation

@nWEIdia
Collaborator

@nWEIdia nWEIdia commented Nov 18, 2025

This PR updates the NCCL version for CUDA13 from 2.27.7 to 2.28.9.

2.28.9 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.9-1
2.28.7 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.7-1
2.28.3 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.3-1

CUDA 12 remains at 2.27.5 and is untouched by this PR.
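
For anyone who wants to check which of the versions above clear a given minimum, here is a minimal sketch. It assumes version strings look like the release tags linked above (`v2.28.9-1`) or a bare `2.27.5`. The names `PINS`, `parse_nccl_tag`, and `meets_minimum` are hypothetical and do not exist in the repo. The default minimum of 2.28 is taken from the commit message referenced later in this thread, which says the NCCL device API was introduced in NCCL 2.28.

```python
import re
from typing import Dict, Tuple

# Versions quoted in this PR description. The dict and its keys are
# illustrative assumptions, not the contents of any file in the repo.
PINS: Dict[str, str] = {"cu12": "2.27.5", "cu13": "v2.28.9-1"}


def parse_nccl_tag(tag: str) -> Tuple[int, int, int]:
    """Turn 'v2.28.9-1' or '2.27.5' into (major, minor, patch)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"Unrecognized NCCL version string: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def meets_minimum(tag: str, minimum: Tuple[int, int, int] = (2, 28, 0)) -> bool:
    """Return True if the parsed version is at least `minimum`."""
    return parse_nccl_tag(tag) >= minimum


if __name__ == "__main__":
    for name, tag in PINS.items():
        print(name, parse_nccl_tag(tag), meets_minimum(tag))
```

With the versions from this description, only the CUDA 13 entry meets the 2.28 minimum.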

cc @ezyang @gchanan @kadeng @msaroufim @eqy @ptrblck @tinglvv @Aidyn-A @malfet @atalman @ngimel @kwen2501 @Skylion007

Reference PR: #166174

@nWEIdia nWEIdia requested review from a team and jeffdaily as code owners November 18, 2025 17:58
@pytorch-bot

pytorch-bot bot commented Nov 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168091

Note: Links to docs will display an error until the docs builds have been completed.

❌ 15 New Failures, 25 Pending, 1 Unrelated Failure

As of commit 4666910 with merge base 1ee32a8:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Skylion007
Collaborator

Any reason not to update CUDA12 too?

@Skylion007
Collaborator

You need to create a test-infra PR to upload the binaries and link it to the issue. See my PR for NVSHMEM as an example.

@Skylion007
Collaborator

Example test infra PR: pytorch/test-infra#7415

@nWEIdia
Collaborator Author

nWEIdia commented Nov 18, 2025

> Example test infra PR: pytorch/test-infra#7415

After taking a look, I would like to ask for @atalman's help, as I am still not sure how to do this.

@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Nov 19, 2025
@Skylion007
Collaborator

Skylion007 commented Nov 19, 2025

Just change line 281 to update the version field for the cu13 NCCL entry, in the same JSON file my previous PR modified. The binary will be uploaded automatically once the change is merged.

@Skylion007
Collaborator

CU12 should be updated to support: #168129

@nWEIdia
Collaborator Author

nWEIdia commented Nov 24, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@nWEIdia nWEIdia added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Nov 24, 2025
@Skylion007
Collaborator

Skylion007 commented Nov 25, 2025

@nWEIdia made a PR here: pytorch/test-infra#7516. This also uploads the NCCL CUDA 12 wheels, because I recommend upgrading at least the 12.9 version as well.

@Skylion007
Collaborator

Actually, nvm it's already updated according to @atalman through: #168997

@Skylion007
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

fduwjj added a commit that referenced this pull request Dec 8, 2025
….28.9-1"


(This PR will be rebased on #166174.) (Another PR also updates the NCCL version: #168091.)

We did the following:
1. Add exchange of the buffer pointer and signal pad pointer via the NCCL device API introduced in NCCL 2.28.
2. With item 1, show that symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (a unit test is added for it).
3. Add a simple put, put with signal, wait for signal, and get, so that symmem's one-sided API works.
4. Show in a unit test that symmem from the NCCL backend also works with traditional c10d collectives.
5. Store DevComm inside symmetric memory so that users can access it for customized kernels.

Resolves #167682

[ghstack-poisoned]
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
Pull Request resolved: #168091
Approved by: https://github.com/atalman
@kwen2501
Collaborator

kwen2501 commented Dec 9, 2025

Hi @atalman @nWEIdia
Because the pin file is hard-coded to nccl-cu12.txt here, the upgrade in this PR, which touches CUDA 13 only, is basically a no-op when building from source.

def read_nccl_pin() -> str:
    nccl_file = "nccl-cu12.txt"
    if os.getenv("DESIRED_CUDA", os.getenv("CUDA_VERSION", "")).startswith("11"):
        nccl_file = "nccl-cu11.txt"
    nccl_pin_path = repo_root / ".ci" / "docker" / "ci_commit_pins" / nccl_file
    return _read_file(nccl_pin_path)

Should we

  • fix this hard-coded pin file
  • upgrade NCCL for CUDA 12 too?

Btw, looks like .ci/docker/common/install_nccl.sh is a better script?

#!/bin/bash
set -ex
NCCL_VERSION=""
if [[ ${CUDA_VERSION:0:2} == "11" ]]; then
  NCCL_VERSION=$(cat ci_commit_pins/nccl-cu11.txt)
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
  NCCL_VERSION=$(cat ci_commit_pins/nccl-cu12.txt)
elif [[ ${CUDA_VERSION:0:2} == "13" ]]; then
  NCCL_VERSION=$(cat ci_commit_pins/nccl-cu13.txt)
else
  echo "Unexpected CUDA_VERSION ${CUDA_VERSION}"
  exit 1
fi
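
One possible shape for fixing the hard-coded file name, mirroring the branches of the shell script above, is sketched below. This is not the actual fix and has not been run against the repo. The names `NCCL_PIN_FILES`, `select_nccl_pin_file`, and `nccl_pin_file_from_env` are hypothetical, and falling back to nccl-cu12.txt when neither environment variable is set is an assumption made only to preserve the current behavior.

```python
import os

# Pin file names as they appear in this thread. The mapping itself is an
# illustrative assumption, not code from the repo.
NCCL_PIN_FILES = {
    "11": "nccl-cu11.txt",
    "12": "nccl-cu12.txt",
    "13": "nccl-cu13.txt",
}


def select_nccl_pin_file(cuda_version: str, default: str = "nccl-cu12.txt") -> str:
    """Pick the pin file from the first two characters of the CUDA version."""
    if not cuda_version:
        # Assumption: keep today's behavior when no CUDA version is given.
        return default
    major = cuda_version[:2]
    if major not in NCCL_PIN_FILES:
        # Same idea as the 'else' branch of the shell script: fail loudly.
        raise ValueError(f"Unexpected CUDA version {cuda_version!r}")
    return NCCL_PIN_FILES[major]


def nccl_pin_file_from_env() -> str:
    """Read DESIRED_CUDA, then CUDA_VERSION, as the quoted Python snippet does."""
    version = os.getenv("DESIRED_CUDA", os.getenv("CUDA_VERSION", ""))
    return select_nccl_pin_file(version)
```

Under these assumptions, `select_nccl_pin_file("13.0")` picks nccl-cu13.txt, an empty version string falls back to the default, and an unknown major version raises instead of silently using the CUDA 12 pin.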

@nWEIdia
Collaborator Author

nWEIdia commented Dec 9, 2025

I recommend fixing the "build from src" logic. It was discussed with @malfet, @atalman, and @ptrblck that we would do a CUDA 13-only NCCL update to 2.28.9.

pytorchmergebot pushed a commit that referenced this pull request Dec 9, 2025
…68129)

Resolves #167682
Pull Request resolved: #168129
Approved by: https://github.com/kwen2501, https://github.com/ngimel, https://github.com/atalman
pytorchbot pushed a commit that referenced this pull request Dec 13, 2025
…68129)

Resolves #167682
Pull Request resolved: #168129
Approved by: https://github.com/kwen2501, https://github.com/ngimel, https://github.com/atalman

(cherry picked from commit c907c77)

Labels

ciflow/binaries (Trigger all binary build and upload jobs on the PR)
ciflow/inductor
ciflow/inductor-micro-benchmark
ciflow/trunk (Trigger trunk jobs on your pull request)
high priority
Merged
open source
topic: not user facing (topic category)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)

Projects

Status: Done

7 participants