Updates NCCL to 2.17.1 #97407
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97407
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 Failures
As of commit 22141bc: NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@ptrblck to review and green light the PR.
Thanks much for creating the PR!
@weiwangmeta should we also update internally?
Thanks for tagging! I will check internal version and align with this PR via 3rd party update process.
@ngimel fyi that the other internal team will take care of NCCL submodule updates according to their roadmap. Merging of this PR can be independent of internal NCCL submodule update.
@pytorchbot merge
This PR updates submodules third_party/nccl/nccl
If those updates are intentional, please add "submodule" keyword to PR title/description.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:
Details for Dev Infra team
Raised by workflow job
@pytorchbot merge --help
❌ 🤖 pytorchbot command failed: Try
@pytorchbot merge -r
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased |
380ec0e to 22141bc
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "looks like it broke inductor distributed tests https://hud.pytorch.org/pytorch/pytorch/commit/b113a09ef90decbc703722bfdc2064fc5eb54a19#12344853677" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
@syed-ahmed your PR has been successfully reverted.
This reverts commit b113a09. Reverted #97407 on behalf of https://github.com/clee2000 due to looks like it broke inductor distributed tests https://hud.pytorch.org/pytorch/pytorch/commit/b113a09ef90decbc703722bfdc2064fc5eb54a19#12344853677
@clee2000 I don't think no signal is the right description here: I've issued Also, why there is
The newly opened PR uses the same commit and got tagged with ciflow/periodic, so the periodic jobs show up on hud for this PR as well. The periodic jobs I see on hud finished at 9pm, which is after I reverted this PR |
Re-open of #97407.
NCCL 2.17.1 sometimes fails to send a FIN packet and causes hangs. This PR updates NCCL to 2.17.1 that includes a patch for socket shutdown. NCCL 2.18 will also include this patch.
Pull Request resolved: #97843
Approved by: https://github.com/kwen2501
This PR updates the NCCL submodule to 2.17.1.
Closes NVIDIA/nccl#750
cc: @ptrblck @crcrpar
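For anyone who wants to confirm which NCCL version a given PyTorch build ended up with after a submodule bump like this one, a minimal sketch follows. It is not part of this PR. The helper names (`parse_version`, `at_least`, `bundled_nccl_version`) and the `TARGET` constant are hypothetical, and the lookup assumes a CUDA build in which `torch.cuda.nccl.version()` returns a tuple of integers; when that assumption does not hold, the helper returns `None` rather than guessing.

```python
from typing import Optional, Sequence, Tuple

# Version named in the PR title; used here only as a comparison target.
TARGET = (2, 17, 1)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted string such as "2.17.1" into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def at_least(found: Sequence[int], minimum: Sequence[int]) -> bool:
    """Compare on (major, minor, patch) only, ignoring any extra fields."""
    return tuple(found[:3]) >= tuple(minimum[:3])


def bundled_nccl_version() -> Optional[Tuple[int, ...]]:
    """Best-effort lookup; returns None if torch or NCCL is unavailable."""
    try:
        from torch.cuda import nccl  # assumption: CUDA-enabled PyTorch build
        raw = nccl.version()
    except Exception:
        return None
    if not isinstance(raw, tuple):
        # Assumption: non-tuple return formats are not handled in this sketch.
        return None
    # Keep only integer fields in case a build appends a non-numeric suffix.
    return tuple(part for part in raw if isinstance(part, int))[:3]


if __name__ == "__main__":
    found = bundled_nccl_version()
    if found is None:
        print("NCCL version not available in this environment")
    else:
        label = ".".join(str(part) for part in found)
        print("NCCL", label, "meets target:", at_least(found, TARGET))
```

The comparison is kept separate from the lookup so it can be reused on a version string taken from elsewhere, for example `at_least(parse_version("2.18.0"), TARGET)`.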