@syed-ahmed
Collaborator

syed-ahmed commented Mar 23, 2023

This PR updates the NCCL submodule to 2.17.1.
Closes NVIDIA/nccl#750

cc: @ptrblck @crcrpar

@pytorch-bot

pytorch-bot bot commented Mar 23, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97407

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 Failures

As of commit 22141bc:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "topic: not user facing" (topic category) label Mar 23, 2023
@syed-ahmed
Collaborator Author

@ptrblck to review and green light the PR.

@kwen2501
Collaborator

Thanks much for creating the PR!

@ngimel
Collaborator

ngimel commented Mar 23, 2023

@weiwangmeta should we also update internally?

@weiwangmeta
Contributor

> @weiwangmeta should we also update internally?

Thanks for tagging! I will check the internal version and align it with this PR via the third-party update process.

@weiwangmeta
Contributor

> @weiwangmeta should we also update internally?

@ngimel FYI, another internal team will take care of NCCL submodule updates according to their roadmap. This PR can be merged independently of the internal NCCL submodule update.

@syed-ahmed
Collaborator Author

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Mar 28, 2023
@pytorchmergebot
Collaborator

This PR updates submodules third_party/nccl/nccl

If those updates are intentional, please add "submodule" keyword to PR title/description.

@malfet
Contributor

malfet commented Mar 28, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:
@pytorchbot rebase
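
The staleness rule in the message above can be sketched in Python. This is only an illustration, not the merge bot's actual code: the names `is_too_stale` and `STALE_AFTER` are hypothetical, and the only detail taken from the thread is the three-day threshold.

```python
from datetime import datetime, timedelta, timezone

# Threshold quoted in the bot message; the constant name is hypothetical.
STALE_AFTER = timedelta(days=3)


def is_too_stale(last_push: datetime, now: datetime) -> bool:
    """Return True when the last push is more than STALE_AFTER before `now`."""
    return now - last_push > STALE_AFTER


# Illustrative dates only: a push on Mar 23 checked on Mar 28 is rejected,
# while a push one day before the check is accepted.
check_time = datetime(2023, 3, 28, tzinfo=timezone.utc)
print(is_too_stale(datetime(2023, 3, 23, tzinfo=timezone.utc), check_time))
print(is_too_stale(datetime(2023, 3, 27, tzinfo=timezone.utc), check_time))
```

Under this reading, rebasing counts as a fresh push, which is why the bot suggests `@pytorchbot rebase` before retrying the merge.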

@malfet
Contributor

malfet commented Mar 28, 2023

@pytorchbot merge --help

@pytorch-bot

pytorch-bot bot commented Mar 28, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: --help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci} ...

Try @pytorchbot --help for more info.
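
The error above is what a subcommand parser prints when `--help` is defined only at the top level. The sketch below reproduces that behavior with Python's `argparse`. It is an assumption about how such a command line could be wired, not the bot's real implementation: `build_parser`, `try_parse`, and the `--rebase` long option are hypothetical names, and only the subcommand list and the `-r` flag come from this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose usage line resembles the one in the bot's reply."""
    parser = argparse.ArgumentParser(prog="@pytorchbot")
    sub = parser.add_subparsers(dest="command")
    for name in ("merge", "revert", "rebase", "label", "drci"):
        # add_help=False: the subcommand does not define -h/--help itself
        # (an assumption made here to reproduce the error shown above).
        cmd = sub.add_parser(name, add_help=False)
        if name == "merge":
            # Hypothetical stand-in for the "-r" flag used later in the thread.
            cmd.add_argument("-r", "--rebase", action="store_true")
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse exits (error or help)."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        return None


# "merge --help" is rejected as an unrecognized argument; "merge -r" parses.
print(try_parse(["merge", "--help"]))
print(try_parse(["merge", "-r"]))
```

With this layout, help is available only as `@pytorchbot --help`, which matches the hint in the bot's reply.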

@malfet
Contributor

malfet commented Mar 28, 2023

@pytorchbot merge -r

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased bump-nccl onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout bump-nccl && git pull --rebase)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@clee2000
Contributor

@pytorchbot revert -m "looks like it broke inductor distributed tests https://hud.pytorch.org/pytorch/pytorch/commit/b113a09ef90decbc703722bfdc2064fc5eb54a19#12344853677" -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@syed-ahmed your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Mar 28, 2023
@malfet
Contributor

malfet commented Mar 29, 2023

> @pytorchbot revert -m "looks like it broke inductor distributed tests https://hud.pytorch.org/pytorch/pytorch/commit/b113a09ef90decbc703722bfdc2064fc5eb54a19#12344853677" -c nosignal

@clee2000 I don't think nosignal is the right description here: I issued merge -r, and it proceeded with the merge despite the failures reported in https://hud.pytorch.org/pr/97407

Also, why is there a ciflow/periodic tag despite the lack of the label?

@clee2000
Contributor

> > @pytorchbot revert -m "looks like it broke inductor distributed tests https://hud.pytorch.org/pytorch/pytorch/commit/b113a09ef90decbc703722bfdc2064fc5eb54a19#12344853677" -c nosignal
>
> @clee2000 I don't think no signal is right description here: I've issued merge -r and it proceeded with the merge despite the failures reported in https://hud.pytorch.org/pr/97407
>
> Also, why there is ciflow/periodic tag despite the lack of label?

The newly opened PR uses the same commit and got tagged with ciflow/periodic, so the periodic jobs show up on hud for this PR as well. The periodic jobs I see on hud finished at 9 pm, which is after I reverted this PR.

pytorchmergebot pushed a commit that referenced this pull request Apr 17, 2023
Re-open of #97407. NCCL 2.17.1 sometimes fails to send a FIN packet, which causes hangs. This PR updates NCCL to a version of 2.17.1 that includes a patch for socket shutdown. NCCL 2.18 will also include this patch.
Pull Request resolved: #97843
Approved by: https://github.com/kwen2501
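
For readers unfamiliar with why a missing FIN can cause a hang, here is a generic TCP illustration in Python. It is not NCCL code, and the commit message does not say that this is the exact failure mode in NCCL; the helper names `read_until_eof` and `demo` are hypothetical. The sketch shows a reader that waits for end-of-stream, which arrives only after the sender half-closes the connection with `shutdown(SHUT_WR)`.

```python
import socket


def read_until_eof(conn: socket.socket) -> bytes:
    """Read from conn until the peer signals end-of-stream (recv returns b'')."""
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:  # empty read: the peer has shut down its sending side
            break
        chunks.append(data)
    return b"".join(chunks)


def demo() -> bytes:
    # Loopback TCP pair; port 0 lets the OS pick a free port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        with socket.create_connection(server.getsockname()) as client:
            conn, _ = server.accept()
            with conn:
                # Guard: without a FIN the read would block; time out instead.
                conn.settimeout(5.0)
                client.sendall(b"done")
                # Half-close: sends FIN, so the reader sees end-of-stream.
                client.shutdown(socket.SHUT_WR)
                return read_until_eof(conn)


print(demo())
```

If the `shutdown` call were skipped while the client socket stayed open, `read_until_eof` would block until the timeout fired, which is the general shape of hang that a missing FIN can produce.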

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
open source
Reverted
topic: not user facing (topic category)

Development

Successfully merging this pull request may close these issues.

NCCL Hang with CUDA_LAUNCH_BLOCKING=1

9 participants