[NCCL - reland] Explicitly abort NCCL Communicators on Process Group Destruction #40585
…Destruction This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL destructor. This should prevent pending NCCL communicators from blocking other CUDA ops. Differential Revision: [D22244873](https://our.internmc.facebook.com/intern/diff/D22244873/) [ghstack-poisoned]
…Destruction This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL destructor. This should prevent pending NCCL communicators from blocking other CUDA ops. Differential Revision: [D22244873](https://our.internmc.facebook.com/intern/diff/D22244873/) ghstack-source-id: 106633077 Pull Request resolved: #40585
…cess Group Destruction" This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL destructor. This should prevent pending NCCL communicators from blocking other CUDA ops. Differential Revision: [D22244873](https://our.internmc.facebook.com/intern/diff/D22244873/) [ghstack-poisoned]
…Destruction Pull Request resolved: #40585 This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL destructor. This should prevent pending NCCL communicators from blocking other CUDA ops. ghstack-source-id: 106988073 Differential Revision: [D22244873](https://our.internmc.facebook.com/intern/diff/D22244873/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22244873/)!
💊 CI failures summary and remediations. As of commit c937cab (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm.
This pull request has been merged in 49e12d8.
Stack from ghstack:
This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL
destructor. This should prevent pending NCCL communicators from blocking other CUDA ops.
Differential Revision: D22244873
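
The idea in the description can be sketched as a minimal, self-contained C++ example. This is not the PR's actual diff: `FakeComm`, `ProcessGroupSketch`, `addComms`, and `abortsAfterDestruction` are hypothetical names invented for illustration, and the abort is simulated with a counter. In a real wrapper the abort step would presumably call NCCL's `ncclCommAbort`. The sketch only shows the destructor walking the cached communicators and aborting any that are still live.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for an NCCL communicator wrapper (not PyTorch's class).
class FakeComm {
 public:
  explicit FakeComm(int* abortCounter) : abortCounter_(abortCounter) {}

  // Idempotent abort. A real wrapper would call ncclCommAbort() here
  // (assumption: this sketch only counts the calls).
  void abort() {
    if (aborted_) {
      return;
    }
    aborted_ = true;
    ++(*abortCounter_);
  }

  bool isAborted() const { return aborted_; }

 private:
  int* abortCounter_;  // not owned; must outlive this object
  bool aborted_ = false;
};

// Hypothetical process group that caches communicators per device key.
class ProcessGroupSketch {
 public:
  void addComms(const std::string& deviceKey,
                std::vector<std::shared_ptr<FakeComm>> comms) {
    devComms_[deviceKey] = std::move(comms);
  }

  // On destruction, abort every cached communicator so that none of them
  // is left pending after the group goes away.
  ~ProcessGroupSketch() {
    for (auto& entry : devComms_) {
      for (auto& comm : entry.second) {
        comm->abort();
      }
    }
  }

 private:
  std::unordered_map<std::string, std::vector<std::shared_ptr<FakeComm>>>
      devComms_;
};

// Builds a group with two communicators, destroys it, and returns the
// total number of abort calls that actually took effect.
int abortsAfterDestruction() {
  int aborts = 0;
  auto first = std::make_shared<FakeComm>(&aborts);
  auto second = std::make_shared<FakeComm>(&aborts);
  first->abort();  // aborted before destruction; must not be counted twice
  {
    ProcessGroupSketch pg;
    pg.addComms("0", {first, second});
  }  // pg's destructor runs here and aborts the remaining communicator
  assert(first->isAborted() && second->isAborted());
  return aborts;
}
```

Making `abort()` idempotent is a design choice in this sketch: a communicator may already have been aborted elsewhere, for example by error handling, so the destructor can call it unconditionally.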