-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[NCCL] Support Wait Timeout in ProcessGroupNCCL #40946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown. Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/) [ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit 19f3fa1 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 3 fixed upstream failures:These were probably caused by upstream breakages that were already fixed.
Please rebase on the
|
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown. Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/) [ghstack-poisoned]
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown. Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/) [ghstack-poisoned]
|
|
||
| // Same as calling synchronize() for NCCL work. | ||
| bool wait(std::chrono::milliseconds timeout = kUnsetTimeout) override; | ||
| bool wait(std::chrono::milliseconds timeout = kNoTimeout) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for using default timeout value for wait, but explicitly avoiding introduce an argument to ProcessGroupNCCL::WorkNCCL::synchronize()? Will default value also work for synchronize()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Briefly chatted with Hongyi about this - we were mainly aiming to change the wait API only without many other side-effects on user facing functions, so decided to keep synchronize as is. If you think it's better to have a timeout with default value for synchronize, we can do that as well.
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown. Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/) [ghstack-poisoned]
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown. Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/) [ghstack-poisoned]
| void ProcessGroupNCCL::WorkNCCL::synchronize() { | ||
| // Call Synchronize without a timeout. We use this method to avoid adding a | ||
| // timeout argument to the public synchronize API. | ||
| synchronizeInternal(kNoTimeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making synchronizeInternal() accepts explicit timeout
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown. Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/) [ghstack-poisoned]
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown. Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/) [ghstack-poisoned]
|
This pull request has been merged in edf3dc7. |
Stack from ghstack:
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.
Differential Revision: D22127898