Skip to content

Conversation

@osalpekar
Copy link
Member

@osalpekar osalpekar commented Jul 2, 2020

Stack from ghstack:

Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.

Differential Revision: D22127898

Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.

Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/)

[ghstack-poisoned]
@dr-ci
Copy link

dr-ci bot commented Jul 2, 2020

💊 CI failures summary and remediations

As of commit 19f3fa1 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 3 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch (expand for instructions)

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 25 times.

osalpekar added 2 commits July 2, 2020 16:37
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.

Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/)

[ghstack-poisoned]
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.

Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/)

[ghstack-poisoned]

// Same as calling synchronize() for NCCL work.
bool wait(std::chrono::milliseconds timeout = kUnsetTimeout) override;
bool wait(std::chrono::milliseconds timeout = kNoTimeout) override;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for using default timeout value for wait, but explicitly avoiding introduce an argument to ProcessGroupNCCL::WorkNCCL::synchronize()? Will default value also work for synchronize()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Briefly chatted with Hongyi about this - we were mainly aiming to change the wait API only without many other side-effects on user facing functions, so decided to keep synchronize as is. If you think it's better to have a timeout with default value for synchronize, we can do that as well.

Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.

Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/)

[ghstack-poisoned]
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.

Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/)

[ghstack-poisoned]
void ProcessGroupNCCL::WorkNCCL::synchronize() {
// Call Synchronize without a timeout. We use this method to avoid adding a
// timeout argument to the public synchronize API.
synchronizeInternal(kNoTimeout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making synchronizeInternal() accepts explicit timeout

Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.

Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/)

[ghstack-poisoned]
Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.

Differential Revision: [D22127898](https://our.internmc.facebook.com/intern/diff/D22127898/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in edf3dc7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants