Skip to content

Conversation

@osalpekar
Copy link
Member

@osalpekar osalpekar commented Jul 2, 2020

Stack from ghstack:

This adds a helper function to abort all NCCL Communicators associated with a WorkNCCL object

Differential Revision: D22127899

This adds a helper function to abort all NCCL Communicators associated with a WorkNCCL object

Differential Revision: [D22127899](https://our.internmc.facebook.com/intern/diff/D22127899/)

[ghstack-poisoned]
@dr-ci
Copy link

dr-ci bot commented Jul 2, 2020

💊 CI failures summary and remediations

As of commit 4af9f51 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 14 times.

@jiayisuse
Copy link
Contributor

Wondering if this helper function is being used in this stack

@osalpekar
Copy link
Member Author

Wondering if this helper function is being used in this stack

It was used in a prior version of #40946, but not anymore. Previously, we weren't overwriting the opTimeout_ with the user-supplied waitTimeout, so there was some separate logic to detect timed out WorkNCCL objects and abort associated communicators. Now that logic is simplified by just overwriting opTimeout_, though it still might be a good helper func to have.

osalpekar added 2 commits July 2, 2020 16:37
…nicators"

This adds a helper function to abort all NCCL Communicators associated with a WorkNCCL object

Differential Revision: [D22127899](https://our.internmc.facebook.com/intern/diff/D22127899/)

[ghstack-poisoned]
…nicators"

This adds a helper function to abort all NCCL Communicators associated with a WorkNCCL object

Differential Revision: [D22127899](https://our.internmc.facebook.com/intern/diff/D22127899/)

[ghstack-poisoned]
@mrshenli
Copy link
Contributor

mrshenli commented Jul 7, 2020

If this is not used, shall we move this PR out of this stack and only land it when necessary?

…nicators"

This adds a helper function to abort all NCCL Communicators associated with a WorkNCCL object

Differential Revision: [D22127899](https://our.internmc.facebook.com/intern/diff/D22127899/)

[ghstack-poisoned]
@osalpekar
Copy link
Member Author

Closing this PR since aborting NCCL Comms from the WorkNCCL-level is no longer necessary given the updated design for work timeouts

@osalpekar osalpekar closed this Jul 13, 2020
@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/49/head branch August 13, 2020 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants