Skip to content

Conversation

@pritamdamania87
Copy link
Contributor

#95715 added the functionality to abort ncclCommInitRankConfig by specifying blocking=0 to enable non-blocking behavior.

However, calling the pg._abort() didn't recover from a stuck ncclCommInitRankConfig since the _abort method only looked through devNCCLCommMap_ map and aborted those communicators. Since ncclCommInitRankConfig was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, _abort was a no-op.

To resolve this issue, I added the communicators to inProgressCommMap_ as soon as they were created and then removed them once added to devNCCLCommMap_.

I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp

@pytorch-bot
Copy link

pytorch-bot bot commented Jun 20, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103925

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5e367b5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Jun 20, 2023
@pritamdamania87
Copy link
Contributor Author

@osalpekar This is a resubmit of #103264 and CI is green. Is there some specific Sandcastle configuration that was causing issues?

@facebook-github-bot
Copy link
Contributor

@osalpekar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 21, 2023
@pritamdamania87
Copy link
Contributor Author

@osalpekar Wondering if you found any issues after importing the PR?

@osalpekar
Copy link
Member

@pritamdamania87 The flakiness is on our end internally. I'll stamp this and we can merge.

@pritamdamania87
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 22, 2023
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

Details for Dev Infra team Raised by workflow job

@pritamdamania87
Copy link
Contributor Author

@pytorchbot rebase

@pritamdamania87
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

Details for Dev Infra team Raised by workflow job

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorch#95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.

However, calling the `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig` since the `_abort` method only looked through `devNCCLCommMap_` map and aborted those communicators. Since `ncclCommInitRankConfig` was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.

To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`.

I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp
@pytorchmergebot
Copy link
Collaborator

Successfully rebased user/pdamania/nccl_abort onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout user/pdamania/nccl_abort && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the user/pdamania/nccl_abort branch from 35663c6 to 5e367b5 Compare June 23, 2023 04:52
@pritamdamania87
Copy link
Contributor Author

@pytorchbot merge

@pritamdamania87
Copy link
Contributor Author

@osalpekar Do we need to import again? :)

This Pull Request has changes that have not been imported to the Diff, please import the Pull Request

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team Raised by workflow job

@facebook-github-bot
Copy link
Contributor

@osalpekar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: Comment with id 1608768805 not found

Details for Dev Infra team Raised by workflow job

@osalpekar
Copy link
Member

@pytorchbot merge -f "all checks passed, no internal-only changes"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@osalpekar
Copy link
Member

@pritamdamania87 Yeah looks like this needed a re-import since the PR was updated, but it's merged now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: distributed (c10d) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants