[RESUBMIT] Ensure ncclCommAbort can abort stuck ncclCommInitRank #103925
Dr. CI: ✅ No failures as of commit 5e367b5. Artifacts and rendered test results are at hud.pytorch.org/pr/103925. |
|
@osalpekar This is a resubmit of #103264 and CI is green. Is there some specific Sandcastle configuration that was causing issues? |
|
@osalpekar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@osalpekar Wondering if you found any issues after importing the PR? |
|
@pritamdamania87 The flakiness is on our end internally. I'll stamp this and we can merge. |
|
@pytorchbot merge |
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). |
Merge failed. Reason: 1 job failed: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral) |
|
@pytorchbot rebase |
|
@pytorchbot merge |
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). |
Merge failed. Reason: 1 job failed: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral) |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. |
pytorch#95715 added the ability to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior. However, calling `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig`, because the `_abort` method only looked through the `devNCCLCommMap_` map and aborted the communicators it found there. Since `ncclCommInitRankConfig` was stuck, the communicator was never added to that map, and the host thread remained stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op. To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once they were added to `devNCCLCommMap_`. I also added a unit test that fails without the changes to ProcessGroupNCCL.cpp.
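The bookkeeping described above can be illustrated with a small, self-contained sketch. This is not the code from ProcessGroupNCCL.cpp: `FakeComm`, `CommRegistry`, `beginInit`, `finishInit`, and `abortAll` are hypothetical names invented for illustration, and no NCCL calls are made. The sketch assumes the abort path walks both the ready map and the in-progress map, which is what the description implies.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a communicator wrapper; a real one would call
// ncclCommAbort. Here abort() only records that it was invoked.
struct FakeComm {
  std::atomic<bool> aborted{false};
  void abort() { aborted.store(true); }
};

// Hypothetical registry mirroring the two maps named in the description.
class CommRegistry {
 public:
  // Register the communicator as in-progress as soon as it is created,
  // before the (possibly hanging) initialization is awaited.
  std::shared_ptr<FakeComm> beginInit(const std::string& deviceKey) {
    auto comm = std::make_shared<FakeComm>();
    std::lock_guard<std::mutex> lock(mutex_);
    inProgressCommMap_[deviceKey] = comm;
    return comm;
  }

  // Once initialization completes, move the communicator to the ready map.
  void finishInit(const std::string& deviceKey) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = inProgressCommMap_.find(deviceKey);
    if (it == inProgressCommMap_.end()) {
      return;  // nothing pending for this key
    }
    devCommMap_[deviceKey] = it->second;
    inProgressCommMap_.erase(it);
  }

  // Abort every known communicator, including those still initializing.
  // Returns the number of communicators aborted.
  std::size_t abortAll() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::size_t count = 0;
    for (auto& entry : devCommMap_) {
      entry.second->abort();
      ++count;
    }
    for (auto& entry : inProgressCommMap_) {
      entry.second->abort();
      ++count;
    }
    return count;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<FakeComm>> devCommMap_;
  std::unordered_map<std::string, std::shared_ptr<FakeComm>> inProgressCommMap_;
};
```

In this sketch, a communicator whose `finishInit` is never reached (the "stuck" case) is still visible to `abortAll`, whereas a registry that only tracked the ready map would skip it and the abort would be a no-op for that communicator.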
|
Successfully rebased |
35663c6 to 5e367b5
|
@pytorchbot merge |
|
@osalpekar Do we need to import again? :) |
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). |
Merge failed. Reason: 1 job failed: Meta Internal-Only Changes Check |
|
@osalpekar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Merge started: your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). |
Merge failed. Reason: Comment with id 1608768805 not found |
|
@pytorchbot merge -f "all checks passed, no internal-only changes" |
Merge started: your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). |
|
@pritamdamania87 Yeah looks like this needed a re-import since the PR was updated, but it's merged now :) |