[NCCL] Destructor Blocks on WorkNCCL Completion #41054
Conversation
We should block until all WorkNCCL objects have been either aborted or completed and removed from the work vector. Differential Revision: [D22054298](https://our.internmc.facebook.com/intern/diff/D22054298/) [ghstack-poisoned]
💊 CI failures summary and remediations (as of commit f19707c; more details on the Dr. CI page):
❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm.
torch/lib/c10d/ProcessGroupNCCL.cpp
Outdated
if (workList_.empty()) {
  // Notify the main thread if it is blocked in the shutdown sequence,
  // waiting for the work vector to become empty.
  lock.unlock();
  workVectorCV_.notify_one();
Do we really need to do this? Wouldn't this automatically abort when terminateProcessGroup_ is set to True? Or are we referring to some other thread here?
This notifies the CV in the destructor that is waiting for the workList_ to become empty.
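For readers skimming the thread, here is a minimal sketch of how that notify pairs with the wait in the destructor, assuming a plain mutex + condition-variable handshake. The names mirror the diff, but the free functions and the placeholder element type are illustrative, not the actual ProcessGroupNCCL code.

```cpp
// Minimal sketch of the notify/wait handshake; illustrative only.
#include <condition_variable>
#include <list>
#include <mutex>

static std::mutex workListMutex_;
static std::condition_variable workListCV_;
static std::list<int> workList_;  // stand-in for the WorkNCCL entries

// Cleanup-thread side: after erasing completed work, wake anything waiting
// for the list to drain (e.g. a destructor blocked during shutdown).
void notifyIfDrained() {
  std::unique_lock<std::mutex> lock(workListMutex_);
  if (workList_.empty()) {
    lock.unlock();
    workListCV_.notify_one();
  }
}

// Destructor side: block shutdown until every work object has been aborted
// or completed and removed from the list.
void waitForDrain() {
  std::unique_lock<std::mutex> lock(workListMutex_);
  workListCV_.wait(lock, [&]() -> bool { return workList_.empty(); });
}
```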
std::unique_lock<std::mutex> lock(workListMutex_);
// Clean up any remaining items in the workList_ instead of waiting for the
// workCleanup Thread to be scheduled again.
for (auto it = workList_.begin(); it != workList_.end();
     /* no increment */) {
  auto& work = *it;
  if (work->isCompleted()) {
    it = workList_.erase(it);
  } else {
    ++it;
  }
}
Why do we need to perform this explicit cleanup? Once the destructor completes, wouldn't workList_ automatically be freed?
Right after this code block, we block in the destructor until workList_ is empty (i.e., no unfinished collectives are left). Ideally this cleanup would just be done in the workCleanup thread itself, but there was one corner case causing an issue here. Hongyi and I are continuing to investigate, and I'll create an issue regarding this.
Do we need to block in the destructor until workList_ is empty? How does removing completed items from workList_ help in the shutdown here?
If there are leftover WorkNCCL objects in workList_, that means there are incomplete collectives. So we block until workList_ becomes empty to ensure that all collectives have either completed or errored out before we destruct ProcessGroupNCCL. Ideally, the workCleanupThread would do all of the cleanup. However, when models contain a SyncBatchNorm layer, we found that this cleanup had to occur in the destructor. Hongyi (@jiayisuse) and I have investigated this quite a bit, and I've created a follow-up issue (#44403). We should be able to deduplicate the explicit cleanup in the destructor and let the workCleanupThread handle it completely, and I'll continue to push this as a Better Engineering task.
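One possible shape for that deduplication, sketched under the assumption that both call sites hold workListMutex_; the helper name and template form are hypothetical, not something taken from the PR or from issue #44403.

```cpp
// Hypothetical helper sketching the deduplication idea: a single routine
// that erases completed work objects, callable from both the destructor's
// explicit cleanup pass and the workCleanupThread. The caller is assumed to
// hold workListMutex_.
template <typename WorkList>
void eraseCompletedWork(WorkList& workList) {
  for (auto it = workList.begin(); it != workList.end(); /* no increment */) {
    if ((*it)->isCompleted()) {
      it = workList.erase(it);
    } else {
      ++it;
    }
  }
}
```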
if (workList_.empty()) {
  // Notify the main thread if it is blocked in the shutdown sequence,
  // waiting for the work vector to become empty.
  lock.unlock();
  workListCV_.notify_one();
}
This is needed.
// Wait for workList_ to become empty before proceeding with shutdown.
workListCV_.wait(lock, [&]() -> bool { return workList_.empty(); });
lock.unlock();
Checked again: the code above just removes completed work. So I guess we let the cleanup thread remove the unfinished works?
We're still blocking to ensure the workList is empty, so the workCleanupThread will continue looping and removing works when they are completed.
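A rough skeleton of that looping behavior, under the assumption that the cleanup thread wakes periodically, reaps completed work, and notifies once the list drains. The wake-up interval, termination handling, and the elided erase step are assumptions, not the actual workCleanupThread.

```cpp
// Illustrative skeleton only; names mirror the diff, but the wake-up
// interval and termination handling are assumptions.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <list>
#include <mutex>

std::mutex workListMutex_;
std::condition_variable workListCV_;
std::list<int> workList_;  // stand-in for the WorkNCCL entries
std::atomic<bool> terminateProcessGroup_{false};
constexpr std::chrono::milliseconds kWorkCleanupInterval(100);  // assumed

void workCleanupLoop() {
  std::unique_lock<std::mutex> lock(workListMutex_);
  // Keep looping until shutdown has been requested and no work remains.
  while (!terminateProcessGroup_ || !workList_.empty()) {
    // ... erase entries whose isCompleted() returns true (see the diff) ...
    if (workList_.empty()) {
      lock.unlock();
      workListCV_.notify_one();  // let a destructor blocked on drain proceed
      lock.lock();
    }
    // Sleep briefly (or until notified) before scanning again so completed
    // or aborted work is reaped promptly.
    workListCV_.wait_for(lock, kWorkCleanupInterval);
  }
}
```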
This pull request has been merged in 211ece7.
Stack from ghstack:
This Commit:
ProcessGroupNCCL destructor now blocks until all WorkNCCL objects have either been aborted or completed and removed from the work vector.
This Stack:
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
Differential Revision: D22054298
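For readers unfamiliar with the stack, the timeout-and-abort behavior described above could be approximated by a watchdog check of this rough shape. The struct, field names, default timeout, and the commented-out abort hook are all assumptions for illustration, not the actual PyTorch/NCCL implementation.

```cpp
// Purely illustrative sketch of "detect a hung collective, abort it, and
// surface a user-visible error". PendingCollective, timedOut(), and the
// abort hook are hypothetical; only the overall shape matches the
// description above.
#include <chrono>
#include <stdexcept>

struct PendingCollective {
  std::chrono::steady_clock::time_point startTime;
  std::chrono::milliseconds opTimeout{30000};  // assumed default timeout
};

bool timedOut(const PendingCollective& work) {
  return std::chrono::steady_clock::now() - work.startTime > work.opTimeout;
}

void checkAndAbortIfTimedOut(PendingCollective& work) {
  if (timedOut(work)) {
    // abortNcclComms(work);  // hypothetical hook: tear down the communicator
    throw std::runtime_error(
        "NCCL collective timed out; training can be restarted from the "
        "last checkpoint (e.g. via torchelastic).");
  }
}
```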