[NCCL] Abort Errored and Timed Out NCCL Communicators from Watchdog Thread #41052
Conversation
💊 CI failures summary and remediations
As of commit 6c816ed (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
… Watchdog Thread" Watchdog Thread checks for error-ed or timed out `WorkNCCL` objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" Watchdog Thread checks for error-ed or timed out `WorkNCCL` objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" Watchdog Thread checks for error-ed or timed out `WorkNCCL` objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" **This Commit:** Watchdog Thread checks for error-ed or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) **This Stack:** The purpose of this stack is to fix the hanging behavior observed in when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic. Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" **This Commit:** Watchdog Thread checks for error-ed or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) **This Stack:** The purpose of this stack is to fix the hanging behavior observed in when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic. Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" **This Commit:** Watchdog Thread checks for error-ed or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) **This Stack:** The purpose of this stack is to fix the hanging behavior observed in when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic. Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" **This Commit:** Watchdog Thread checks for error-ed or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) **This Stack:** The purpose of this stack is to fix the hanging behavior observed in when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic. Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
torch/lib/c10d/ProcessGroupNCCL.cpp (outdated)
```cpp
// We should not abort the communicators if we are performing a
// non-blocking wait(). The reason for this is that if we abort the
// nccl communicator, wait() might not throw exceptions and
// subsequent operations might run on garbage results.
// The current model is that when we call wait(), subsequent
// operations only run after this work is done or we hang forever
// waiting for the operation to complete.
```
It seems like we violate the contract described in this comment if we remove the blockingWait_ check here?
Thanks for catching this. I guess users will get the same behavior when blockingWait_ is true. We may need to revise this comment block to say that the aborted NCCL call will be caught by the cleanup thread and cause an exception.
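A rough draft of the revised comment, based on the discussion above (the wording is only a suggestion, not the final text):

```cpp
// We should not abort the communicators if we are performing a plain
// non-blocking wait(). However, when blocking wait or async error
// handling is enabled, an errored or timed-out NCCL call will be picked
// up by the watchdog/cleanup thread, its communicators will be aborted,
// and a user-visible exception will be raised, so subsequent operations
// do not run on garbage results.
```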
If NCCL_BLOCKING_WAIT is false and NCCL_ASYNC_ERROR_HANDLING is false, we would still end up aborting communicators here, which might cause consistency issues: other ops issued after the aborted collective might run on corrupted data.
Good catch, thanks. We should probably guard this code block with `if (blockingWait_ || asyncErrorHandling_)` to handle this case.
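For concreteness, a minimal sketch of the guarded abort path, reusing the member and helper names visible in this diff (blockingWait_, asyncErrorHandling_, abortedCommIds, buildNcclUniqueIdStr); treat it as illustrative rather than the exact final code:

```cpp
// Only abort communicators when the user has opted into blocking wait or
// async error handling; otherwise, operations issued after the errored
// collective could silently run on corrupted data.
if (blockingWait_ || asyncErrorHandling_) {
  for (const auto& ncclComm : work->ncclComms_) {
    ncclComm->ncclCommAbort();
    abortedCommIds.emplace(buildNcclUniqueIdStr(ncclComm->getNcclId()));
  }
}
```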
```cpp
{
  std::unique_lock<std::mutex> lock(workListMutex_);
  for (auto& work : workList_) {
    work->checkAndSetException();
    // Aborting NCCL Communicators due to errors is already handled above.
    if (work->exception()) {
      continue;
    }

    // Check for Timeouts in the WorkNCCL Operations, and abort all
    // communicators accordingly.
    auto currentTimepoint = std::chrono::steady_clock::now();
    if (std::chrono::duration_cast<std::chrono::milliseconds>(
            currentTimepoint - work->workStartTime_) > work->opTimeout_) {
      std::exception_ptr exception_ptr = std::make_exception_ptr(
          std::runtime_error("NCCL Operation Timed Out"));
      work->setException(exception_ptr);
      for (const auto& ncclComm : work->ncclComms_) {
        ncclComm->ncclCommAbort();
        abortedCommIds.emplace(buildNcclUniqueIdStr(ncclComm->getNcclId()));
      }
    }
  }
}
```
This function is becoming pretty large, with multiple complex blocks; can we move each block out into a separate helper function for more clarity?
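One possible shape for that refactor, as a rough sketch only (the helper name, parameter types, and the assumption that abortedCommIds is a set of id strings are mine, not from this PR):

```cpp
// Hypothetical helper: checks a single in-flight work object for a timeout
// and, if it has timed out, sets its exception, aborts its communicators,
// and records the aborted communicator ids.
void ProcessGroupNCCL::abortTimedOutWork(
    std::shared_ptr<WorkNCCL>& work,
    std::unordered_set<std::string>& abortedCommIds) {
  auto currentTimepoint = std::chrono::steady_clock::now();
  if (std::chrono::duration_cast<std::chrono::milliseconds>(
          currentTimepoint - work->workStartTime_) <= work->opTimeout_) {
    return;
  }
  work->setException(std::make_exception_ptr(
      std::runtime_error("NCCL Operation Timed Out")));
  for (const auto& ncclComm : work->ncclComms_) {
    ncclComm->ncclCommAbort();
    abortedCommIds.emplace(buildNcclUniqueIdStr(ncclComm->getNcclId()));
  }
}
```

The watchdog loop body would then reduce to work->checkAndSetException() followed by a call to this helper for work objects that have no exception set.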
… Watchdog Thread" **This Commit:** Watchdog Thread checks for error-ed or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) **This Stack:** The purpose of this stack is to fix the hanging behavior observed in when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic. Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" **This Commit:** Watchdog Thread checks for error-ed or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) **This Stack:** The purpose of this stack is to fix the hanging behavior observed in when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic. Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" **This Commit:** Watchdog Thread checks for error-ed or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) **This Stack:** The purpose of this stack is to fix the hanging behavior observed in when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic. Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" **This Commit:** Watchdog Thread checks for error-ed or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) **This Stack:** The purpose of this stack is to fix the hanging behavior observed in when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic. Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
… Watchdog Thread" **This Commit:** Watchdog Thread checks for error-ed or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.) **This Stack:** The purpose of this stack is to fix the hanging behavior observed in when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic. Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/) [ghstack-poisoned]
This pull request has been merged in f8f7b78.
Pull Request resolved: pytorch/pytorch#41052
ghstack-source-id: 111311021
Differential Revision: [D21943151](https://our.internmc.facebook.com/intern/diff/D21943151/)
Stack from ghstack:
This Commit:
The Watchdog Thread checks for errored or timed out WorkNCCL objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators with the existing Watchdog logic (by adding them to abortedCommIds and writing the aborted communicator ids to the store).
This Stack:
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
Differential Revision: D21943151