Conversation

@osalpekar (Member) commented on Jul 7, 2020

Stack from ghstack:

**This Commit:**
ProcessGroupNCCL destructor now blocks until all WorkNCCL objects have either been aborted or completed and removed from the work vector.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

Differential Revision: [D22054298](https://our.internmc.facebook.com/intern/diff/D22054298/)

[ghstack-poisoned]
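
To make the mechanism above concrete, here is a minimal, hedged sketch of the destructor-side wait in C++. The member names `workList_`, `workListMutex_`, and `workListCV_` come from the review snippets quoted later in this conversation; the `WorkNCCL` stub and the surrounding class are illustrative assumptions, not the actual ProcessGroupNCCL implementation.

```cpp
// Sketch only: the destructor drops already-completed work, then blocks on a
// condition variable until the cleanup thread has drained workList_.
#include <condition_variable>
#include <list>
#include <memory>
#include <mutex>

struct WorkNCCL {
  // Placeholder for the real WorkNCCL; completed_ stands in for the actual
  // completion/abort bookkeeping.
  bool isCompleted() const { return completed_; }
  bool completed_ = false;
};

class ProcessGroupSketch {
 public:
  ~ProcessGroupSketch() {
    std::unique_lock<std::mutex> lock(workListMutex_);
    // Erase work that has already finished instead of waiting for the
    // cleanup thread to be scheduled again.
    for (auto it = workList_.begin(); it != workList_.end(); /* no increment */) {
      if ((*it)->isCompleted()) {
        it = workList_.erase(it);
      } else {
        ++it;
      }
    }
    // Block until every remaining WorkNCCL object has been aborted or
    // completed and removed from the list by the cleanup thread.
    workListCV_.wait(lock, [&] { return workList_.empty(); });
  }

 private:
  std::list<std::unique_ptr<WorkNCCL>> workList_;
  std::mutex workListMutex_;
  std::condition_variable workListCV_;
};
```

The point of the pattern is that the cleanup thread, not the destructor, removes in-flight work; the destructor only drops entries that are already complete and then waits for the list to drain.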
@dr-ci bot commented on Jul 7, 2020

💊 CI failures summary and remediations

As of commit f19707c (more details on the Dr. CI page):

❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1), step "Run tests":

Sep 08 22:37:06 RuntimeError: Process 0 terminated or timed out after 100.08557462692261 seconds
Sep 08 22:37:06 ====================================================================== 
Sep 08 22:37:06 ERROR [100.108s]: test_failure_recovery (__main__.DistributedDataParallelTest) 
Sep 08 22:37:06 ---------------------------------------------------------------------- 
Sep 08 22:37:06 Traceback (most recent call last): 
Sep 08 22:37:06   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 224, in wrapper 
Sep 08 22:37:06     self._join_processes(fn) 
Sep 08 22:37:06   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 337, in _join_processes 
Sep 08 22:37:06     self._check_return_codes(elapsed_time) 
Sep 08 22:37:06   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 375, in _check_return_codes 
Sep 08 22:37:06     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Sep 08 22:37:06 RuntimeError: Process 0 terminated or timed out after 100.08557462692261 seconds 
Sep 08 22:37:06  
Sep 08 22:37:06 ---------------------------------------------------------------------- 
Sep 08 22:37:06 Ran 120 tests in 326.287s 
Sep 08 22:37:06  
Sep 08 22:37:06 FAILED (errors=2, skipped=9) 
Sep 08 22:37:06  
Sep 08 22:37:06 Generating XML reports... 
Sep 08 22:37:06 Generated XML report: test-reports/python-unittest/TEST-CommTest-20200908223140.xml 
Sep 08 22:37:06 Generated XML report: test-reports/python-unittest/TEST-ComputeBucketAssignmentTest-20200908223140.xml 
Sep 08 22:37:06 Generated XML report: test-reports/python-unittest/TEST-DistributedDataParallelTest-20200908223140.xml 

ci.pytorch.org: 1 failed



osalpekar added a commit that referenced this pull request Jul 7, 2020
Pull Request resolved: #41054

We should block until all WorkNCCL objects have been either aborted or completed and removed from the work vector.
ghstack-source-id: 107224185

Differential Revision: [D22054298](https://our.internmc.facebook.com/intern/diff/D22054298/)
Comment on lines 644 to 648
```cpp
if (workList_.empty()) {
  // Notify the main thread if it is blocked in the shutdown sequence,
  // waiting for the work vector to become empty.
  lock.unlock();
  workVectorCV_.notify_one();
```
Contributor:

Do we really need to do this? Wouldn't this automatically abort when terminateProcessGroup_ is set to True? Or are we referring to some other thread here?

Member (Author):

This notifies the CV in the destructor that is waiting for the workList_ to become empty.

Comment on lines 451 to 462
```cpp
std::unique_lock<std::mutex> lock(workListMutex_);
// Clean up any remaining items in the workList_ instead of waiting for the
// workCleanup Thread to be scheduled again.
for (auto it = workList_.begin(); it != workList_.end();
     /* no increment*/) {
  auto& work = *it;
  if (work->isCompleted()) {
    it = workList_.erase(it);
  } else {
    ++it;
  }
}
```
Contributor:

Why do we need to perform this explicit cleanup? Once the destructor completes, wouldn't workList_ automatically be freed?

Member (Author):

Right after this code block, we block in the destructor until workList_ is empty (no unfinished collectives left). Ideally this cleanup would be done in the workCleanupThread itself, but there was one corner case causing an issue here; Hongyi and I are continuing to investigate, and I'll create an issue regarding this.

Contributor:

Do we need to block in the destructor until workList_ is empty? How does removing completed items from workList_ help in the shutdown here?

Member (Author):

If there are leftover WorkNCCL objects in workList_, that means there are incomplete collectives. So we block until workList_ becomes empty to ensure that all collectives have either completed or errored out before we destruct ProcessGroupNCCL. Ideally, the workCleanupThread would do all of the cleanup. However, when models contain a SyncBatchNorm layer, we found that this cleanup had to occur in the destructor. Hongyi (@jiayisuse) and I have investigated this quite a bit, and I've created a follow-up issue (#44403). We should be able to remove the explicit cleanup from the destructor and let the workCleanupThread handle it completely, and I'll continue to push on this as a Better Engineering task.

Comment on lines +651 to +656
```cpp
if (workList_.empty()) {
  // Notify the main thread if it is blocked in the shutdown sequence,
  // waiting for the work vector to become empty.
  lock.unlock();
  workListCV_.notify_one();
}
```
@jiayisuse (Contributor) commented on Sep 2, 2020:

This is needed

Comment on lines +470 to +472
```cpp
// Wait for workList_ to become empty before proceeding with shutdown.
workListCV_.wait(lock, [&]() -> bool { return workList_.empty(); });
lock.unlock();
```
@jiayisuse (Contributor) commented on Sep 2, 2020:

Checked again, the above code just removes completed work. So I guess we let the cleanup thread remove the unfinished work?

Member (Author):

We're still blocking until workList_ is empty, so the workCleanupThread will keep looping and removing work objects as they complete.
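
For context, here is a hedged sketch of the cleanup-thread side of that handshake. The member names `workList_`, `workListMutex_`, `workListCV_`, and `terminateProcessGroup_` are taken from the snippets and comments in this conversation; the loop structure, shutdown condition, and polling interval are illustrative assumptions, not the actual workCleanupLoop.

```cpp
// Sketch only: the cleanup thread repeatedly erases completed work and, once
// the list is empty, wakes the destructor that may be waiting on workListCV_.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <list>
#include <memory>
#include <mutex>
#include <thread>

struct WorkNCCL {
  bool isCompleted() const { return completed_.load(); }
  std::atomic<bool> completed_{false};
};

struct CleanupSketch {
  std::list<std::unique_ptr<WorkNCCL>> workList_;
  std::mutex workListMutex_;
  std::condition_variable workListCV_;
  std::atomic<bool> terminateProcessGroup_{false};

  void workCleanupLoop() {
    while (true) {
      {
        std::unique_lock<std::mutex> lock(workListMutex_);
        // Remove completed work, mirroring the excerpt quoted above.
        for (auto it = workList_.begin(); it != workList_.end(); /* no increment */) {
          if ((*it)->isCompleted()) {
            it = workList_.erase(it);
          } else {
            ++it;
          }
        }
        if (workList_.empty()) {
          // Wake the destructor if it is blocked waiting for an empty list.
          lock.unlock();
          workListCV_.notify_one();
          if (terminateProcessGroup_.load()) {
            return;  // Shutdown requested and nothing left to clean up.
          }
        }
      }
      // Illustrative polling interval; real scheduling differs.
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
  }
};
```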

@facebook-github-bot (Contributor) commented:

This pull request has been merged in 211ece7.

facebook-github-bot deleted the gh/osalpekar/57/head branch on September 13, 2020 at 14:17.
loadbxh pushed a commit to loadbxh/Torch that referenced this pull request Sep 23, 2020
Pull Request resolved: pytorch/pytorch#41054

ghstack-source-id: 111311019

Differential Revision: [D22054298](https://our.internmc.facebook.com/intern/diff/D22054298/)