Conversation

@osalpekar (Member) commented Jul 7, 2020

Stack from ghstack:

**This Commit:**
We introduce a workVector to track live WorkNCCL objects corresponding to in-flight collective operations, and a workCleanupLoop thread that busy-polls this vector and removes WorkNCCL objects once they complete.
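
For readers skimming the description, here is a minimal, self-contained C++ sketch of this busy-polling pattern: a mutex-guarded vector of work handles plus a background thread that periodically erases completed entries. All names here (`Work`, `WorkTracker`, `enqueue`, the flag-based completion check) are illustrative placeholders and do not mirror the actual ProcessGroupNCCL code in this PR.

```cpp
// Illustrative sketch only -- not the ProcessGroupNCCL implementation.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct Work {
  // In the real class, completion would be determined by querying the CUDA
  // events recorded after the NCCL kernel; here it is just a flag.
  std::atomic<bool> completed{false};
  bool isCompleted() const { return completed.load(); }
};

class WorkTracker {
 public:
  WorkTracker() : cleanupThread_([this] { workCleanupLoop(); }) {}

  ~WorkTracker() {
    stop_ = true;
    cleanupThread_.join();
  }

  // Called whenever a collective is launched; the returned handle is also
  // enqueued so the cleanup thread can watch it.
  std::shared_ptr<Work> enqueue() {
    auto work = std::make_shared<Work>();
    std::lock_guard<std::mutex> lock(mutex_);
    workVector_.push_back(work);
    return work;
  }

 private:
  // Busy-poll the vector and drop entries whose work has completed.
  void workCleanupLoop() {
    while (!stop_) {
      {
        std::lock_guard<std::mutex> lock(mutex_);
        workVector_.erase(
            std::remove_if(
                workVector_.begin(),
                workVector_.end(),
                [](const std::shared_ptr<Work>& w) {
                  return w->isCompleted();
                }),
            workVector_.end());
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  std::mutex mutex_;
  std::vector<std::shared_ptr<Work>> workVector_;
  std::atomic<bool> stop_{false};
  std::thread cleanupThread_;
};
```

In the sketch, polling happens on a dedicated thread, so the thread launching collectives never blocks on cleanup.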

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint using a tool such as torchelastic.
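
A rough sketch of the abort path described above, under the assumption that each tracked work item carries a start time and a timeout; the field names and the plain `std::runtime_error` are placeholders chosen for illustration, not the exception type this stack actually throws.

```cpp
// Illustrative sketch only -- field names and the exception type are
// placeholders, not the actual ProcessGroupNCCL abort logic.
#include <chrono>
#include <stdexcept>
#include <string>

struct TrackedWork {
  std::chrono::steady_clock::time_point startTime;
  std::chrono::milliseconds timeout;
  bool completed = false;
};

// Throws if the collective has exceeded its deadline; callers would abort the
// NCCL communicator and let an elastic launcher (e.g. torchelastic) restart
// training from the last checkpoint.
inline void checkAndThrowIfTimedOut(const TrackedWork& work) {
  if (work.completed) {
    return;
  }
  const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - work.startTime);
  if (elapsed > work.timeout) {
    throw std::runtime_error(
        "NCCL collective timed out after " + std::to_string(elapsed.count()) +
        " ms");
  }
}
```

A watchdog or cleanup thread would run such a check while polling outstanding work, so the error surfaces to the user instead of the collective hanging indefinitely.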

Differential Revision: [D21916637](https://our.internmc.facebook.com/intern/diff/D21916637/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21916637/)!

Initial code for adding WorkNCCL objects to a vector when any NCCL Collective operation is called (similar to ProcessGroupGloo), and a timeout thread that busy-polls the vector of WorkNCCL objects and removes them upon completion.

@dr-ci bot commented Jul 7, 2020

💊 CI failures summary and remediations

As of commit 8f07a92 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Sep 08 15:40:33 [E request_callback_no_python.cpp:619] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 08 15:40:33 At: 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 08 15:40:33  
Sep 08 15:40:33 [E request_callback_no_python.cpp:619] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 08 15:40:33  
Sep 08 15:40:33 At: 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 08 15:40:33  
Sep 08 15:40:33 [E request_callback_no_python.cpp:619] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 08 15:40:33  
Sep 08 15:40:33 At: 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 08 15:40:33  
Sep 08 15:40:33 ok (1.535s) 
Sep 08 15:40:35   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.488s) 
Sep 08 15:40:36   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.552s) 
Sep 08 15:40:38   test_rpc_profiling_remote_record_function (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.432s) 
Sep 08 15:40:39   test_rpc_return_rref (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.585s) 

❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Sep 08 22:48:27 RuntimeError: Process 0 terminated or timed out after 100.09387135505676 seconds
Sep 08 22:48:27 ====================================================================== 
Sep 08 22:48:27 ERROR [100.115s]: test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient (__main__.TestDistBackend) 
Sep 08 22:48:27 ---------------------------------------------------------------------- 
Sep 08 22:48:27 Traceback (most recent call last): 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 224, in wrapper 
Sep 08 22:48:27     self._join_processes(fn) 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 337, in _join_processes 
Sep 08 22:48:27     self._check_return_codes(elapsed_time) 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 375, in _check_return_codes 
Sep 08 22:48:27     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Sep 08 22:48:27 RuntimeError: Process 0 terminated or timed out after 100.09387135505676 seconds 
Sep 08 22:48:27  
Sep 08 22:48:27 ====================================================================== 
Sep 08 22:48:27 FAIL [0.224s]: test_DistributedDataParallel_SyncBatchNorm_2D_Input (__main__.TestDistBackend) 
Sep 08 22:48:27 ---------------------------------------------------------------------- 
Sep 08 22:48:27 Traceback (most recent call last): 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 224, in wrapper 
Sep 08 22:48:27     self._join_processes(fn) 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 337, in _join_processes 
Sep 08 22:48:27     self._check_return_codes(elapsed_time) 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 389, in _check_return_codes 

ci.pytorch.org: 1 failed



osalpekar added 2 commits July 6, 2020 17:40
@pritamdamania87 (Contributor) left a comment

There seem to be some DDP tests failing, please check those before landing.

@facebook-github-bot (Contributor)

This pull request has been merged in 1df24fd.

@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/53/head branch September 13, 2020 14:16
loadbxh pushed a commit to loadbxh/Torch that referenced this pull request Sep 23, 2020
Pull Request resolved: pytorch/pytorch#41050

ghstack-source-id: 111301603