Conversation

@osalpekar (Member) commented Jul 7, 2020

Stack from ghstack:

**This Commit:**
We introduce a workVector to track live WorkNCCL objects corresponding to in-flight collective operations, and a workCleanupLoop thread that busy-polls this vector and removes WorkNCCL objects once they complete.
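
For readers skimming the description, here is a minimal, self-contained C++ sketch of this busy-polling pattern: a mutex-guarded vector of work handles plus a background thread that periodically erases completed entries. All names here (`Work`, `WorkTracker`, `enqueue`, the flag-based completion check) are illustrative placeholders and do not mirror the actual ProcessGroupNCCL code in this PR.

```cpp
// Illustrative sketch only -- not the ProcessGroupNCCL implementation.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct Work {
  // In the real class, completion would be determined by querying the CUDA
  // events recorded after the NCCL kernel; here it is just a flag.
  std::atomic<bool> completed{false};
  bool isCompleted() const { return completed.load(); }
};

class WorkTracker {
 public:
  WorkTracker() : cleanupThread_([this] { workCleanupLoop(); }) {}

  ~WorkTracker() {
    stop_ = true;
    cleanupThread_.join();
  }

  // Called whenever a collective is launched; the returned handle is also
  // enqueued so the cleanup thread can watch it.
  std::shared_ptr<Work> enqueue() {
    auto work = std::make_shared<Work>();
    std::lock_guard<std::mutex> lock(mutex_);
    workVector_.push_back(work);
    return work;
  }

 private:
  // Busy-poll the vector and drop entries whose work has completed.
  void workCleanupLoop() {
    while (!stop_) {
      {
        std::lock_guard<std::mutex> lock(mutex_);
        workVector_.erase(
            std::remove_if(
                workVector_.begin(),
                workVector_.end(),
                [](const std::shared_ptr<Work>& w) {
                  return w->isCompleted();
                }),
            workVector_.end());
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  std::mutex mutex_;
  std::vector<std::shared_ptr<Work>> workVector_;
  std::atomic<bool> stop_{false};
  std::thread cleanupThread_;
};
```

In the sketch, polling happens on a dedicated thread, so the thread launching collectives never blocks on cleanup.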

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint using a tool such as torchelastic.
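
A rough sketch of the abort path described above, under the assumption that each tracked work item carries a start time and a timeout; the field names and the plain `std::runtime_error` are placeholders chosen for illustration, not the exception type this stack actually throws.

```cpp
// Illustrative sketch only -- field names and the exception type are
// placeholders, not the actual ProcessGroupNCCL abort logic.
#include <chrono>
#include <stdexcept>
#include <string>

struct TrackedWork {
  std::chrono::steady_clock::time_point startTime;
  std::chrono::milliseconds timeout;
  bool completed = false;
};

// Throws if the collective has exceeded its deadline; callers would abort the
// NCCL communicator and let an elastic launcher (e.g. torchelastic) restart
// training from the last checkpoint.
inline void checkAndThrowIfTimedOut(const TrackedWork& work) {
  if (work.completed) {
    return;
  }
  const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - work.startTime);
  if (elapsed > work.timeout) {
    throw std::runtime_error(
        "NCCL collective timed out after " + std::to_string(elapsed.count()) +
        " ms");
  }
}
```

A watchdog or cleanup thread would run such a check while polling outstanding work, so the error surfaces to the user instead of the collective hanging indefinitely.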

Differential Revision: [D21916637](https://our.internmc.facebook.com/intern/diff/D21916637/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21916637/)!

Initial code for adding WorkNCCL objects to a vector when any NCCL Collective operation is called (similar to ProcessGroupGloo), and a timeout thread that busy-polls the vector of WorkNCCL objects and removes them upon completion.

@dr-ci bot commented Jul 7, 2020

💊 CI failures summary and remediations

As of commit 8f07a92 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Sep 08 15:40:33 [E request_callback_no_python.cpp:619] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 08 15:40:33 At: 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 08 15:40:33  
Sep 08 15:40:33 [E request_callback_no_python.cpp:619] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 08 15:40:33  
Sep 08 15:40:33 At: 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 08 15:40:33  
Sep 08 15:40:33 [E request_callback_no_python.cpp:619] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 08 15:40:33  
Sep 08 15:40:33 At: 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 08 15:40:33   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 08 15:40:33  
Sep 08 15:40:33 ok (1.535s) 
Sep 08 15:40:35   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.488s) 
Sep 08 15:40:36   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.552s) 
Sep 08 15:40:38   test_rpc_profiling_remote_record_function (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.432s) 
Sep 08 15:40:39   test_rpc_return_rref (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.585s) 

❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Sep 08 22:48:27 RuntimeError: Process 0 terminated or timed out after 100.09387135505676 seconds
Sep 08 22:48:27 ====================================================================== 
Sep 08 22:48:27 ERROR [100.115s]: test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient (__main__.TestDistBackend) 
Sep 08 22:48:27 ---------------------------------------------------------------------- 
Sep 08 22:48:27 Traceback (most recent call last): 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 224, in wrapper 
Sep 08 22:48:27     self._join_processes(fn) 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 337, in _join_processes 
Sep 08 22:48:27     self._check_return_codes(elapsed_time) 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 375, in _check_return_codes 
Sep 08 22:48:27     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Sep 08 22:48:27 RuntimeError: Process 0 terminated or timed out after 100.09387135505676 seconds 
Sep 08 22:48:27  
Sep 08 22:48:27 ====================================================================== 
Sep 08 22:48:27 FAIL [0.224s]: test_DistributedDataParallel_SyncBatchNorm_2D_Input (__main__.TestDistBackend) 
Sep 08 22:48:27 ---------------------------------------------------------------------- 
Sep 08 22:48:27 Traceback (most recent call last): 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 224, in wrapper 
Sep 08 22:48:27     self._join_processes(fn) 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 337, in _join_processes 
Sep 08 22:48:27     self._check_return_codes(elapsed_time) 
Sep 08 22:48:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 389, in _check_return_codes 

ci.pytorch.org: 1 failed



osalpekar added 2 commits July 6, 2020 17:40
@pritamdamania87 (Contributor) left a comment

There seem to be some DDP tests failing, please check those before landing.

@facebook-github-bot (Contributor)

This pull request has been merged in 1df24fd.

@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/53/head branch September 13, 2020 14:16
loadbxh pushed a commit to loadbxh/Torch that referenced this pull request Sep 23, 2020
Pull Request resolved: pytorch/pytorch#41050

ghstack-source-id: 111301603