
Conversation

@mingzhe09088 (Contributor) commented Aug 28, 2020

Summary: This diff adds an option for the process group NCCL backend to pick high-priority CUDA streams. It lets the CUDA driver prioritize NCCL kernels when there are compute kernels waiting. The CUDA documentation explains stream priorities here: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1ge2be9e9858849bf62ba4a8b66d1c3540
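For context, the mechanism behind this option is the CUDA stream-priority API linked above. The sketch below is not the PR's code; it is only a minimal standalone illustration of creating a stream at the highest available priority with the CUDA runtime API:

```cpp
// Minimal illustration of the CUDA stream-priority mechanism referenced
// above. Not the PR's implementation; names here are illustrative only.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int leastPriority = 0, greatestPriority = 0;
  // Query the priority range supported by the current device.
  // Numerically lower values indicate higher priority.
  cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

  cudaStream_t highPriorityStream;
  // Create a non-blocking stream at the highest available priority; work
  // launched on it is scheduled ahead of kernels waiting on
  // default-priority streams when the device is oversubscribed.
  cudaStreamCreateWithPriority(
      &highPriorityStream, cudaStreamNonBlocking, greatestPriority);

  std::printf("priority range: [%d, %d]\n", leastPriority, greatestPriority);

  cudaStreamDestroy(highPriorityStream);
  return 0;
}
```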

Test Plan: to add

Differential Revision: D23404286

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D23404286

dr-ci bot commented Aug 28, 2020

💊 CI failures summary and remediations

As of commit 6e6b5af (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Sep 16 16:02:50 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 16 16:02:50 At: 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 16 16:02:50  
Sep 16 16:02:50 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 16 16:02:50  
Sep 16 16:02:50 At: 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 16 16:02:50  
Sep 16 16:02:50 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 16 16:02:50  
Sep 16 16:02:50 At: 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 16 16:02:50  
Sep 16 16:02:50 ok (2.114s) 
Sep 16 16:02:53   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (2.065s) 
Sep 16 16:02:55   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (2.077s) 
Sep 16 16:02:57   test_rpc_profiling_remote_record_function (__main__.ProcessGroupRpcTestWithSpawn) ... ok (2.096s) 
Sep 16 16:02:59   test_rpc_return_rref (__main__.ProcessGroupRpcTestWithSpawn) ... ok (2.087s) 

This comment was automatically generated by Dr. CI.

@mrshenli (Contributor) left a comment


Hey @mingzhe09088, thanks for adding this. Could you please add some more description to the PR summary to explain the benefits of using a high priority stream?

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D23404286

codecov bot commented Sep 2, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@5e717f0). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #43796   +/-   ##
=========================================
  Coverage          ?   69.25%           
=========================================
  Files             ?      378           
  Lines             ?    46862           
  Branches          ?        0           
=========================================
  Hits              ?    32452           
  Misses            ?    14410           
  Partials          ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e717f0...260dc54. Read the comment docs.

Comment on lines 714 to 715
Contributor

Don't we need some docs here for isHighPriority and opTimeout explaining what these mean to users?

Contributor Author

This will only be for power users. Not sure what's a good place to add the docs. Could you suggest a place for that?
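
As a purely illustrative sketch of what such docs could look like (the field names are taken from the review comment above; the actual Options layout in ProcessGroupNCCL.hpp may differ):

```cpp
// Hypothetical sketch only; not the actual ProcessGroupNCCL.hpp contents.
#include <chrono>

struct Options {
  // How long an NCCL operation may run before the backend treats it as
  // timed out and surfaces an error.
  std::chrono::milliseconds opTimeout;

  // When true, NCCL work is enqueued on high-priority CUDA streams, so the
  // driver schedules communication kernels ahead of pending
  // default-priority compute kernels.
  bool isHighPriority;
};
```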

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D23404286

Summary:
Pull Request resolved: #43796

This diff adds an option for the process group NCCL backend to pick high priority cuda streams.

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D23404286

fbshipit-source-id: 412f8216678c74d932f8143040809108d03eda79
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D23404286

@facebook-github-bot (Contributor)

This pull request has been merged in 574f9af.

xuzhao9 pushed a commit that referenced this pull request Sep 18, 2020
Summary:
Pull Request resolved: #43796

This diff adds an option for the process group NCCL backend to pick high priority cuda streams.

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D23404286

fbshipit-source-id: b79ae097b7cd945a26e8ba1dd13ad3147ac790eb