
Conversation

@mingzhe09088 (Contributor) commented Aug 28, 2020

Summary: This diff adds an option for the process group NCCL backend to pick high-priority CUDA streams. It lets the CUDA driver prioritize NCCL kernels when there are compute kernels waiting. The CUDA documentation explains stream priorities here: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1ge2be9e9858849bf62ba4a8b66d1c3540
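For context, the mechanism behind this option is the CUDA stream-priority API linked above. The sketch below is not the PR's code; it is only a minimal standalone illustration of creating a stream at the highest available priority with the CUDA runtime API:

```cpp
// Minimal illustration of the CUDA stream-priority mechanism referenced
// above. Not the PR's implementation; names here are illustrative only.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int leastPriority = 0, greatestPriority = 0;
  // Query the priority range supported by the current device.
  // Numerically lower values indicate higher priority.
  cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

  cudaStream_t highPriorityStream;
  // Create a non-blocking stream at the highest available priority; work
  // launched on it is scheduled ahead of kernels waiting on
  // default-priority streams when the device is oversubscribed.
  cudaStreamCreateWithPriority(
      &highPriorityStream, cudaStreamNonBlocking, greatestPriority);

  std::printf("priority range: [%d, %d]\n", leastPriority, greatestPriority);

  cudaStreamDestroy(highPriorityStream);
  return 0;
}
```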

Test Plan: to add

Differential Revision: D23404286

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D23404286

dr-ci bot commented Aug 28, 2020

💊 CI failures summary and remediations

As of commit 6e6b5af (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Sep 16 16:02:50 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 16 16:02:50 At: 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 16 16:02:50  
Sep 16 16:02:50 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 16 16:02:50  
Sep 16 16:02:50 At: 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 16 16:02:50  
Sep 16 16:02:50 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 16 16:02:50  
Sep 16 16:02:50 At: 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 16 16:02:50   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 16 16:02:50  
Sep 16 16:02:50 ok (2.114s) 
Sep 16 16:02:53   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (2.065s) 
Sep 16 16:02:55   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (2.077s) 
Sep 16 16:02:57   test_rpc_profiling_remote_record_function (__main__.ProcessGroupRpcTestWithSpawn) ... ok (2.096s) 
Sep 16 16:02:59   test_rpc_return_rref (__main__.ProcessGroupRpcTestWithSpawn) ... ok (2.087s) 

This comment was automatically generated by Dr. CI.

@mrshenli (Contributor) left a comment


Hey @mingzhe09088, thanks for adding this. Could you please add some more description to the PR summary to explain the benefits of using a high priority stream?

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D23404286

codecov bot commented Sep 2, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@5e717f0). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #43796   +/-   ##
=========================================
  Coverage          ?   69.25%           
=========================================
  Files             ?      378           
  Lines             ?    46862           
  Branches          ?        0           
=========================================
  Hits              ?    32452           
  Misses            ?    14410           
  Partials          ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e717f0...260dc54. Read the comment docs.

Comment on lines 714 to 715
Contributor

Don't we need some docs here for isHighPriority and opTimeout explaining what these mean to users?

Contributor Author

This will only be for power users. Not sure what's a good place to add the docs. Could you suggest a place for that?
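
As a purely illustrative sketch of what such docs could look like (the field names are taken from the review comment above; the actual Options layout in ProcessGroupNCCL.hpp may differ):

```cpp
// Hypothetical sketch only; not the actual ProcessGroupNCCL.hpp contents.
#include <chrono>

struct Options {
  // How long an NCCL operation may run before the backend treats it as
  // timed out and surfaces an error.
  std::chrono::milliseconds opTimeout;

  // When true, NCCL work is enqueued on high-priority CUDA streams, so the
  // driver schedules communication kernels ahead of pending
  // default-priority compute kernels.
  bool isHighPriority;
};
```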

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D23404286

Summary:
Pull Request resolved: #43796

This diff adds an option for the process group NCCL backend to pick high priority cuda streams.

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D23404286

fbshipit-source-id: 412f8216678c74d932f8143040809108d03eda79
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D23404286

@facebook-github-bot (Contributor)

This pull request has been merged in 574f9af.

xuzhao9 pushed a commit that referenced this pull request Sep 18, 2020
Summary:
Pull Request resolved: #43796

This diff adds an option for the process group NCCL backend to pick high priority cuda streams.

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D23404286

fbshipit-source-id: b79ae097b7cd945a26e8ba1dd13ad3147ac790eb