
Conversation

@rohan-varma rohan-varma (Contributor) commented Aug 31, 2020

Stack from ghstack:

As part of addressing #23232, this PR adds support for `broadcast_object_list`, an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.

The implementation follows a similar approach to #42189. The input is a list of objects to be broadcast, and the operation is in place: every rank in the group will have its input list modified to contain the objects broadcast from the src rank.

Note that the API is designed to match the tensor-based collectives, except that it does not support async_op; for now it is a blocking call. If we see demand for async_op support, we will have to make more progress on merging work/Future to support it.

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23422577/)!
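Below is a minimal usage sketch of the API as described above; it is illustrative rather than taken from this PR's tests, and it assumes a process group has already been initialized (e.g. with the gloo backend) and that a hypothetical `run(rank)` helper is launched on every rank.

```python
import torch.distributed as dist

def run(rank):
    if rank == 0:
        # The src rank supplies the actual picklable objects.
        objects = [{"epoch": 5}, "checkpoint.pt", [1, 2, 3]]
    else:
        # Non-src ranks pass placeholders of the same length.
        objects = [None, None, None]
    # Blocking, in-place collective: after the call, every rank's list
    # holds the objects broadcast from rank 0.
    dist.broadcast_object_list(objects, src=0)
    print(f"rank {rank}: {objects}")
```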

rohan-varma added a commit that referenced this pull request Aug 31, 2020
Pull Request resolved: #43887

As part of addressing #23232, this PR adds support for `broadcast_object`, an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.

The implementation follows a similar approach to #42189. One important note is that this API returns the communicated object instead of filling in an argument, since assigning the object to an input argument would not be reflected outside the function call.

Note that the API is designed to match the tensor-based collectives, except that it does not support async_op; for now it is a blocking call. If we see demand for async_op support, we will have to make more progress on merging work/Future to support it.
ghstack-source-id: 111065118

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23422577/)!
> .. note:: Note that this API differs slightly from the broadcast collective
> since it does not provide an ``async_op`` handle and thus will be a
> blocking call, and returns the object instead of setting the value in
Contributor
Question: would it be reasonable to take an object_list instead of an object, so that we can populate the result there? I am a little concerned about returning the object here, because we might add async_op support later, which would make this API return a Future. So, if we return the result object, future changes might introduce BC issues.

If we decide to use object_list as the argument, let's also rename the API to broadcast_object_list.

Contributor Author

Sure, so if I understand correctly, the semantics would be:
- The src rank calls `broadcast_object_list([obj], ...)`
- Other ranks call `broadcast_object_list([None], ...)`

And after the call, the passed-in list would have the broadcast object at index 0? And in general, we wouldn't have to limit this to size-1 lists; it could be any size.

Contributor

Yep. For the non-source ranks, we can ignore any existing contents of the list, so it does not matter what is in the list when the API is called; we can always clear the list and then populate it with the results.
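To make the in-place, list-based semantics discussed above concrete, here is a rough sketch of the pickling-based approach the description refers to (broadcast each object's pickled size first, then the concatenated bytes, and unpickle on non-src ranks). The helper names below are illustrative and are not the PR's actual code:

```python
import io
import pickle
import torch
import torch.distributed as dist

def _object_to_tensor(obj):
    # Serialize an arbitrary picklable object into a 1-D uint8 tensor.
    buf = io.BytesIO()
    pickle.dump(obj, buf)
    data = torch.tensor(list(buf.getvalue()), dtype=torch.uint8)
    return data, torch.tensor([data.numel()], dtype=torch.long)

def broadcast_object_list_sketch(object_list, src=0):
    rank = dist.get_rank()
    if rank == src:
        tensors, sizes = zip(*[_object_to_tensor(o) for o in object_list])
        object_sizes = torch.cat(sizes)
    else:
        object_sizes = torch.zeros(len(object_list), dtype=torch.long)
    # First broadcast how many bytes each pickled object occupies...
    dist.broadcast(object_sizes, src=src)
    if rank == src:
        data = torch.cat(list(tensors))
    else:
        data = torch.empty(int(object_sizes.sum()), dtype=torch.uint8)
    # ...then broadcast the concatenated pickled bytes in a single call.
    dist.broadcast(data, src=src)
    if rank != src:
        offset = 0
        for i, size in enumerate(object_sizes.tolist()):
            chunk = bytes(data[offset:offset + size].tolist())
            object_list[i] = pickle.loads(chunk)  # populate in place
            offset += size
```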

@dr-ci dr-ci bot commented Aug 31, 2020

💊 CI failures summary and remediations

As of commit f651b6a (more details on the Dr. CI page):

* 2/2 failures possibly* introduced in this PR
  * 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed
* codecov.io: 1 failed



rohan-varma added a commit that referenced this pull request Aug 31, 2020
Pull Request resolved: #43887
ghstack-source-id: 111098700

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)
@rohan-varma rohan-varma requested a review from mrshenli August 31, 2020 22:56
rohan-varma added a commit that referenced this pull request Sep 1, 2020
Pull Request resolved: #43887
ghstack-source-id: 111108942

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)
@mrshenli mrshenli (Contributor) left a comment

LGTM! The test failures are irrelevant, but shall we rebase to get clean test signals?

@codecov codecov bot commented Sep 2, 2020

Codecov Report

Merging #43887 into gh/rohan-varma/159/base will decrease coverage by 0.03%.
The diff coverage is 3.84%.


@@                     Coverage Diff                     @@
##           gh/rohan-varma/159/base   #43887      +/-   ##
===========================================================
- Coverage                    69.29%   69.25%   -0.04%     
===========================================================
  Files                          379      379              
  Lines                        47036    47062      +26     
===========================================================
  Hits                         32592    32592              
- Misses                       14444    14470      +26     
| Impacted Files | Coverage Δ |
|---|---|
| torch/distributed/distributed_c10d.py | 27.33% <3.84%> (-1.01%) ⬇️ |
| torch/testing/_internal/expecttest.py | 77.55% <0.00%> (-1.03%) ⬇️ |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a67246b...f651b6a.

@facebook-github-bot (Contributor)
This pull request has been merged in fbea2ee.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/159/head branch September 5, 2020 14:16