
Conversation

@rohan-varma rohan-varma (Contributor) commented Aug 31, 2020

Stack from ghstack:

As part of addressing #23232, this PR adds support for `broadcast_object_list`, an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.

The implementation follows a similar approach to #42189. The input is a list of objects to be broadcast, and the operation is in place: every rank in the group will have its input list modified to contain the objects broadcast from the src rank.

Note that the API is designed to match the tensor-based collectives, except that it does not support async_op; for now it is a blocking call. If we see demand for async_op support, we will have to make more progress on merging work/Future to support it.

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23422577/)!
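Below is a minimal usage sketch of the API as described above; it is illustrative rather than taken from this PR's tests, and it assumes a process group has already been initialized (e.g. with the gloo backend) and that a hypothetical `run(rank)` helper is launched on every rank.

```python
import torch.distributed as dist

def run(rank):
    if rank == 0:
        # The src rank supplies the actual picklable objects.
        objects = [{"epoch": 5}, "checkpoint.pt", [1, 2, 3]]
    else:
        # Non-src ranks pass placeholders of the same length.
        objects = [None, None, None]
    # Blocking, in-place collective: after the call, every rank's list
    # holds the objects broadcast from rank 0.
    dist.broadcast_object_list(objects, src=0)
    print(f"rank {rank}: {objects}")
```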

rohan-varma added a commit that referenced this pull request Aug 31, 2020
Pull Request resolved: #43887

As part of addressing #23232, this PR adds support for `broadcast_object`, an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.

The implementation follows a similar approach to #42189. One important note is that this API returns the communicated object instead of filling in an argument, since assigning the object to an input argument would not be reflected outside the function call.

Note that the API is designed to match the tensor-based collectives, except that it does not support async_op; for now it is a blocking call. If we see demand for async_op support, we will have to make more progress on merging work/Future to support it.
ghstack-source-id: 111065118

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23422577/)!
> .. note:: Note that this API differs slightly from the broadcast collective
> since it does not provide an ``async_op`` handle and thus will be a
> blocking call, and returns the object instead of setting the value in
Contributor
Question: would it be reasonable to take an object_list instead of an object, so that we can populate the result there? I am a little concerned about returning the object here, because we might add async_op support later, which would make this API return a Future. So, if we return the result object, future changes might introduce BC issues.

If we decide to use object_list as the argument, let's also rename the API to broadcast_object_list.

Contributor Author

Sure, so if I understand correctly, the semantics would be:
- The src rank calls `broadcast_object_list([obj], ...)`
- Other ranks call `broadcast_object_list([None], ...)`

And after the call, the passed-in list would have the broadcast object at index 0? And in general, we wouldn't have to limit this to size-1 lists; it could be any size.

Contributor

Yep. For the non-source ranks, we can ignore any existing contents of the list, so it does not matter what is in the list when the API is called; we can always clear the list and then populate it with the results.
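To make the in-place, list-based semantics discussed above concrete, here is a rough sketch of the pickling-based approach the description refers to (broadcast each object's pickled size first, then the concatenated bytes, and unpickle on non-src ranks). The helper names below are illustrative and are not the PR's actual code:

```python
import io
import pickle
import torch
import torch.distributed as dist

def _object_to_tensor(obj):
    # Serialize an arbitrary picklable object into a 1-D uint8 tensor.
    buf = io.BytesIO()
    pickle.dump(obj, buf)
    data = torch.tensor(list(buf.getvalue()), dtype=torch.uint8)
    return data, torch.tensor([data.numel()], dtype=torch.long)

def broadcast_object_list_sketch(object_list, src=0):
    rank = dist.get_rank()
    if rank == src:
        tensors, sizes = zip(*[_object_to_tensor(o) for o in object_list])
        object_sizes = torch.cat(sizes)
    else:
        object_sizes = torch.zeros(len(object_list), dtype=torch.long)
    # First broadcast how many bytes each pickled object occupies...
    dist.broadcast(object_sizes, src=src)
    if rank == src:
        data = torch.cat(list(tensors))
    else:
        data = torch.empty(int(object_sizes.sum()), dtype=torch.uint8)
    # ...then broadcast the concatenated pickled bytes in a single call.
    dist.broadcast(data, src=src)
    if rank != src:
        offset = 0
        for i, size in enumerate(object_sizes.tolist()):
            chunk = bytes(data[offset:offset + size].tolist())
            object_list[i] = pickle.loads(chunk)  # populate in place
            offset += size
```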

@dr-ci dr-ci bot commented Aug 31, 2020

💊 CI failures summary and remediations

As of commit f651b6a (more details on the Dr. CI page):

* 2/2 failures possibly* introduced in this PR
  * 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed
* codecov.io: 1 failed



rohan-varma added a commit that referenced this pull request Aug 31, 2020
Pull Request resolved: #43887
ghstack-source-id: 111098700

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)
@rohan-varma rohan-varma requested a review from mrshenli August 31, 2020 22:56
rohan-varma added a commit that referenced this pull request Sep 1, 2020
Pull Request resolved: #43887
ghstack-source-id: 111108942

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)
@mrshenli mrshenli (Contributor) left a comment

LGTM! The test failures are irrelevant, but shall we rebase to get clean test signals?

@codecov codecov bot commented Sep 2, 2020

Codecov Report

Merging #43887 into gh/rohan-varma/159/base will decrease coverage by 0.03%.
The diff coverage is 3.84%.


@@                     Coverage Diff                     @@
##           gh/rohan-varma/159/base   #43887      +/-   ##
===========================================================
- Coverage                    69.29%   69.25%   -0.04%     
===========================================================
  Files                          379      379              
  Lines                        47036    47062      +26     
===========================================================
  Hits                         32592    32592              
- Misses                       14444    14470      +26     
| Impacted Files | Coverage Δ |
|---|---|
| torch/distributed/distributed_c10d.py | 27.33% <3.84%> (-1.01%) ⬇️ |
| torch/testing/_internal/expecttest.py | 77.55% <0.00%> (-1.03%) ⬇️ |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a67246b...f651b6a.

@facebook-github-bot (Contributor)
This pull request has been merged in fbea2ee.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/159/head branch September 5, 2020 14:16