broadcast_object API for c10d #43887
Conversation
As part of addressing #23232, this PR adds support for `broadcast_object`, an API to broadcast arbitrary picklable objects to all other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively. The implementation follows a similar approach to #42189.

One important note: this API returns the communicated object instead of filling in an argument, since assigning the object to an input parameter would not be reflected outside the function call.

Note that the API is designed to match the tensor-based collectives except for supporting async_op. For now, it is a blocking call. If we see demand for async_op support, we will have to make more progress on merging work/future to support it.

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23422577/)!
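For context, a minimal sketch of the pickle-and-broadcast approach described above, built on the existing tensor `broadcast` collective. The helper name `_broadcast_object` is hypothetical; the real implementation also handles process groups, device placement, and error cases.

```python
import pickle

import torch
import torch.distributed as dist


def _broadcast_object(obj, src=0):
    # Hypothetical sketch: broadcast a picklable object via tensor collectives.
    rank = dist.get_rank()
    if rank == src:
        # Serialize the object into a byte tensor on the source rank.
        payload = pickle.dumps(obj)
        data = torch.tensor(list(payload), dtype=torch.uint8)
        length = torch.tensor([data.numel()], dtype=torch.long)
    else:
        data = None
        length = torch.tensor([0], dtype=torch.long)

    # Broadcast the payload size first, so non-source ranks can allocate
    # a receive buffer of the right size.
    dist.broadcast(length, src=src)
    if rank != src:
        data = torch.empty(length.item(), dtype=torch.uint8)

    # Broadcast the pickled bytes, then deserialize on every rank.
    dist.broadcast(data, src=src)
    return pickle.loads(bytes(data.tolist()))
```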
Reviewed docstring excerpt:

```
.. note:: Note that this API differs slightly from the broadcast collective
   since it does not provide an ``async_op`` handle and thus will be a
   blocking call, and returns the object instead of setting the value in
```
Question: would it be reasonable to take an object_list instead of an object, so that we can populate the result there? I am a little concerned about returning the object here, because we might add async_op support later, which would make this API return a Future. So, if we return the result object, future changes might introduce BC issues.
If we decide to use object_list as the argument, let's also rename the API to broadcast_object_list.
Sure, so if I understand, the semantics will be:
The src rank calls broadcast_object_list([obj], ...)
Other ranks call broadcast_object_list([None], ...)
And after the call, the passed-in list would contain the broadcast object at index 0? And in general, we wouldn't have to limit this to size-1 lists; it could be any size.
Yep. For the non-source ranks, we can ignore all existing contents of the list. So it does not matter what is in the list when the API is called; we can always clear the list and then populate it with the results.
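A short usage sketch of the semantics agreed on above, using the renamed `broadcast_object_list` API (the exact signature at the time of this PR may differ slightly; this assumes the process group is already initialized, e.g. via `dist.init_process_group("gloo", ...)`):

```python
import torch.distributed as dist

if dist.get_rank() == 0:
    # The source rank supplies the objects to broadcast.
    objects = [{"lr": 0.01}, "checkpoint-42"]
else:
    # Non-source ranks pass placeholders; their contents are ignored
    # and overwritten in place with the objects from the source rank.
    objects = [None, None]

dist.broadcast_object_list(objects, src=0)
# On every rank, objects is now [{"lr": 0.01}, "checkpoint-42"].
```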
💊 CI failures summary and remediations
As of commit f651b6a (more details on the Dr. CI page):
Extra GitHub checks: 1 failed
codecov.io: 1 failed
mrshenli left a comment:
LGTM! The test failures are unrelated, but shall we rebase to get clean test signals?
Codecov Report

```diff
@@                Coverage Diff                 @@
##      gh/rohan-varma/159/base   #43887     +/-   ##
===========================================================
- Coverage                69.29%   69.25%   -0.04%
===========================================================
  Files                      379      379
  Lines                    47036    47062      +26
===========================================================
  Hits                     32592    32592
- Misses                   14444    14470      +26
```

Continue to review the full report at Codecov.
This pull request has been merged in fbea2ee.
Stack from ghstack:
As part of addressing #23232, this PR adds support for `broadcast_object_list`, an API to broadcast arbitrary picklable objects to all other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.

The implementation follows a similar approach to #42189. The input is a list of objects to be broadcast, and the call is in place: every rank in the group has its input list modified to contain the broadcast objects from the src rank.

Note that the API is designed to match the tensor-based collectives except for supporting async_op. For now, it is a blocking call. If we see demand for async_op support, we will have to make more progress on merging work/future to support it.
Differential Revision: D23422577
NOTE FOR REVIEWERS: This PR has internal Facebook-specific changes or comments; please review them on Phabricator!
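For completeness, a minimal end-to-end sketch of the merged in-place API: two processes on one machine with the gloo backend. The world size and the use of a temporary file store are illustrative, not part of the PR.

```python
import tempfile

import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, world_size, init_file):
    # Each process joins the same group via a shared file store.
    dist.init_process_group(
        "gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    if rank == 0:
        objects = [{"step": 100}, ["any", "picklable", "object"]]
    else:
        objects = [None, None]  # placeholders, overwritten in place

    dist.broadcast_object_list(objects, src=0)  # blocking; no async_op
    print(f"rank {rank}: {objects}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    init_file = tempfile.NamedTemporaryFile(delete=False).name
    mp.spawn(run, args=(world_size, init_file), nprocs=world_size)
```

Both ranks print the same list, illustrating that non-source ranks receive the broadcast objects in place rather than via a return value.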