-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[1/N] [Dispatchable Collectives] Create Backend class #83679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful links
✅ No Failures (0 Pending)As of commit 386bb80 (more details on the Dr. CI page): Expand to see more💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
|
@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
### Changes: - Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp. #### Motivation In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213) [ghstack-poisoned]
|
@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
### Changes: - Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp. #### Motivation In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213) [ghstack-poisoned]
|
This would be nice because even in GPU training sometimes broadcasting CPU tensor is needed. And in reverse, for CPU-only inference, a different backend should be used... |
kwen2501
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks!
| "Backend ", | ||
| backendName, | ||
| " does not yet support sequence numbers.")); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for our record: we'd need to think whether SequenceNumber related methods should stay here or be moved to ProcessGroup. For the purpose of smooth refactorization, they can stay here for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, agree on this!
### Changes: - Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp. #### Motivation In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213) [ghstack-poisoned]
|
@pytorchbot merge |
|
@pytorchbot successfully started a merge job. Check the current status here and land check progress here. |
### Changes: - Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp. #### Motivation In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213) Pull Request resolved: #83679 Approved by: https://github.com/kwen2501
Summary: ### Changes: - Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp. #### Motivation In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. Pull Request resolved: #83679 Approved by: https://github.com/kwen2501 Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/693ed8b14777d1515c18653f5f8f28a602898662 Original Phabricator Test Plan: Imported from OSS Reviewed By: mehtanirav, kwen2501 Differential Revision: D38839213 Pulled By: H-Huang fbshipit-source-id: 33f25f05d6c72b4ad854228c632d5bb77ab6a97b
Stack from ghstack:
Changes:
Context
#86225
Differential Revision: D38839213