[3/N] [Dispatchable Collectives] Update broadcast_ with CPU and CUDA implementations #83735
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit 101e8ce.
// __torch_dispatch__.
m.def(
    "broadcast_",
    dispatch(c10::DispatchKey::CompositeExplicitAutograd, broadcast_));
Hey @bdhirsh, in the comment above, Jiewen mentioned "It's important to register the op to the CompositeExplicitAutograd key to enable __torch_dispatch__."
With this new change we are not specifying a default implementation or the CompositeExplicitAutograd dispatch key. Is this okay? Will __torch_dispatch__ still work?
tldr is yep, it should still work fine.
(I think the thing Jiewen was referring to is that if you made your kernel CompositeImplicitAutograd, it wouldn't work with __torch_dispatch__. You're writing separate CPU + CUDA implementations though, which will work fine)
Yep sounds good
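For reference, below is a minimal sketch of the pattern being discussed: the schema is defined once and separate kernels are registered per backend, so the dispatcher routes on the input tensor's device and tensor subclasses still intercept the whole op via __torch_dispatch__. It uses the Python torch.library API with an illustrative myops::my_broadcast op; the actual c10d registration in this PR is done in C++ and its schema is not reproduced here.

```python
import torch
from torch.library import Library

# Illustrative namespace and op, not the real c10d declarations.
lib = Library("myops", "DEF")
lib.define("my_broadcast(Tensor t) -> Tensor")

def my_broadcast_cpu(t):
    # Stand-in for the CPU collective logic.
    return t.clone()

def my_broadcast_cuda(t):
    # Stand-in for the CUDA collective logic.
    return t.clone()

# Per-backend kernels: the dispatcher picks one based on t.device.
lib.impl("my_broadcast", my_broadcast_cpu, "CPU")
lib.impl("my_broadcast", my_broadcast_cuda, "CUDA")

# No CompositeImplicitAutograd decomposition is registered, so a tensor
# subclass using __torch_dispatch__ sees myops::my_broadcast as a single op.
out = torch.ops.myops.my_broadcast(torch.ones(2))
```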
…U and CUDA implementations"

### Changes
* Update the broadcast op to dispatch to CPU and CUDA implementations. Right now they both perform the same logic, so this is essentially a no-op.
* Add a test to validate that a device without a registered implementation is not supported.

#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and dispatch to them based on tensor type. The CPU and CUDA implementations will be updated so that the process group selects its CPU and CUDA backends respectively.

[ghstack-poisoned]
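As a rough, purely hypothetical illustration of the direction described in that motivation (none of the names below, such as `backends` or `Backend.broadcast`, come from this PR), the per-device kernels would eventually just select the matching backend from the process group:

```python
# Hypothetical sketch only: attribute and method names are illustrative.
def broadcast_cpu_(tensors, process_group, opts):
    backend = process_group.backends["cpu"]    # e.g. a ProcessGroupGloo instance
    return backend.broadcast(tensors, opts)    # forward to the CPU backend

def broadcast_cuda_(tensors, process_group, opts):
    backend = process_group.backends["cuda"]   # e.g. a ProcessGroupNCCL instance
    return backend.broadcast(tensors, opts)    # forward to the CUDA backend
```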
test/distributed/test_c10d_common.py (outdated)

# correctly dispatched

# negative test to make sure a non-supported device fails during dispatch call
nonsupported_device = torch.device("meta")
Maybe I'm missing something, but it doesn't seem like there are tests with CPU and GPU tensors?
Right, since this is a big change we are trying to do it piecewise. As of this PR, ProcessGroup doesn't support a list of backends, so we are still using ProcessGroupNCCL and ProcessGroupGloo to test this. Later, once ProcessGroup adds that support, this test will also be updated to cover the dispatching of CPU and GPU tensors.
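A self-contained sketch of that negative-test idea, using an illustrative op rather than the real c10d one (the actual test lives in test/distributed/test_c10d_common.py): with only CPU and CUDA kernels registered, a call with a tensor on an unsupported device such as "meta" should fail at dispatch time.

```python
import unittest

import torch
from torch.library import Library

# Illustrative op with CPU and CUDA kernels only (not the real c10d registration).
lib = Library("testops", "DEF")
lib.define("my_broadcast(Tensor t) -> Tensor")
lib.impl("my_broadcast", lambda t: t.clone(), "CPU")
lib.impl("my_broadcast", lambda t: t.clone(), "CUDA")


class DispatchNegativeTest(unittest.TestCase):
    def test_unsupported_device_fails_dispatch(self):
        # No kernel is registered for the Meta backend, so the dispatcher raises
        # (PyTorch signals this with NotImplementedError, a RuntimeError subclass).
        nonsupported_device = torch.device("meta")
        t = torch.empty(4, device=nonsupported_device)
        with self.assertRaises(RuntimeError):
            torch.ops.testops.my_broadcast(t)


if __name__ == "__main__":
    unittest.main()
```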
kwen2501 left a comment:
LGTM! Just minor comments added. Thanks!
int64_t timeout);

} // namespace ops
} // namespace c10d
nit: do we need this header file? Other than OpsImpl.cpp, is the header file needed anywhere else?
Oh, that is a good point, I think we don't actually need it. Removing.
Dr. CI: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/83735. ✅ No failures as of commit b98af33.
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
@pytorchbot successfully started a merge job. Check the current status here and land check progress here.
…implementations (#83735)

### About this PR
* Update the broadcast op to dispatch to CPU and CUDA implementations. Right now they both perform the same logic, so this is essentially a no-op.
* Add a test to validate that a device without a registered implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and dispatch to them based on tensor type. The CPU and CUDA implementations will be updated so that the process group selects its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: #83735
Approved by: https://github.com/kwen2501
Nice job! This is so exciting!
Stack from ghstack:

About this PR
Context: #86225

Differential Revision: D38876771