[3/N] [Dispatchable Collectives] Update broadcast_ with CPU and CUDA implementations #83735
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit 101e8ce.
// __torch_dispatch__.
m.def(
    "broadcast_",
    dispatch(c10::DispatchKey::CompositeExplicitAutograd, broadcast_));
Hey @bdhirsh, in the comment above, Jiewen mentioned "It's important to register the op to the CompositeExplicitAutograd key to enable __torch_dispatch__."
With this new change we are not specifying a default implementation or the CompositeExplicitAutograd dispatch key. Is this okay? Will __torch_dispatch__ still work?
tldr is yep, it should still work fine.
(I think the thing Jiewen was referring to is that if you made your kernel CompositeImplicitAutograd, it wouldn't work with __torch_dispatch__. You're writing separate CPU + CUDA implementations though, which will work fine)
Yep sounds good
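For reference, below is a minimal sketch of the pattern being discussed: the schema is defined once and separate kernels are registered per backend, so the dispatcher routes on the input tensor's device and tensor subclasses still intercept the whole op via __torch_dispatch__. It uses the Python torch.library API with an illustrative myops::my_broadcast op; the actual c10d registration in this PR is done in C++ and its schema is not reproduced here.

```python
import torch
from torch.library import Library

# Illustrative namespace and op, not the real c10d declarations.
lib = Library("myops", "DEF")
lib.define("my_broadcast(Tensor t) -> Tensor")

def my_broadcast_cpu(t):
    # Stand-in for the CPU collective logic.
    return t.clone()

def my_broadcast_cuda(t):
    # Stand-in for the CUDA collective logic.
    return t.clone()

# Per-backend kernels: the dispatcher picks one based on t.device.
lib.impl("my_broadcast", my_broadcast_cpu, "CPU")
lib.impl("my_broadcast", my_broadcast_cuda, "CUDA")

# No CompositeImplicitAutograd decomposition is registered, so a tensor
# subclass using __torch_dispatch__ sees myops::my_broadcast as a single op.
out = torch.ops.myops.my_broadcast(torch.ones(2))
```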
…U and CUDA implementations"

### Changes
* Update the broadcast op to dispatch to CPU and CUDA implementations. Right now they both perform the same logic, so this is essentially a no-op.
* Add a test to validate that a device without a registered implementation is not supported.

#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and dispatch to them based on tensor type. The CPU and CUDA implementations will be updated so that the process group selects its CPU and CUDA backends respectively.

[ghstack-poisoned]
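As a rough, purely hypothetical illustration of the direction described in that motivation (none of the names below, such as `backends` or `Backend.broadcast`, come from this PR), the per-device kernels would eventually just select the matching backend from the process group:

```python
# Hypothetical sketch only: attribute and method names are illustrative.
def broadcast_cpu_(tensors, process_group, opts):
    backend = process_group.backends["cpu"]    # e.g. a ProcessGroupGloo instance
    return backend.broadcast(tensors, opts)    # forward to the CPU backend

def broadcast_cuda_(tensors, process_group, opts):
    backend = process_group.backends["cuda"]   # e.g. a ProcessGroupNCCL instance
    return backend.broadcast(tensors, opts)    # forward to the CUDA backend
```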
test/distributed/test_c10d_common.py (outdated)

# correctly dispatched

# negative test to make sure a non-supported device fails during dispatch call
nonsupported_device = torch.device("meta")
Maybe I'm missing something, but it doesn't seem like there are tests with CPU and GPU tensors?
Right, since this is a big change we are trying to do it piecewise. As of this PR, ProcessGroup doesn't support a list of backends, so we are still using ProcessGroupNCCL and ProcessGroupGloo to test this. Later, once ProcessGroup adds that support, this test will also be updated to cover the dispatching of CPU and GPU tensors.
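A self-contained sketch of that negative-test idea, using an illustrative op rather than the real c10d one (the actual test lives in test/distributed/test_c10d_common.py): with only CPU and CUDA kernels registered, a call with a tensor on an unsupported device such as "meta" should fail at dispatch time.

```python
import unittest

import torch
from torch.library import Library

# Illustrative op with CPU and CUDA kernels only (not the real c10d registration).
lib = Library("testops", "DEF")
lib.define("my_broadcast(Tensor t) -> Tensor")
lib.impl("my_broadcast", lambda t: t.clone(), "CPU")
lib.impl("my_broadcast", lambda t: t.clone(), "CUDA")


class DispatchNegativeTest(unittest.TestCase):
    def test_unsupported_device_fails_dispatch(self):
        # No kernel is registered for the Meta backend, so the dispatcher raises
        # (PyTorch signals this with NotImplementedError, a RuntimeError subclass).
        nonsupported_device = torch.device("meta")
        t = torch.empty(4, device=nonsupported_device)
        with self.assertRaises(RuntimeError):
            torch.ops.testops.my_broadcast(t)


if __name__ == "__main__":
    unittest.main()
```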
kwen2501 left a comment:
LGTM! Just minor comments added. Thanks!
int64_t timeout);

} // namespace ops
} // namespace c10d
nit: do we need this header file? Other than OpsImpl.cpp, is the header file needed anywhere else?
Oh, that is a good point, I think we don't actually need it. Removing.
Dr. CI: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/83735. ✅ No failures as of commit b98af33.
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
@pytorchbot successfully started a merge job. Check the current status here and land check progress here.
…implementations (#83735)

### About this PR
* Update the broadcast op to dispatch to CPU and CUDA implementations. Right now they both perform the same logic, so this is essentially a no-op.
* Add a test to validate that a device without a registered implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and dispatch to them based on tensor type. The CPU and CUDA implementations will be updated so that the process group selects its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: #83735
Approved by: https://github.com/kwen2501
Nice job! This is so exciting!
Stack from ghstack:

About this PR
Context: #86225

Differential Revision: D38876771