Skip to content

Conversation

@H-Huang
Copy link
Member

@H-Huang H-Huang commented Aug 19, 2022

Stack from ghstack:

About this PR

  • Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
  • Add test to validate that a separate device implementation is not supported.

Context

#86225

Differential Revision: D38876771

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Aug 19, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 101e8ce (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 19, 2022
@H-Huang H-Huang requested review from bdhirsh and kwen2501 August 19, 2022 14:25
@H-Huang H-Huang added module: c10d Issues/PRs related to collective communications and process groups release notes: distributed (c10d) release notes category topic: new features topic category labels Aug 19, 2022
// __torch_dispatch__.
m.def(
"broadcast_",
dispatch(c10::DispatchKey::CompositeExplicitAutograd, broadcast_));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @bdhirsh, in the comment above, Jiewen mentioned "It's important to register the op to the CompositeExplicitAutograd key to enable __ torch_dispatch __.".

With this new change we are not specifying a default implementation or the CompositeExplicitAutograd dispatch key, is this okay? Will __ torch_dispatch __ still work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tldr is yep, it should still work fine.

(I think the thing Jiewen was referring to is that if you made your kernel CompositeImplicitAutograd, it wouldn't work with __torch_dispatch__. You're writing separate CPU + CUDA implementations though, which will work fine)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@H-Huang nit: shall we update the comment here given @bdhirsh's comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep sounds good

…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.


[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Aug 19, 2022
…implementations

ghstack-source-id: 9319c21
Pull Request resolved: #83735
…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.


[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Aug 19, 2022
…implementations

ghstack-source-id: c841997
Pull Request resolved: #83735
…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.


[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Aug 20, 2022
…implementations

ghstack-source-id: f69bfc7
Pull Request resolved: #83735
…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.


[ghstack-poisoned]
# correctly dispatched

# negative test to make sure a non-supported device fails during dispatch call
nonsupported_device = torch.device("meta")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I'm missing something but it doesn't see like there's tests with CPU and GPU tensors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, since this is a big change we are trying to do it piecewise. As of this PR, ProcessGroup doesn't support a list of backends, so we are still using ProcessGroupNCCL and ProcessGroupGloo to test this. Later once ProcessGroup adds that support, this test will also be updated to test the dispatching of CPU and GPU tensors

Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just minor comments added. Thanks!

// __torch_dispatch__.
m.def(
"broadcast_",
dispatch(c10::DispatchKey::CompositeExplicitAutograd, broadcast_));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@H-Huang nit: shall we update the comment here given @bdhirsh's comment?

int64_t timeout);

} // namespace ops
} // namespace c10d
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we need this header file?
Other than OpsImpl.cpp, anywhere else is the header file needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that is a good point, I think we don't actually need it. Removing.

…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)

[ghstack-poisoned]
…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)

[ghstack-poisoned]
…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)

[ghstack-poisoned]
…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)

[ghstack-poisoned]
…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)

[ghstack-poisoned]
…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)

[ghstack-poisoned]
…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 13, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/83735

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b98af33:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…U and CUDA implementations"


### Changes
- Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
- Add test to validate that a separate device implementation is not supported.

#### Motivation
In the future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Sep 14, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

…U and CUDA implementations"


### About this PR
* Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
* Add test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Sep 28, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@H-Huang
Copy link
Member Author

H-Huang commented Sep 28, 2022

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here and land check progress here.
The merge job was triggered with the land checks (-l) flag. If you did not specify this flag yourself, you are likely enrolled in the land checks rollout. This means that your change will be merged once all checks on your PR and the land checks have passed (ETA 4 Hours). If you need to coordinate lands between different changes and cannot risk a land race, please add the ciflow/trunk label to your PR and wait for signal to complete, and then land your changes in proper order. Having trunk, pull, and Lint pre-run on a PR will bypass land checks and the ETA should be immediate. If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

pytorchmergebot pushed a commit that referenced this pull request Sep 28, 2022
…implementations (#83735)

### About this PR
* Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
* Add test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: #83735
Approved by: https://github.com/kwen2501
@kwen2501
Copy link
Collaborator

Nice job! This is so exciting!

drisspg pushed a commit to drisspg/pytorch that referenced this pull request Sep 29, 2022
…implementations (pytorch#83735)

### About this PR
* Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
* Add test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: pytorch#83735
Approved by: https://github.com/kwen2501
@facebook-github-bot facebook-github-bot deleted the gh/H-Huang/75/head branch October 1, 2022 14:19
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
…implementations (#83735)

### About this PR
* Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
* Add test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: #83735
Approved by: https://github.com/kwen2501
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged module: c10d Issues/PRs related to collective communications and process groups oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category topic: new features topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants