Skip to content

Conversation

@H-Huang
Copy link
Member

@H-Huang H-Huang commented Aug 18, 2022

Stack from ghstack:

Changes:

Context

#86225

Differential Revision: D38839213

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Aug 18, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 386bb80 (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 18, 2022
@H-Huang H-Huang requested a review from kwen2501 August 18, 2022 16:13
@H-Huang
Copy link
Member Author

H-Huang commented Aug 18, 2022

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

### Changes:
- Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp.

#### Motivation
In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type.

Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213)

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Aug 19, 2022

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

### Changes:
- Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp.

#### Motivation
In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type.

Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213)

[ghstack-poisoned]
@vadimkantorov
Copy link
Contributor

This would be nice because even in GPU training sometimes broadcasting CPU tensor is needed. And in reverse, for CPU-only inference, a different backend should be used...

Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

"Backend ",
backendName,
" does not yet support sequence numbers."));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for our record: we'd need to think whether SequenceNumber related methods should stay here or be moved to ProcessGroup. For the purpose of smooth refactorization, they can stay here for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, agree on this!

### Changes:
- Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp.

#### Motivation
In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type.

Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213)

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Aug 31, 2022

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here and land check progress here.
The merge job was triggered with the land checks (-l) flag. If you did not specify this flag yourself, you are likely enrolled in the land checks rollout. This means that your change will be merged once all checks on your PR and the land checks have passed (ETA 4 Hours). If you need to coordinate lands between different changes and cannot risk a land race, please add the ciflow/trunk label to your PR and wait for signal to complete, and then land your changes in proper order. Having trunk, pull, and Lint pre-run on a PR will bypass land checks and the ETA should be immediate. If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

pytorchmergebot pushed a commit that referenced this pull request Aug 31, 2022
### Changes:
- Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp.

#### Motivation
In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type.

Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213)
Pull Request resolved: #83679
Approved by: https://github.com/kwen2501
facebook-github-bot pushed a commit that referenced this pull request Sep 2, 2022
Summary:
### Changes:
- Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp.

#### Motivation
In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type.

Pull Request resolved: #83679
Approved by: https://github.com/kwen2501

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/693ed8b14777d1515c18653f5f8f28a602898662

Original Phabricator Test Plan:
Imported from OSS

Reviewed By: mehtanirav, kwen2501

Differential Revision: D38839213

Pulled By: H-Huang

fbshipit-source-id: 33f25f05d6c72b4ad854228c632d5bb77ab6a97b
@facebook-github-bot facebook-github-bot deleted the gh/H-Huang/73/head branch September 4, 2022 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged module: c10d Issues/PRs related to collective communications and process groups oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants