Collective dispatching from Process Group #91257
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91257
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f005be3.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
kwen2501 left a comment:
Thanks for the clean-up! LGTM.
Please see my inline comment.
c10::intrusive_ptr<Work> broadcast(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList tensors,
    const BroadcastOptions& opts) {
  // TODO: handle the case of using a PythonProcessGroup which is used in
  // Reducer.cpp. This can be removed once
  // https://github.com/pytorch/pytorch/issues/90659 is resolved
  if (!process_group->hasBackends()) {
    auto tensor_vec = tensors.vec();
    return process_group->broadcast(tensor_vec, opts);
  }

  static auto op =
      c10::Dispatcher::singleton()
          .findSchemaOrThrow("c10d::broadcast_", "")
          .typed<std::tuple<std::vector<at::Tensor>, c10::intrusive_ptr<Work>>(
              at::TensorList,
              const c10::intrusive_ptr<::c10d::ProcessGroup>&,
              int64_t,
              int64_t,
              int64_t)>();
  // It's awkward to unbox the opts here and box them again in the custom C++
  // op. But it's also complicated to make opts as a CustomClassHolder. Leave it
  // as it is now.
  return std::get<1>(op.call(
      tensors,
      process_group,
      opts.rootRank,
      opts.rootTensor,
      opts.timeout.count()));
}

c10::intrusive_ptr<Work> allreduce(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList tensors,
    const AllreduceOptions& opts) {
  // TODO: handle the case of using a PythonProcessGroup which is used in
  // Reducer.cpp. This can be removed once
  // https://github.com/pytorch/pytorch/issues/90659 is resolved
  if (!process_group->hasBackends()) {
    auto tensor_vec = tensors.vec();
    return process_group->allreduce(tensor_vec, opts);
  }

  static auto op =
      c10::Dispatcher::singleton()
          .findSchemaOrThrow("c10d::allreduce_", "")
          .typed<std::tuple<std::vector<at::Tensor>, c10::intrusive_ptr<Work>>(
              at::TensorList,
              const c10::intrusive_ptr<::c10d::ProcessGroup>&,
              const c10::intrusive_ptr<::c10d::ReduceOp>&,
              int64_t)>();

  return std::get<1>(op.call(
      tensors,
      process_group,
      c10::make_intrusive<ReduceOp>(opts.reduceOp),
      opts.timeout.count()));
}

c10::intrusive_ptr<Work> allreduce_coalesced(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList tensors,
    const AllreduceCoalescedOptions& opts) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::allreduce_coalesced_", "")
                       .typed<c10::intrusive_ptr<::c10d::Work>(
                           at::TensorList,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           const c10::intrusive_ptr<::c10d::ReduceOp>&,
                           int64_t)>();

  return op.call(
      tensors,
      process_group,
      c10::make_intrusive<ReduceOp>(opts.reduceOp),
      opts.timeout.count());
}

c10::intrusive_ptr<Work> allgather(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    const std::vector<std::vector<at::Tensor>>& output_tensors,
    at::TensorList input_tensors,
    const AllgatherOptions& opts) {
  // TODO: handle the case of using a PythonProcessGroup which is used in
  // Reducer.cpp. This can be removed once
  // https://github.com/pytorch/pytorch/issues/90659 is resolved
  if (!process_group->hasBackends()) {
    auto input_tensors_vec = input_tensors.vec();
    return process_group->allgather(
        const_cast<std::vector<std::vector<at::Tensor>>&>(output_tensors),
        input_tensors_vec,
        opts);
  }

  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::allgather_", "")
                       .typed<std::tuple<
                           std::vector<std::vector<at::Tensor>>,
                           c10::intrusive_ptr<Work>>(
                           const std::vector<std::vector<at::Tensor>>&,
                           at::TensorList,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           int64_t)>();

  return std::get<1>(op.call(
      output_tensors, input_tensors, process_group, opts.timeout.count()));
}

c10::intrusive_ptr<Work> _allgather_base(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::Tensor& output_tensor,
    at::Tensor& input_tensor,
    const AllgatherOptions& opts) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::_allgather_base_", "")
                       .typed<std::tuple<at::Tensor, c10::intrusive_ptr<Work>>(
                           at::Tensor&,
                           at::Tensor&,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&)>();

  return std::get<1>(op.call(output_tensor, input_tensor, process_group));
}

c10::intrusive_ptr<Work> allgather_coalesced(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    const std::vector<std::vector<at::Tensor>>& output_lists,
    const at::TensorList& input_list,
    const AllgatherOptions& opts) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::allgather_coalesced_", "")
                       .typed<c10::intrusive_ptr<Work>(
                           const std::vector<std::vector<at::Tensor>>&,
                           const at::TensorList&,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&)>();

  return op.call(output_lists, input_list, process_group);
}

c10::intrusive_ptr<Work> reduce_scatter(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    const at::TensorList& output_tensors,
    const std::vector<std::vector<at::Tensor>>& input_tensors,
    const ReduceScatterOptions& opts) {
  static auto op =
      c10::Dispatcher::singleton()
          .findSchemaOrThrow("c10d::reduce_scatter_", "")
          .typed<std::tuple<std::vector<at::Tensor>, c10::intrusive_ptr<Work>>(
              const at::TensorList&,
              const std::vector<std::vector<at::Tensor>>&,
              const c10::intrusive_ptr<::c10d::ProcessGroup>&,
              const c10::intrusive_ptr<::c10d::ReduceOp>&,
              int64_t)>();
  return std::get<1>(op.call(
      output_tensors,
      input_tensors,
      process_group,
      c10::make_intrusive<::c10d::ReduceOp>(opts.reduceOp),
      opts.timeout.count()));
}

c10::intrusive_ptr<Work> _reduce_scatter_base(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::Tensor& output_tensor,
    at::Tensor& input_tensor,
    const ReduceScatterOptions& opts) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::_reduce_scatter_base_", "")
                       .typed<std::tuple<at::Tensor, c10::intrusive_ptr<Work>>(
                           at::Tensor&,
                           at::Tensor&,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           const c10::intrusive_ptr<::c10d::ReduceOp>&,
                           int64_t)>();
  return std::get<1>(op.call(
      output_tensor,
      input_tensor,
      process_group,
      c10::make_intrusive<::c10d::ReduceOp>(opts.reduceOp),
      opts.timeout.count()));
}

c10::intrusive_ptr<Work> reduce(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList tensors,
    const ReduceOptions& opts) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::reduce_", "")
                       .typed<c10::intrusive_ptr<::c10d::Work>(
                           at::TensorList,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           const c10::intrusive_ptr<::c10d::ReduceOp>&,
                           int64_t,
                           int64_t,
                           int64_t)>();
  return op.call(
      tensors,
      process_group,
      c10::make_intrusive<ReduceOp>(opts.reduceOp),
      opts.rootRank,
      opts.rootTensor,
      opts.timeout.count());
}

c10::intrusive_ptr<Work> gather(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    const std::vector<std::vector<at::Tensor>>& output_tensors,
    const at::TensorList& input_tensors,
    const GatherOptions& opts) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::gather_", "")
                       .typed<c10::intrusive_ptr<::c10d::Work>(
                           const std::vector<std::vector<at::Tensor>>&,
                           const at::TensorList&,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           int64_t,
                           int64_t)>();
  return op.call(
      output_tensors,
      input_tensors,
      process_group,
      opts.rootRank,
      opts.timeout.count());
}

c10::intrusive_ptr<Work> scatter(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    const at::TensorList& output_tensors,
    const std::vector<std::vector<at::Tensor>>& input_tensors,
    const ScatterOptions& opts) {
  static auto op =
      c10::Dispatcher::singleton()
          .findSchemaOrThrow("c10d::scatter_", "")
          .typed<std::tuple<std::vector<at::Tensor>, c10::intrusive_ptr<Work>>(
              const at::TensorList&,
              const std::vector<std::vector<at::Tensor>>&,
              const c10::intrusive_ptr<::c10d::ProcessGroup>&,
              int64_t,
              int64_t)>();
  return std::get<1>(op.call(
      output_tensors,
      input_tensors,
      process_group,
      opts.rootRank,
      opts.timeout.count()));
}

c10::intrusive_ptr<Work> alltoall(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    const at::TensorList& output_tensors,
    const at::TensorList& input_tensors,
    const AllToAllOptions& opts) {
  static auto op =
      c10::Dispatcher::singleton()
          .findSchemaOrThrow("c10d::alltoall_", "")
          .typed<std::tuple<std::vector<at::Tensor>, c10::intrusive_ptr<Work>>(
              const at::TensorList&,
              const at::TensorList&,
              const c10::intrusive_ptr<::c10d::ProcessGroup>&,
              int64_t)>();
  return std::get<1>(op.call(
      output_tensors, input_tensors, process_group, opts.timeout.count()));
}

c10::intrusive_ptr<Work> alltoall_base(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::Tensor& output,
    at::Tensor& input,
    std::vector<int64_t> output_split_sizes,
    std::vector<int64_t> input_split_sizes,
    const AllToAllOptions& opts) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::alltoall_base_", "")
                       .typed<c10::intrusive_ptr<::c10d::Work>(
                           at::Tensor&,
                           at::Tensor&,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           std::vector<int64_t>,
                           std::vector<int64_t>,
                           int64_t)>();
  return op.call(
      output,
      input,
      process_group,
      output_split_sizes,
      input_split_sizes,
      opts.timeout.count());
}

void monitored_barrier(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    const BarrierOptions& opts,
    bool wait_all_ranks) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::monitored_barrier_", "")
                       .typed<void(
                           at::Tensor,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           const std::vector<int64_t>&,
                           int64_t,
                           bool)>();
  // Default to using cpu implementation, monitored barrier is only for GLOO
  at::Tensor tensor = at::empty({0}, at::TensorOptions().device(at::kCPU));
  op.call(
      tensor,
      process_group,
      opts.device_ids,
      opts.timeout.count(),
      wait_all_ranks);
}

c10::intrusive_ptr<Work> barrier(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    const BarrierOptions& opts) {
  static at::Tensor tensor;
  // TODO: if nccl was specified then use it
  if (process_group->getBackendType() ==
      c10d::ProcessGroup::BackendType::NCCL) {
    // set cuda tensor
    tensor = at::empty(
        {1}, at::TensorOptions().device(at::DeviceType::CUDA).dtype(at::kByte));
  } else {
    // Default to using cpu implementation
    tensor = at::empty(
        {1}, at::TensorOptions().device(at::DeviceType::CPU).dtype(at::kByte));
  }

  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::barrier", "")
                       .typed<c10::intrusive_ptr<::c10d::Work>(
                           at::Tensor,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           const std::vector<int64_t>&,
                           int64_t)>();

  return op.call(tensor, process_group, opts.device_ids, opts.timeout.count());
}

c10::intrusive_ptr<Work> send(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList tensors,
    int64_t dstRank,
    int64_t tag) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::send", "")
                       .typed<c10::intrusive_ptr<::c10d::Work>(
                           at::TensorList,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           int64_t,
                           int64_t)>();
  return op.call(tensors, process_group, dstRank, tag);
}

c10::intrusive_ptr<Work> recv(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList tensors,
    int64_t srcRank,
    int64_t tag) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::recv_", "")
                       .typed<c10::intrusive_ptr<::c10d::Work>(
                           at::TensorList,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           int64_t,
                           int64_t)>();
  return op.call(tensors, process_group, srcRank, tag);
}

c10::intrusive_ptr<Work> recv_any_source(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList tensors,
    int64_t tag) {
  static auto op = c10::Dispatcher::singleton()
                       .findSchemaOrThrow("c10d::recv_any_source_", "")
                       .typed<c10::intrusive_ptr<::c10d::Work>(
                           at::TensorList,
                           const c10::intrusive_ptr<::c10d::ProcessGroup>&,
                           int64_t)>();
  return op.call(tensors, process_group, tag);
}
After removing this code block, should we also remove the corresponding API declarations in Ops.hpp?
Good point! Thanks
Fixes #90932
Fixes #90659
Remove redundant collective operation definitions by calling the ops directly from `ProcessGroup`
Context: #86225
Differential Revision: [D42854676](https://our.internmc.facebook.com/intern/diff/D42854676)
[ghstack-poisoned]
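For illustration only (editor's sketch, not a quote of the merged diff): "calling the ops directly from `ProcessGroup`" means the dispatch that the removed `broadcast` wrapper above performed can live inside the `ProcessGroup` method itself. The member-function placement and the use of `unsafe_reclaim_from_nonowning` are assumptions for this sketch.

```cpp
// Hypothetical sketch: ProcessGroup::broadcast dispatching the registered
// "c10d::broadcast_" op itself, reusing the schema from the removed wrapper.
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>

namespace c10d {

c10::intrusive_ptr<Work> ProcessGroup::broadcast(
    std::vector<at::Tensor>& tensors,
    const BroadcastOptions& opts) {
  static auto op =
      c10::Dispatcher::singleton()
          .findSchemaOrThrow("c10d::broadcast_", "")
          .typed<std::tuple<std::vector<at::Tensor>, c10::intrusive_ptr<Work>>(
              at::TensorList,
              const c10::intrusive_ptr<::c10d::ProcessGroup>&,
              int64_t,
              int64_t,
              int64_t)>();
  // Wrap `this` as a non-owning intrusive_ptr so the registered op receives
  // the same argument types as before; opts are still unboxed here.
  return std::get<1>(op.call(
      tensors,
      c10::intrusive_ptr<ProcessGroup>::unsafe_reclaim_from_nonowning(this),
      opts.rootRank,
      opts.rootTensor,
      opts.timeout.count()));
}

} // namespace c10d
```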
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
In #91257, we removed direct calls to methods in ops.cpp, so this update also removes ops.hpp. Pull Request resolved: #94532. Approved by: https://github.com/kwen2501
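For reference, the declarations that such a follow-up deletes are just the prototypes of the wrappers shown in the diff above. A reconstructed excerpt of what an Ops.hpp of that shape might contain (the header path, namespace, and exact set of declarations are assumptions, not a quote of the real file):

```cpp
// Reconstructed excerpt (illustrative only): free-function prototypes
// mirroring the definitions removed in the diff above.
#pragma once

#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>

namespace c10d {
namespace ops {

c10::intrusive_ptr<Work> broadcast(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList tensors,
    const BroadcastOptions& opts);

c10::intrusive_ptr<Work> allreduce(
    const c10::intrusive_ptr<ProcessGroup>& process_group,
    at::TensorList tensors,
    const AllreduceOptions& opts);

// ... one declaration per wrapper above (allgather, reduce_scatter, scatter,
// gather, alltoall, barrier, send/recv, etc.)

} // namespace ops
} // namespace c10d
```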
Stack from ghstack (oldest at bottom):
Fixes #90932
Fixes #90659
Remove redundant collective operation definitions by calling the ops directly from `ProcessGroup`
Context: #86225
Differential Revision: D42854676
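To make the effect of the change concrete, here is a hypothetical before/after call site (the `ops::` spelling and the surrounding helper are assumptions for illustration): callers such as Reducer.cpp that previously went through the Ops.cpp wrappers can invoke the collective on the `ProcessGroup` itself, and the dispatch to the registered `c10d::*` op happens inside that method.

```cpp
// Hypothetical before/after call site (illustrative only).
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>

void sync_params(
    const c10::intrusive_ptr<c10d::ProcessGroup>& pg,
    std::vector<at::Tensor>& params) {
  c10d::BroadcastOptions opts;
  opts.rootRank = 0;

  // Before (removed by this PR): free-function wrapper declared in Ops.hpp.
  //   auto work = c10d::ops::broadcast(pg, params, opts);

  // After: call the ProcessGroup method directly; it dispatches the
  // registered "c10d::broadcast_" op internally.
  auto work = pg->broadcast(params, opts);
  work->wait();
}
```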