Add out= variants for cuda.comm.broadcast/gather/scatter #39681
Conversation
💊 CI failures summary and remediations
As of commit 81c20a0 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 1 fixed upstream failure: this was probably caused by an upstream breakage that was already fixed. Please rebase on the
test/test_cuda.py
Outdated
Moved the comm tests to a separate TestCase. Previously, test_gather incorrectly included a non-comm gather test.
Would I be correct to assume that these moved tests stay intact, except that they now belong to a different test class?
Mostly, with some tests for out= and error messages added. I'll comment to highlight the additions.
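For reference, a minimal sketch of what the new out= variant looks like from Python (assumes two CUDA devices; devices= and out= are meant to be mutually exclusive, and the keyword-only signature is the one this PR adds):

```python
import torch
import torch.cuda.comm as comm

src = torch.randn(4, device="cuda:0")
# Preallocate one output tensor per target device, then broadcast into them in place.
outs = [torch.empty(4, device="cuda:0"), torch.empty(4, device="cuda:1")]
results = comm.broadcast(src, out=outs)
assert all(torch.equal(r.cpu(), src.cpu()) for r in results)
```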
Sorry about the delay, I will help review this.
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
std::vector<Tensor> nccl_list;
nccl_list.reserve(out_tensors.size() + 1);
nccl_list.push_back(tensor);
for (auto& out_tensor : out_tensors) {
  nccl_list.push_back(out_tensor);
}
Would it be better to move these lines into the if branch below, so that when NCCL is not available (but built with USE_NCCL=1) we don't have to create this vector?
:) But we need this vector<Tensor> to test whether NCCL can accept them.
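Roughly, the combined list exists so it can be handed to the NCCL availability check before dispatch. A Python-level sketch of that check-then-dispatch pattern (illustrative only; the actual code is the C++ in comm.cpp, with torch.cuda.nccl standing in for the C++ nccl:: calls):

```python
import torch
import torch.cuda.nccl as nccl

def broadcast_out_sketch(tensor, out_tensors):
    # Availability is decided per tensor list (dtype, contiguity, device
    # placement), so the list must exist before we know whether the NCCL
    # fast path applies.
    nccl_list = [tensor] + list(out_tensors)
    if nccl.is_available(nccl_list):
        nccl.broadcast(nccl_list)  # fast path: one collective broadcast from tensor
    else:
        for out in out_tensors:    # fallback: per-device copies
            out.copy_(tensor, non_blocking=True)
    return out_tensors
```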
torch/csrc/cuda/comm.cpp
Outdated
    out_tensors[i].sizes() == tensor.sizes(),
    "Expected all output tensors to have same shape as the source tensor ",
    tensor.sizes(), ", but output tensor at index ", i, " has shape ",
    out_tensors[i].sizes());
do we need to check strides?
Don't need to. If they are not all contiguous, the naive copy_ will handle it fine.
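A tiny illustration of why a stride check is unnecessary (assumes a CUDA device; copy_ accepts any destination layout, so matching shapes is enough):

```python
import torch

src = torch.arange(6., device="cuda").reshape(2, 3)  # contiguous source
out = torch.empty(3, 2, device="cuda").t()           # non-contiguous destination view, shape (2, 3)
out.copy_(src, non_blocking=True)                     # copy_ handles the stride mismatch
assert not out.is_contiguous()
assert torch.equal(out.cpu(), src.cpu())
```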
torch/csrc/cuda/comm.cpp
Outdated
std::vector<Tensor>& broadcast_out(const Tensor& tensor, std::vector<Tensor>& out_tensors) {
  for (size_t i = 0; i < out_tensors.size(); i++) {
    TORCH_CHECK(
        out_tensors[i].is_cuda(),
nit: could you please run clang-format on this file? It might ask for 4 spaces here and several places below.
Will do!
torch/csrc/cuda/comm.cpp
Outdated
// no checks
static inline
std::vector<Tensor>& _broadcast_out_impl(const Tensor& tensor, std::vector<Tensor>& out_tensors) {
Curious: since out_tensors is already in the args, why do we need to return it again?
We don't need to! This could have a void return type. I just followed the Python out= and in-place function signatures, and I don't think it matters.
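For context, the Python out= convention this mirrors: the function hands back the very tensor it was given, purely as a convenience. A minimal example:

```python
import torch

a, b = torch.ones(3), torch.ones(3)
out = torch.empty(3)
result = torch.add(a, b, out=out)
assert result is out  # out= ops return the output tensor they were given
```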
  }
}
return tensors;
_broadcast_out_impl(tensor, diff_device_dst_tensors);
When using NCCL, this will create two vectors of tensors. I wonder if it would be better to std::move diff_device_dst_tensors and let _broadcast_out_impl take ownership?
_broadcast_out_impl takes a reference though, so I think it would be okay here.
for (auto device : devices) {
  if (device != tensor.get_device()) {
    dst_tensors.push_back(*it++);
  } else {
I might be missing something, but it doesn't seem like this else branch will ever be reached? This function does not add the input tensor to diff_device_dst_tensors, and it seems neither does _broadcast_out_impl?
If the target device is the same as the source device, we don't broadcast for that device (see line 88 above) and just return the source tensor (var tensor) here since there was no need to move.
ah I see
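Concretely, at the Python level (assumes at least one CUDA device; this is exactly what the data_ptr check in the tests below asserts):

```python
import torch
import torch.cuda.comm as comm

src = torch.randn(4, device="cuda:0")
results = comm.broadcast(src, devices=[0])        # target set includes the source device
assert results[0].data_ptr() == src.data_ptr()    # the source tensor is reused, not copied
```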
  }
}
TORCH_INTERNAL_ASSERT(it == diff_device_dst_tensors.end());
return dst_tensors;
Why do we need to create a new dst_tensors instead of returning diff_device_dst_tensors?
Because devices can contain the source tensor's device, and diff_device_dst_tensors doesn't include it.
self.assertEqual(t, input)
if input.is_cuda and input.get_device() == i:  # test not copying on same device
    self.assertEqual(t.data_ptr(), input.data_ptr())
# test out=
new out test
for i, t in enumerate(results):
    self.assertEqual(t.get_device(), i)
    self.assertEqual(t, input)
# test error msg
new error test
self.assertEqual(r, input[tuple(index)], atol=0, rtol=0)
chunk_start = chunk_end

# test error msg
new error test
index[dim] = slice(x.size(dim), x.size(dim) + y.size(dim))
self.assertEqual(result[tuple(index)], y)

# test error msg
new error test
    expected_device = torch.device('cuda', torch.cuda.current_device())
else:
    expected_device = destination
for use_out in [True, False]:
new out test
if r.device == input.device:
    self.assertEqual(r.data_ptr(), input.data_ptr())  # for target @ same device, a view should be returned

# test out
new out test
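For completeness, a sketch of the gather out= path these tests cover (assumes two CUDA devices; when out is given, the destination is implied by out's device and the destination argument should not be passed):

```python
import torch
import torch.cuda.comm as comm

chunks = [torch.randn(2, 3, device="cuda:0"), torch.randn(2, 3, device="cuda:1")]
out = torch.empty(4, 3, device="cuda:0")      # concatenation target along dim 0
result = comm.gather(chunks, dim=0, out=out)  # gathers into the preallocated tensor
assert result.data_ptr() == out.data_ptr()
```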
const int64_t chunk_size_sum =
    std::accumulate(chunk_sizes->begin(), chunk_sizes->end(), int64_t{0});
TORCH_CHECK(!out_tensors.empty(), "Expected at least one output tensor to scatter to");
dim = at::maybe_wrap_dim(dim, tensor);
what does maybe_wrap_dim do?
It makes negative dims work (e.g. dim=-1 refers to the last dimension)!
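A hedged Python sketch of what the wrapping amounts to (illustrative only, not the ATen implementation; wrap_dim is a made-up name here):

```python
def wrap_dim(dim, ndim):
    # Map a possibly-negative dim into [0, ndim): -1 means the last dimension.
    if not -ndim <= dim < ndim:
        raise IndexError(f"Dimension out of range (expected to be in "
                         f"[{-ndim}, {ndim - 1}], but got {dim})")
    return dim + ndim if dim < 0 else dim

assert wrap_dim(-1, 4) == 3
assert wrap_dim(2, 4) == 2
```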
    i, " has device '", out_tensors[i].device(), "'");
auto out_sizes = out_tensors[i].sizes().vec();
bool same_ndim = out_sizes.size() == tensor.dim();
if (same_ndim) {
Since we require same_ndim always to be true, shall we do the TORCH_CHECK before this line and drop the branching here?
The TORCH_CHECK also compares against out_sizes, which can only be constructed when same_ndim holds.
  // more copying than `scatter(src)`.
  out_tensors[i].copy_(chunks[i], /*non_blocking=*/true);
}
return out_tensors;
Same question: is it necessary to return it, since it is the same as the reference in the arg list?
mrshenli left a comment
LGTM! Pending the clang-format correction.
all_channels_last = all_channels_last &&
    tensor.suggest_memory_format() == MemoryFormat::ChannelsLast;
if (memory_format != MemoryFormat::Contiguous && tensor.suggest_memory_format() != memory_format) {
  memory_format = MemoryFormat::Contiguous;
This means any disagreement in memory format across all input tensors would fall back to contiguous memory format?
Yeah, I mostly followed the existing logic, which is a reasonable choice.
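A rough Python approximation of that fallback rule (illustrative only; the C++ code uses suggest_memory_format(), which the is_contiguous check below only approximates):

```python
import torch

def common_memory_format(tensors):
    # Keep channels_last only if every tensor already looks channels_last;
    # any disagreement falls back to the plain contiguous format.
    if all(t.is_contiguous(memory_format=torch.channels_last) for t in tensors):
        return torch.channels_last
    return torch.contiguous_format

a = torch.randn(2, 3, 4, 5).contiguous(memory_format=torch.channels_last)
b = torch.randn(2, 3, 4, 5)  # default contiguous layout
assert common_memory_format([a, b]) is torch.contiguous_format
```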
| py::arg("destination_index"), | ||
| py::call_guard<py::gil_scoped_release>()) | ||
| .def( | ||
| "_gather_out", |
This is prior to this PR. Just curious, why don't we support providing streams for gather as well?
:) I don't know. I assume that scatter was specially handled to speed up DP.
torch/cuda/comm.py
Outdated
    devices the tensor should be scattered.
tensor (Tensor): tensor to scatter. Can be on CPU or CUDA.
devices (Iterable[torch.device, str or int], optional): an iterable of
    CUDA devices, among which to broadcast.
you mean scatter?
@mrshenli I think this is mergeable now :)
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Partially fixes #38911