
Conversation

@ssnl ssnl (Collaborator) commented Jun 8, 2020

Partially fixes #38911
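For context, a rough usage sketch of what the new out= variants enable: writing broadcast/scatter/gather results into preallocated buffers instead of freshly allocated tensors. The keyword names below are assumed from the post-merge torch.cuda.comm signatures, and the sketch assumes at least two visible GPUs.

import torch
from torch.cuda import comm

src = torch.randn(4, 4)  # the source may live on CPU or CUDA

# broadcast into preallocated CUDA buffers instead of allocating new ones
bufs = [torch.empty(4, 4, device=d) for d in ('cuda:0', 'cuda:1')]
comm.broadcast(src, out=bufs)

# scatter chunks along dim 0 into preallocated per-device buffers
chunks = [torch.empty(2, 4, device='cuda:0'), torch.empty(2, 4, device='cuda:1')]
comm.scatter(src, dim=0, out=chunks)

# gather the chunks back into a single preallocated tensor
dst = torch.empty(4, 4, device='cuda:0')
comm.gather(chunks, dim=0, out=dst)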

@dr-ci dr-ci bot commented Jun 8, 2020

💊 CI failures summary and remediations

As of commit 81c20a0 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

If your commit is newer than viable/strict, you can try basing on an older, stable commit:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)

If your commit is older than viable/strict:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.



@ssnl ssnl force-pushed the commout branch 5 times, most recently from 4f3305a to cc8c402 on June 9, 2020 03:09
@ssnl ssnl changed the title [WIP] add out= variants for cuda.comm.* [WIP] add out= variants for cuda.comm.broadcast/gather/scatter Jun 9, 2020
@ssnl ssnl force-pushed the commout branch 6 times, most recently from 5155011 to 659d569 on June 12, 2020 19:05
@ssnl ssnl marked this pull request as ready for review June 13, 2020 04:04
@ssnl ssnl changed the title [WIP] add out= variants for cuda.comm.broadcast/gather/scatter Add out= variants for cuda.comm.broadcast/gather/scatter Jun 13, 2020
@ssnl ssnl (Collaborator Author) commented Jun 13, 2020

Moved comm tests to a separate TestCase. Previously, test_gather incorrectly included a non-comm gather test.

Contributor

Would I be correct to assume that these moved tests stay intact, except that they now belong to a different test class?

Collaborator Author

Mostly, with some tests for out= and error messages added. I'll comment to highlight the additions.

@ailzhang ailzhang added the module: cuda (Related to torch.cuda, and CUDA support in general) and oncall: distributed (Add this issue/PR to distributed oncall triage queue) labels on Jun 17, 2020
@ailzhang (Contributor)

@ngimel @mrshenli Do you know who's the best POC to review this PR? Thanks!

@ezyang ezyang requested a review from mrshenli June 20, 2020 03:25
@ezyang ezyang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jun 20, 2020
@mrshenli (Contributor)

Sorry about the delay, I will help review this.

@facebook-github-bot facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


Comment on lines +47 to +56
std::vector<Tensor> nccl_list;
nccl_list.reserve(out_tensors.size() + 1);
nccl_list.push_back(tensor);
for (auto& out_tensor : out_tensors) {
nccl_list.push_back(out_tensor);
}
Contributor

Would it be better to move these lines into the if branch below? That way, when NCCL is not usable even though the build used USE_NCCL=1, we don't have to create this vector.

Collaborator Author

:) But we need this std::vector<Tensor> to test whether NCCL can accept the tensors.
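For reference, a Python-level sketch of why the full list is needed up front: the availability check takes the source plus every output tensor and only then decides whether the NCCL path can be used. This uses torch.cuda.nccl.is_available as an analogue of the C++ check and assumes an NCCL build with two GPUs.

import torch
import torch.cuda.nccl as nccl

src = torch.randn(8, device='cuda:0')
out = [torch.empty(8, device='cuda:1')]

# all tensors must be contiguous CUDA tensors on distinct devices
print(nccl.is_available([src] + out))   # True  -> NCCL broadcast path
print(nccl.is_available([src, src]))    # False -> plain copy_ fallback (duplicate device)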

out_tensors[i].sizes() == tensor.sizes(),
"Expected all output tensors to have same shape as the source tensor ",
tensor.sizes(), ", but output tensor at index ", i, " has shape ",
out_tensors[i].sizes());
Contributor

Do we need to check strides?

Collaborator Author

We don't need to. If they are not all contiguous, the naive copy_ will handle it fine.
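For illustration, a minimal sketch of copy_ handling a non-contiguous destination (plain CPU tensors here; the same applies to the CUDA out tensors in broadcast_out):

import torch

src = torch.arange(6.).reshape(2, 3)
out = torch.empty(3, 2).t()   # shape (2, 3), but with non-contiguous strides
out.copy_(src)                # copy_ resolves the stride mismatch itself
assert not out.is_contiguous() and torch.equal(out, src)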

std::vector<Tensor>& broadcast_out(const Tensor& tensor, std::vector<Tensor> &out_tensors) {
for (size_t i = 0; i < out_tensors.size(); i++) {
TORCH_CHECK(
out_tensors[i].is_cuda(),
Contributor

nit: could you please run clang-format on this file? It might ask for 4 spaces here and several places below.

Collaborator Author

Will do!


// no checks
static inline
std::vector<Tensor>& _broadcast_out_impl(const Tensor& tensor, std::vector<Tensor> &out_tensors) {
Contributor

Curious: since out_tensors is already passed in as an argument, why do we need to return it again?

Collaborator Author

We don't need to! This could have a void return type. I just followed the Python out= and in-place function signatures, and I don't think it matters.
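For reference, the out= convention being mirrored here: such functions write into out and also return that same object, so the return value is a convenience rather than a necessity. A small sketch:

import torch

a, b, c = torch.ones(3), torch.ones(3), torch.empty(3)
r = torch.add(a, b, out=c)   # writes the result into c ...
assert r is c                # ... and returns the very same tensor object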

}
}
return tensors;
_broadcast_out_impl(tensor, diff_device_dst_tensors);
Contributor

When using NCCL, this will create two vectors of tensors. I wonder if it would be better to std::move diff_device_dst_tensors and let _broadcast_out_impl take ownership?

Collaborator Author

_broadcast_out_impl takes a reference though, so I think it would be okay here.

for (auto device : devices) {
if (device != tensor.get_device()) {
dst_tensors.push_back(*it++);
} else {
Contributor

I might be missing something, but it doesn't seem like this else branch will ever be reached? This function does not add the input tensor to diff_device_dst_tensors, and it seems _broadcast_out_impl doesn't either.

Collaborator Author

If the target device is the same as the source device, we don't broadcast for that device (see line 88 above) and just return the source tensor (the variable tensor) here, since there is no data to move.

Contributor

ah I see

}
}
TORCH_INTERNAL_ASSERT(it == diff_device_dst_tensors.end());
return dst_tensors;
Contributor

Why do we need to create a new dst_tensors instead of returning diff_device_dst_tensors?

Collaborator Author

Because devices can contain the source tensor's device, and diff_device_dst_tensors doesn't include a tensor for that device.

self.assertEqual(t, input)
if input.is_cuda and input.get_device() == i: # test not copying on same device
self.assertEqual(t.data_ptr(), input.data_ptr())
# test out=
Collaborator Author

new out test

for i, t in enumerate(results):
self.assertEqual(t.get_device(), i)
self.assertEqual(t, input)
# test error msg
Collaborator Author

new error test

self.assertEqual(r, input[tuple(index)], atol=0, rtol=0)
chunk_start = chunk_end

# test error msg
Collaborator Author

new error test

index[dim] = slice(x.size(dim), x.size(dim) + y.size(dim))
self.assertEqual(result[tuple(index)], y)

# test error msg
Collaborator Author

new error test

expected_device = torch.device('cuda', torch.cuda.current_device())
else:
expected_device = destination
for use_out in [True, False]:
Collaborator Author

new out test

if r.device == input.device:
self.assertEqual(r.data_ptr(), input.data_ptr()) # for target @ same device, a view should be returned

# test out
Collaborator Author

new out test

const int64_t chunk_size_sum =
std::accumulate(chunk_sizes->begin(), chunk_sizes->end(), int64_t{0});
TORCH_CHECK(!out_tensors.empty(), "Expected at least one output tensor to scatter to");
dim = at::maybe_wrap_dim(dim, tensor);
Contributor

What does maybe_wrap_dim do?

Collaborator Author

It makes it so that negative dims work!
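In other words, a negative dim is interpreted relative to the tensor's rank before the shape checks run. A quick sketch (the wrap_dim helper is purely illustrative, and the scatter call assumes two GPUs):

import torch
from torch.cuda import comm

def wrap_dim(dim, ndim):
    # what maybe_wrap_dim effectively does for a valid negative dim
    return dim + ndim if dim < 0 else dim

x = torch.randn(4, 6, device='cuda:0')
out = [torch.empty(4, 3, device='cuda:0'), torch.empty(4, 3, device='cuda:1')]
comm.scatter(x, dim=-1, out=out)   # dim=-1 wraps to dim=1 for a 2-D tensor
assert wrap_dim(-1, x.dim()) == 1
assert torch.equal(out[0], x[:, :3])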

i, " has device '", out_tensors[i].device(), "'");
auto out_sizes = out_tensors[i].sizes().vec();
bool same_ndim = out_sizes.size() == tensor.dim();
if (same_ndim) {
Contributor

Since we require same_ndim to always be true, shall we do the TORCH_CHECK before this line and drop the branching here?

Collaborator Author

The TORCH_CHECK also compares against out_sizes, which can only be constructed when same_ndim holds.
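Roughly, the validation these snippets build up to reads as follows; this is a Python rendering reconstructed from the excerpts above, so the exact messages and ordering are an approximation:

def check_scatter_out(src, out_tensors, dim):
    # every out tensor must be CUDA, have the same rank as src, and match src
    # on every dim except `dim`; the sizes along `dim` must sum to src.size(dim)
    total = 0
    for i, out in enumerate(out_tensors):
        assert out.is_cuda, f"output tensor {i} must be a CUDA tensor"
        assert out.dim() == src.dim(), f"output tensor {i} has a different ndim"
        expected = list(src.size())
        expected[dim] = out.size(dim)
        assert list(out.size()) == expected, f"output tensor {i} has a bad shape"
        total += out.size(dim)
    assert total == src.size(dim), "chunk sizes along dim must cover the source"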

// more copying than `scatter(src)`.
out_tensors[i].copy_(chunks[i], /*non_blocking=*/true);
}
return out_tensors;
Contributor

Same question: is it necessary to return it, since it is the same as the reference in the arg list?

@mrshenli mrshenli (Contributor) left a comment

LGTM! Pending the clang-format correction.

all_channels_last = all_channels_last &&
tensor.suggest_memory_format() == MemoryFormat::ChannelsLast;
if (memory_format != MemoryFormat::Contiguous && tensor.suggest_memory_format() != memory_format) {
memory_format = MemoryFormat::Contiguous;
Contributor

This means any disagreement in memory format across the input tensors falls back to the contiguous memory format?

Collaborator Author

Yeah, I mostly followed the existing logic, which seems like a reasonable choice.
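For illustration, a rough Python rendering of that fallback decision; it is a reconstruction (the C++ uses suggest_memory_format()), not the exact logic:

import torch

def pick_memory_format(tensors):
    # start from channels_last only if the first tensor prefers it, then drop to
    # contiguous_format as soon as any tensor disagrees
    fmt = (torch.channels_last
           if tensors[0].is_contiguous(memory_format=torch.channels_last)
           else torch.contiguous_format)
    for t in tensors[1:]:
        if fmt != torch.contiguous_format and not t.is_contiguous(memory_format=fmt):
            fmt = torch.contiguous_format
    return fmt

a = torch.randn(2, 3, 4, 5).contiguous(memory_format=torch.channels_last)
b = torch.randn(2, 3, 4, 5)            # default, row-major contiguous
print(pick_memory_format([a, a]))      # torch.channels_last
print(pick_memory_format([a, b]))      # torch.contiguous_format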

py::arg("destination_index"),
py::call_guard<py::gil_scoped_release>())
.def(
"_gather_out",
Contributor

This predates this PR, but just curious: why don't we support providing streams for gather as well?

Collaborator Author

:) I don't know. I assume that scatter was special-cased to speed up DataParallel.

devices the tensor should be scattered.
tensor (Tensor): tensor to scatter. Can be on CPU or CUDA.
devices (Iterable[torch.device, str or int], optional): an iterable of
CUDA devices, among which to broadcast.
Contributor

You mean scatter?

@ssnl ssnl (Collaborator Author) commented Jun 23, 2020

@mrshenli I think this is mergeable now :)

@facebook-github-bot facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@mrshenli merged this pull request in de7ac60.


Labels

Merged; module: cuda (Related to torch.cuda, and CUDA support in general); oncall: distributed (Add this issue/PR to distributed oncall triage queue); open source; triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Development

Successfully merging this pull request may close these issues:

[FR] cuda.comm.broadcast/reduce_add support out=

7 participants