Add out= variants for cuda.comm.broadcast/gather/scatter #39681
Conversation
💊 CI failures summary and remediations
As of commit 81c20a0 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 1 fixed upstream failure: this was probably caused by an upstream breakage that was already fixed. Please rebase on the
test/test_cuda.py
Outdated
Moved the comm tests to a separate TestCase. Previously, test_gather incorrectly included a non-comm gather test.
Would I be correct to assume that these moved tests stay intact, except that they now belong to a different test class?
Mostly, with some tests for out= and error messages added. I'll comment to highlight the additions.
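For reference, a minimal sketch of what the new out= variant looks like from Python (assumes two CUDA devices; devices= and out= are meant to be mutually exclusive, and the keyword-only signature is the one this PR adds):

```python
import torch
import torch.cuda.comm as comm

src = torch.randn(4, device="cuda:0")
# Preallocate one output tensor per target device, then broadcast into them in place.
outs = [torch.empty(4, device="cuda:0"), torch.empty(4, device="cuda:1")]
results = comm.broadcast(src, out=outs)
assert all(torch.equal(r.cpu(), src.cpu()) for r in results)
```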
Sorry about the delay, I will help review this.
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
std::vector<Tensor> nccl_list;
nccl_list.reserve(out_tensors.size() + 1);
nccl_list.push_back(tensor);
for (auto& out_tensor : out_tensors) {
  nccl_list.push_back(out_tensor);
}
Would it be better to move these lines into the if branch below, so that when NCCL is not available (but built with USE_NCCL=1) we don't have to create this vector?
:) But we need this vector<Tensor> to test whether NCCL can accept them.
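Roughly, the combined list exists so it can be handed to the NCCL availability check before dispatch. A Python-level sketch of that check-then-dispatch pattern (illustrative only; the actual code is the C++ in comm.cpp, with torch.cuda.nccl standing in for the C++ nccl:: calls):

```python
import torch
import torch.cuda.nccl as nccl

def broadcast_out_sketch(tensor, out_tensors):
    # Availability is decided per tensor list (dtype, contiguity, device
    # placement), so the list must exist before we know whether the NCCL
    # fast path applies.
    nccl_list = [tensor] + list(out_tensors)
    if nccl.is_available(nccl_list):
        nccl.broadcast(nccl_list)  # fast path: one collective broadcast from tensor
    else:
        for out in out_tensors:    # fallback: per-device copies
            out.copy_(tensor, non_blocking=True)
    return out_tensors
```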
torch/csrc/cuda/comm.cpp
Outdated
    out_tensors[i].sizes() == tensor.sizes(),
    "Expected all output tensors to have same shape as the source tensor ",
    tensor.sizes(), ", but output tensor at index ", i, " has shape ",
    out_tensors[i].sizes());
do we need to check strides?
Don't need to. If they are not all contiguous, the naive copy_ will handle it fine.
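A tiny illustration of why a stride check is unnecessary (assumes a CUDA device; copy_ accepts any destination layout, so matching shapes is enough):

```python
import torch

src = torch.arange(6., device="cuda").reshape(2, 3)  # contiguous source
out = torch.empty(3, 2, device="cuda").t()           # non-contiguous destination view, shape (2, 3)
out.copy_(src, non_blocking=True)                     # copy_ handles the stride mismatch
assert not out.is_contiguous()
assert torch.equal(out.cpu(), src.cpu())
```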
torch/csrc/cuda/comm.cpp
Outdated
std::vector<Tensor>& broadcast_out(const Tensor& tensor, std::vector<Tensor>& out_tensors) {
  for (size_t i = 0; i < out_tensors.size(); i++) {
    TORCH_CHECK(
        out_tensors[i].is_cuda(),
nit: could you please run clang-format on this file? It might ask for 4 spaces here and several places below.
Will do!
torch/csrc/cuda/comm.cpp
Outdated
// no checks
static inline
std::vector<Tensor>& _broadcast_out_impl(const Tensor& tensor, std::vector<Tensor>& out_tensors) {
Curious: since out_tensors is already in the args, why do we need to return it again?
We don't need to! This could have a void return type. I just followed the Python out= and in-place function signatures, and I don't think it matters.
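For context, the Python out= convention this mirrors: the function hands back the very tensor it was given, purely as a convenience. A minimal example:

```python
import torch

a, b = torch.ones(3), torch.ones(3)
out = torch.empty(3)
result = torch.add(a, b, out=out)
assert result is out  # out= ops return the output tensor they were given
```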
  }
}
return tensors;
_broadcast_out_impl(tensor, diff_device_dst_tensors);
When using NCCL, this will create two vectors of tensors. I wonder if it would be better to std::move diff_device_dst_tensors and let _broadcast_out_impl take ownership?
_broadcast_out_impl takes a reference though, so I think it would be okay here.
for (auto device : devices) {
  if (device != tensor.get_device()) {
    dst_tensors.push_back(*it++);
  } else {
I might be missing something, but it doesn't seem like this else branch will ever be reached? This function does not add the input tensor to diff_device_dst_tensors, and it seems neither does _broadcast_out_impl?
If the target device is the same as the source device, we don't broadcast for that device (see line 88 above) and just return the source tensor (var tensor) here since there was no need to move.
ah I see
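Concretely, at the Python level (assumes at least one CUDA device; this is exactly what the data_ptr check in the tests below asserts):

```python
import torch
import torch.cuda.comm as comm

src = torch.randn(4, device="cuda:0")
results = comm.broadcast(src, devices=[0])        # target set includes the source device
assert results[0].data_ptr() == src.data_ptr()    # the source tensor is reused, not copied
```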
  }
}
TORCH_INTERNAL_ASSERT(it == diff_device_dst_tensors.end());
return dst_tensors;
Why do we need to create a new dst_tensors instead of returning diff_device_dst_tensors?
Because devices can contain the source tensor's device, and diff_device_dst_tensors doesn't include it.
self.assertEqual(t, input)
if input.is_cuda and input.get_device() == i:  # test not copying on same device
    self.assertEqual(t.data_ptr(), input.data_ptr())
# test out=
new out test
for i, t in enumerate(results):
    self.assertEqual(t.get_device(), i)
    self.assertEqual(t, input)
# test error msg
new error test
self.assertEqual(r, input[tuple(index)], atol=0, rtol=0)
chunk_start = chunk_end

# test error msg
new error test
index[dim] = slice(x.size(dim), x.size(dim) + y.size(dim))
self.assertEqual(result[tuple(index)], y)

# test error msg
new error test
    expected_device = torch.device('cuda', torch.cuda.current_device())
else:
    expected_device = destination
for use_out in [True, False]:
new out test
if r.device == input.device:
    self.assertEqual(r.data_ptr(), input.data_ptr())  # for target @ same device, a view should be returned

# test out
new out test
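For completeness, a sketch of the gather out= path these tests cover (assumes two CUDA devices; when out is given, the destination is implied by out's device and the destination argument should not be passed):

```python
import torch
import torch.cuda.comm as comm

chunks = [torch.randn(2, 3, device="cuda:0"), torch.randn(2, 3, device="cuda:1")]
out = torch.empty(4, 3, device="cuda:0")      # concatenation target along dim 0
result = comm.gather(chunks, dim=0, out=out)  # gathers into the preallocated tensor
assert result.data_ptr() == out.data_ptr()
```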
const int64_t chunk_size_sum =
    std::accumulate(chunk_sizes->begin(), chunk_sizes->end(), int64_t{0});
TORCH_CHECK(!out_tensors.empty(), "Expected at least one output tensor to scatter to");
dim = at::maybe_wrap_dim(dim, tensor);
what does maybe_wrap_dim do?
It makes negative dims work (e.g. dim=-1 refers to the last dimension)!
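A hedged Python sketch of what the wrapping amounts to (illustrative only, not the ATen implementation; wrap_dim is a made-up name here):

```python
def wrap_dim(dim, ndim):
    # Map a possibly-negative dim into [0, ndim): -1 means the last dimension.
    if not -ndim <= dim < ndim:
        raise IndexError(f"Dimension out of range (expected to be in "
                         f"[{-ndim}, {ndim - 1}], but got {dim})")
    return dim + ndim if dim < 0 else dim

assert wrap_dim(-1, 4) == 3
assert wrap_dim(2, 4) == 2
```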
    i, " has device '", out_tensors[i].device(), "'");
auto out_sizes = out_tensors[i].sizes().vec();
bool same_ndim = out_sizes.size() == tensor.dim();
if (same_ndim) {
Since we require same_ndim always to be true, shall we do the TORCH_CHECK before this line and drop the branching here?
The TORCH_CHECK also compares against out_sizes, which can only be constructed when same_ndim holds.
  // more copying than `scatter(src)`.
  out_tensors[i].copy_(chunks[i], /*non_blocking=*/true);
}
return out_tensors;
Same question: is it necessary to return it, since it is the same as the reference in the arg list?
mrshenli left a comment
LGTM! Pending the clang-format correction.
all_channels_last = all_channels_last &&
    tensor.suggest_memory_format() == MemoryFormat::ChannelsLast;
if (memory_format != MemoryFormat::Contiguous && tensor.suggest_memory_format() != memory_format) {
  memory_format = MemoryFormat::Contiguous;
This means any disagreement in memory format across all input tensors would fall back to contiguous memory format?
Yeah, I mostly followed the existing logic, which is a reasonable choice.
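A rough Python approximation of that fallback rule (illustrative only; the C++ code uses suggest_memory_format(), which the is_contiguous check below only approximates):

```python
import torch

def common_memory_format(tensors):
    # Keep channels_last only if every tensor already looks channels_last;
    # any disagreement falls back to the plain contiguous format.
    if all(t.is_contiguous(memory_format=torch.channels_last) for t in tensors):
        return torch.channels_last
    return torch.contiguous_format

a = torch.randn(2, 3, 4, 5).contiguous(memory_format=torch.channels_last)
b = torch.randn(2, 3, 4, 5)  # default contiguous layout
assert common_memory_format([a, b]) is torch.contiguous_format
```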
| py::arg("destination_index"), | ||
| py::call_guard<py::gil_scoped_release>()) | ||
| .def( | ||
| "_gather_out", |
This is prior to this PR. Just curious, why don't we support providing streams for gather as well?
:) I don't know. I assume that scatter was specially handled to speed up DP.
torch/cuda/comm.py
Outdated
    devices the tensor should be scattered.
tensor (Tensor): tensor to scatter. Can be on CPU or CUDA.
devices (Iterable[torch.device, str or int], optional): an iterable of
    CUDA devices, among which to broadcast.
you mean scatter?
@mrshenli I think this is mergeable now :)
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Partially fixes #38911