
Conversation

@rohan-varma
Contributor

@rohan-varma rohan-varma commented Jul 28, 2020

Stack from ghstack:

Rehash of #28811, which was several months old.

As part of addressing #23232, this PR adds support for the following APIs:

`allgather_object` and `gather_object`, to support gather/allgather of generic, picklable Python objects. This has been a long-requested feature, so PyTorch should provide these helpers built-in.

The methodology is what is proposed in the original issue:

  1. Pickle the object to a ByteTensor using torch.save
  2. Communicate tensor sizes across ranks
  3. Copy the local ByteTensor into a tensor of maximal size
  4. Call the tensor-based collectives on the result of (3)
  5. Unpickle back into the object using torch.load
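
For concreteness, here is a minimal sketch of these five steps built only on the existing tensor-based `torch.distributed` collectives. It assumes the default process group is already initialized; the helper name and the exact size bookkeeping are illustrative, not this PR's actual implementation.

```python
import io

import torch
import torch.distributed as dist


def _all_gather_object_sketch(obj):
    # (1) Pickle the object to a ByteTensor via torch.save.
    buf = io.BytesIO()
    torch.save(obj, buf)
    byte_tensor = torch.ByteTensor(list(buf.getvalue()))
    local_size = torch.LongTensor([byte_tensor.numel()])

    # (2) Communicate tensor sizes so every rank knows the maximum.
    world_size = dist.get_world_size()
    size_list = [torch.LongTensor([0]) for _ in range(world_size)]
    dist.all_gather(size_list, local_size)
    max_size = int(max(size.item() for size in size_list))

    # (3) Copy the local ByteTensor into a tensor of maximal size.
    padded = torch.zeros(max_size, dtype=torch.uint8)
    padded[: byte_tensor.numel()] = byte_tensor

    # (4) Call the tensor-based collective on the padded tensors.
    output = [torch.empty(max_size, dtype=torch.uint8) for _ in range(world_size)]
    dist.all_gather(output, padded)

    # (5) Unpickle each rank's bytes back into a Python object, using the true sizes.
    results = []
    for tensor, size in zip(output, size_list):
        data = bytes(tensor[: int(size.item())].tolist())
        results.append(torch.load(io.BytesIO(data)))
    return results
```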

Note that the API is designed to match the tensor-based collectives, except that it does not support `async_op`. For now, it is a blocking call. If we see demand for `async_op` support, we will have to make more progress on merging work/future to support this.

If this is a suitable approach, we can support `scatter` and `broadcast` in follow-up PRs.

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22785387/)!

@dr-ci

dr-ci bot commented Jul 28, 2020

💊 CI failures summary and remediations

As of commit 2594d33 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda11.0_build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Error generating file
Retry attempt 3: 
C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(236): error: identifier "cusparseScsrmm2" is undefined

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(259): error: identifier "cusparseDcsrmm2" is undefined

2 errors detected in the compilation of "C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu".
SparseCUDABlas.cu
-- Removing C:/Users/circleci/project/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/./torch_cuda_generated_SparseCUDABlas.cu.obj
C:/Jenkins/Miniconda3/Library/bin/cmake.exe -E remove C:/Users/circleci/project/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/./torch_cuda_generated_SparseCUDABlas.cu.obj
CMake Error at torch_cuda_generated_SparseCUDABlas.cu.obj.Release.cmake:281 (message):
  Error generating file
  C:/Users/circleci/project/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/./torch_cuda_generated_SparseCUDABlas.cu.obj




rohan-varma added a commit that referenced this pull request Jul 28, 2020
Pull Request resolved: #42189

ghstack-source-id: 108711909

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

rohan-varma added a commit that referenced this pull request Jul 30, 2020
Pull Request resolved: #42189

ghstack-source-id: 108837035

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

work.wait()


def _object_to_tensor(obj):
Contributor

(Can add in follow-up PRs) Do we need to accept optional `pickle_module` and `pickle_protocol` arguments as `torch.save` does?

Contributor Author

I didn't see a need to add these protocols since we have control over most of the serialization here, i.e. we directly serialize and deserialize ourselves. Looking at the docs, it seems that specifying a higher protocol might give some performance win over a lower protocol, but I would advocate for keeping this interface simpler if we don't see demand for it.


def _object_to_tensor(obj):
buffer = io.BytesIO()
torch.save(obj, buffer)
Contributor

Question: since we only accept picklable objects, is there any reason for using `torch.save` instead of `pickle.dump`? It looks like `torch.save` will convert the object into a zip-file format; do we know how much overhead that will add?

Contributor Author

I see. I went with `torch.save`/`torch.load` since that seems to be the de facto way of doing serialization in PyTorch. I think it might be more efficient to just use the `pickle` module directly, though.
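
To make the trade-off discussed here concrete, below is a rough sketch of the two serialization options; both helper names are hypothetical. `torch.save` wraps the pickled payload in its zip-based container, while `pickle.dumps` produces the raw pickle bytes.

```python
import io
import pickle

import torch


def _object_to_tensor_via_save(obj):
    # torch.save adds its zip-based container format around the pickled payload.
    buffer = io.BytesIO()
    torch.save(obj, buffer)
    data = buffer.getvalue()
    return torch.ByteTensor(list(data)), torch.LongTensor([len(data)])


def _object_to_tensor_via_pickle(obj):
    # pickle.dumps gives just the pickle bytes, with no extra container overhead.
    data = pickle.dumps(obj)
    return torch.ByteTensor(list(data)), torch.LongTensor([len(data)])
```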

group (ProcessGroup, optional): The process group to work on
Returns:
None. If the calling rank is part of this group, the output of the
Contributor

Let's also say that if the rank is not in the group, the object_list argument will stay intact.

# Avoid copying intermediate tensors back and forth to CUDA by using gloo PG
# for object collectives.
if get_backend(group) == Backend.NCCL:
gloo_group = new_group(
Contributor

This will introduce one more rendezvous; will this be faster than copying the tensor to GPU when using the NCCL backend?

Contributor Author

I see, maybe we can just go with the transfer-to-CUDA approach. I was thinking that if we can cache this gloo group, we might pay less amortized cost if this function is called many times.

Contributor

I see, let's use the provided group in this PR. If we get complaints regarding its speed, we can find ways to optimize. Another reason for not creating a new Gloo group is that users might build the distributed package without Gloo.

Contributor Author

Sounds good. I think this can work for `all_gather_object`, but for `gather_object` we get the error `RuntimeError: ProcessGroupNCCL does not support gather`. I guess if users try to call `gather_object` with the NCCL backend, it makes sense to throw, since NCCL does not implement gather. So should we just throw in this case?

Contributor

I see. Yep, I think it makes sense to just throw when calling gather_object with the NCCL backend.
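
A minimal sketch of the check agreed on above; the helper name and error message are illustrative:

```python
import torch.distributed as dist


def _check_gather_backend(group=None):
    # ProcessGroupNCCL does not implement gather, so fail fast with a clear error.
    if dist.get_backend(group) == dist.Backend.NCCL:
        raise RuntimeError("gather_object is not supported with the NCCL backend")
```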

local_max_size_tensor = torch.ByteTensor(size=(max_object_size,))
local_max_size_tensor[: local_size.item()] = input_tensor
output_tensors = [
torch.empty(max_object_size, dtype=torch.uint8) for _ in range(group_size)
Contributor

(Not sure if this can work) We can avoid multiple memory allocations by creating one big tensor and then letting the output tensors point to views of it.

Contributor Author

We would still end up allocating the same amount of memory, right? Although there may be a perf win from making fewer memory allocation calls.

Contributor

Yep, exactly.
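
A rough sketch of the single-allocation idea from this thread, with illustrative names: allocate one flat buffer and make each per-rank output tensor a view into it.

```python
import torch


def _make_output_views(group_size, max_object_size):
    # One allocation backs all ranks; each output tensor is a contiguous view into it.
    flat = torch.empty(group_size * max_object_size, dtype=torch.uint8)
    return [
        flat[rank * max_object_size : (rank + 1) * max_object_size]
        for rank in range(group_size)
    ]
```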

rohan-varma added a commit that referenced this pull request Aug 1, 2020
Pull Request resolved: #42189

ghstack-source-id: 108999423

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

rohan-varma added a commit that referenced this pull request Aug 1, 2020
Pull Request resolved: #42189

ghstack-source-id: 109003753

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

@rohan-varma rohan-varma requested a review from mrshenli August 1, 2020 01:10
rohan-varma added a commit that referenced this pull request Aug 31, 2020
Pull Request resolved: #43887

As part of addressing #23232, this PR adds support for `broadcast_object_list`, an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.

The implementation follows an approach similar to #42189. The input is a list of objects to be broadcast, and the operation is in place, meaning all ranks that are part of the group will have their input list modified to contain the broadcast objects from the src rank.

Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.
ghstack-source-id: 111098700

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23422577/)!
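
For illustration, a usage sketch of the in-place object broadcast described above, assuming an initialized process group and rank 0 as the source (the payload values are made up):

```python
import torch.distributed as dist

# Every rank provides a list of the same length; rank 0's contents are broadcast in place.
if dist.get_rank() == 0:
    objects = [{"lr": 0.01}, "epoch-3-checkpoint", 42]
else:
    objects = [None, None, None]

dist.broadcast_object_list(objects, src=0)
# After the call, `objects` on every rank holds rank 0's values.
```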
rohan-varma added a commit that referenced this pull request Sep 1, 2020
Closes #23232. As part of addressing #23232, this PR adds support for `scatter_object_list`, an API to scatter arbitrary picklable objects to all the other ranks.

The implementation approach follows a similar approach as #42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter.

Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as their true sizes.

Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.

It only works for Gloo because NCCL doesn't support scatter.

Differential Revision: [D23430686](https://our.internmc.facebook.com/intern/diff/D23430686/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23430686/)!

[ghstack-poisoned]
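
A usage sketch of the scatter API described above, assuming the Gloo backend and an initialized process group (the payloads are made up); the scattered object for each rank lands in the first element of the output list:

```python
import torch.distributed as dist

world_size = dist.get_world_size()
output_list = [None]  # scatter_object_list writes this rank's object into output_list[0]
if dist.get_rank() == 0:
    input_list = ["payload-for-rank-%d" % rank for rank in range(world_size)]
else:
    input_list = [None] * world_size

dist.scatter_object_list(output_list, input_list, src=0)
print(output_list[0])
```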
facebook-github-bot pushed a commit that referenced this pull request Sep 2, 2020
Summary:
Pull Request resolved: #43887

ghstack-source-id: 111180436

Reviewed By: mrshenli

Differential Revision: D23422577

fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
facebook-github-bot pushed a commit that referenced this pull request Dec 5, 2020
Summary:
Pull Request resolved: #43930

ghstack-source-id: 117904065

Reviewed By: mrshenli

Differential Revision: D23430686

fbshipit-source-id: f033b89cd82dadd194f2b036312a98423449c26b