
Conversation

@rohan-varma
Contributor

@rohan-varma rohan-varma commented Jul 28, 2020

Stack from ghstack:

Rehash of #28811, which was several months old.

As part of addressing #23232, this PR adds support for the following APIs:

`allgather_object` and `gather_object`, to support gather/allgather of generic, picklable Python objects. This has been a long-requested feature, so PyTorch should provide these helpers built-in.

The methodology is what is proposed in the original issue:

  1. Pickle the object to a ByteTensor using torch.save
  2. Communicate tensor sizes across ranks
  3. Copy the local ByteTensor into a tensor of maximal size
  4. Call the tensor-based collectives on the result of (3)
  5. Unpickle back into the object using torch.load
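
For concreteness, here is a minimal sketch of these five steps built only on the existing tensor-based `torch.distributed` collectives. It assumes the default process group is already initialized; the helper name and the exact size bookkeeping are illustrative, not this PR's actual implementation.

```python
import io

import torch
import torch.distributed as dist


def _all_gather_object_sketch(obj):
    # (1) Pickle the object to a ByteTensor via torch.save.
    buf = io.BytesIO()
    torch.save(obj, buf)
    byte_tensor = torch.ByteTensor(list(buf.getvalue()))
    local_size = torch.LongTensor([byte_tensor.numel()])

    # (2) Communicate tensor sizes so every rank knows the maximum.
    world_size = dist.get_world_size()
    size_list = [torch.LongTensor([0]) for _ in range(world_size)]
    dist.all_gather(size_list, local_size)
    max_size = int(max(size.item() for size in size_list))

    # (3) Copy the local ByteTensor into a tensor of maximal size.
    padded = torch.zeros(max_size, dtype=torch.uint8)
    padded[: byte_tensor.numel()] = byte_tensor

    # (4) Call the tensor-based collective on the padded tensors.
    output = [torch.empty(max_size, dtype=torch.uint8) for _ in range(world_size)]
    dist.all_gather(output, padded)

    # (5) Unpickle each rank's bytes back into a Python object, using the true sizes.
    results = []
    for tensor, size in zip(output, size_list):
        data = bytes(tensor[: int(size.item())].tolist())
        results.append(torch.load(io.BytesIO(data)))
    return results
```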

Note that the API is designed to match the tensor-based collectives, except that it does not support `async_op`. For now, it is a blocking call. If we see demand for `async_op` support, we will have to make more progress on merging work/future to support this.

If this is a suitable approach, we can support `scatter` and `broadcast` in follow-up PRs.

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22785387/)!

@dr-ci

dr-ci bot commented Jul 28, 2020

💊 CI failures summary and remediations

As of commit 2594d33 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda11.0_build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Error generating file
Retry attempt 3: 
C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(236): error: identifier "cusparseScsrmm2" is undefined

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(259): error: identifier "cusparseDcsrmm2" is undefined

2 errors detected in the compilation of "C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu".
SparseCUDABlas.cu
-- Removing C:/Users/circleci/project/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/./torch_cuda_generated_SparseCUDABlas.cu.obj
C:/Jenkins/Miniconda3/Library/bin/cmake.exe -E remove C:/Users/circleci/project/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/./torch_cuda_generated_SparseCUDABlas.cu.obj
CMake Error at torch_cuda_generated_SparseCUDABlas.cu.obj.Release.cmake:281 (message):
  Error generating file
  C:/Users/circleci/project/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/./torch_cuda_generated_SparseCUDABlas.cu.obj




rohan-varma added a commit that referenced this pull request Jul 28, 2020
Pull Request resolved: #42189

ghstack-source-id: 108711909

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

rohan-varma added a commit that referenced this pull request Jul 30, 2020
Pull Request resolved: #42189

ghstack-source-id: 108837035

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

work.wait()


def _object_to_tensor(obj):
Contributor

(Can add in follow-up PRs) Do we need to accept optional `pickle_module` and `pickle_protocol` arguments as `torch.save` does?

Contributor Author

I didn't see a need to add these protocols since we have control over most of the serialization here, i.e. we directly serialize and deserialize ourselves. Looking at the docs, it seems that specifying a higher protocol might give some performance win over a lower protocol, but I would advocate for keeping this interface simpler if we don't see demand for it.


def _object_to_tensor(obj):
buffer = io.BytesIO()
torch.save(obj, buffer)
Contributor

Question: since we only accept picklable objects, is there any reason for using `torch.save` instead of `pickle.dump`? It looks like `torch.save` will convert the object into a zip-file format; do we know how much overhead that will add?

Contributor Author

I see. I went with `torch.save`/`torch.load` since that seems to be the de facto way of doing serialization in PyTorch. I think it might be more efficient to just use the `pickle` module directly, though.
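
To make the trade-off discussed here concrete, below is a rough sketch of the two serialization options; both helper names are hypothetical. `torch.save` wraps the pickled payload in its zip-based container, while `pickle.dumps` produces the raw pickle bytes.

```python
import io
import pickle

import torch


def _object_to_tensor_via_save(obj):
    # torch.save adds its zip-based container format around the pickled payload.
    buffer = io.BytesIO()
    torch.save(obj, buffer)
    data = buffer.getvalue()
    return torch.ByteTensor(list(data)), torch.LongTensor([len(data)])


def _object_to_tensor_via_pickle(obj):
    # pickle.dumps gives just the pickle bytes, with no extra container overhead.
    data = pickle.dumps(obj)
    return torch.ByteTensor(list(data)), torch.LongTensor([len(data)])
```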

group (ProcessGroup, optional): The process group to work on
Returns:
None. If the calling rank is part of this group, the output of the
Contributor

Let's also say that if the rank is not in the group, the object_list argument will stay intact.

# Avoid copying intermediate tensors back and forth to CUDA by using gloo PG
# for object collectives.
if get_backend(group) == Backend.NCCL:
gloo_group = new_group(
Contributor

This will introduce one more rendezvous; will this be faster than copying the tensor to GPU when using the NCCL backend?

Contributor Author

I see, maybe we can just go with the transfer-to-CUDA approach. I was thinking that if we can cache this gloo group, we might pay less amortized cost if this function is called many times.

Contributor

I see, let's use the provided group in this PR. If we get complaints regarding its speed, we can find ways to optimize. Another reason for not creating a new Gloo group is that users might build the distributed package without Gloo.

Contributor Author

Sounds good. I think this can work for `all_gather_object`, but for `gather_object` we get the error `RuntimeError: ProcessGroupNCCL does not support gather`. I guess if users try to call `gather_object` with the NCCL backend, it makes sense to throw, since NCCL does not implement gather. So should we just throw in this case?

Contributor

I see. Yep, I think it makes sense to just throw when calling gather_object with the NCCL backend.
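
A minimal sketch of the check agreed on above; the helper name and error message are illustrative:

```python
import torch.distributed as dist


def _check_gather_backend(group=None):
    # ProcessGroupNCCL does not implement gather, so fail fast with a clear error.
    if dist.get_backend(group) == dist.Backend.NCCL:
        raise RuntimeError("gather_object is not supported with the NCCL backend")
```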

local_max_size_tensor = torch.ByteTensor(size=(max_object_size,))
local_max_size_tensor[: local_size.item()] = input_tensor
output_tensors = [
torch.empty(max_object_size, dtype=torch.uint8) for _ in range(group_size)
Contributor

(Not sure if this can work) We can avoid multiple memory allocations by creating one big tensor and then letting the output tensors point to views of it.

Contributor Author

We would still end up allocating the same amount of memory, right? Although there may be a perf win from making fewer memory allocation calls.

Contributor

Yep, exactly.
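
A rough sketch of the single-allocation idea from this thread, with illustrative names: allocate one flat buffer and make each per-rank output tensor a view into it.

```python
import torch


def _make_output_views(group_size, max_object_size):
    # One allocation backs all ranks; each output tensor is a contiguous view into it.
    flat = torch.empty(group_size * max_object_size, dtype=torch.uint8)
    return [
        flat[rank * max_object_size : (rank + 1) * max_object_size]
        for rank in range(group_size)
    ]
```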

rohan-varma added a commit that referenced this pull request Aug 1, 2020
Pull Request resolved: #42189

ghstack-source-id: 108999423

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

rohan-varma added a commit that referenced this pull request Aug 1, 2020
Pull Request resolved: #42189

ghstack-source-id: 109003753

Differential Revision: [D22785387](https://our.internmc.facebook.com/intern/diff/D22785387/)

@rohan-varma rohan-varma requested a review from mrshenli August 1, 2020 01:10
rohan-varma added a commit that referenced this pull request Aug 31, 2020
Pull Request resolved: #43887

As part of addressing #23232, this PR adds support for `broadcast_object_list`, an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.

The implementation follows an approach similar to #42189. The input is a list of objects to be broadcast, and the operation is in place, meaning all ranks that are part of the group will have their input list modified to contain the broadcast objects from the src rank.

Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.
ghstack-source-id: 111098700

Differential Revision: [D23422577](https://our.internmc.facebook.com/intern/diff/D23422577/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23422577/)!
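
For illustration, a usage sketch of the in-place object broadcast described above, assuming an initialized process group and rank 0 as the source (the payload values are made up):

```python
import torch.distributed as dist

# Every rank provides a list of the same length; rank 0's contents are broadcast in place.
if dist.get_rank() == 0:
    objects = [{"lr": 0.01}, "epoch-3-checkpoint", 42]
else:
    objects = [None, None, None]

dist.broadcast_object_list(objects, src=0)
# After the call, `objects` on every rank holds rank 0's values.
```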
rohan-varma added a commit that referenced this pull request Sep 1, 2020
Closes #23232. As part of addressing #23232, this PR adds support for `scatter_object_list`, an API to scatter arbitrary picklable objects to all the other ranks.

The implementation approach follows a similar approach as #42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter.

Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as their true sizes.

Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.

It only works for Gloo because NCCL doesn't support scatter.

Differential Revision: [D23430686](https://our.internmc.facebook.com/intern/diff/D23430686/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23430686/)!

[ghstack-poisoned]
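
A usage sketch of the scatter API described above, assuming the Gloo backend and an initialized process group (the payloads are made up); the scattered object for each rank lands in the first element of the output list:

```python
import torch.distributed as dist

world_size = dist.get_world_size()
output_list = [None]  # scatter_object_list writes this rank's object into output_list[0]
if dist.get_rank() == 0:
    input_list = ["payload-for-rank-%d" % rank for rank in range(world_size)]
else:
    input_list = [None] * world_size

dist.scatter_object_list(output_list, input_list, src=0)
print(output_list[0])
```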
facebook-github-bot pushed a commit that referenced this pull request Sep 2, 2020
Summary:
Pull Request resolved: #43887

ghstack-source-id: 111180436

Reviewed By: mrshenli

Differential Revision: D23422577

fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
facebook-github-bot pushed a commit that referenced this pull request Dec 5, 2020
Summary:
Pull Request resolved: #43930

ghstack-source-id: 117904065

Reviewed By: mrshenli

Differential Revision: D23430686

fbshipit-source-id: f033b89cd82dadd194f2b036312a98423449c26b