
Conversation


@wanchaol wanchaol commented Feb 28, 2022

Stack from ghstack (oldest at bottom):

Add ReplicatedTensor. A ReplicatedTensor is a type of tensor that has the same value on all ranks across the world_size.

ReplicatedTensor is a :class:`~torch.Tensor` subclass, and it can be used together with ShardedTensor/Tensor to express different types of computation. The inter-op rules are defined as follows (using torch.add as an example op):
    ReplicatedTensor + ReplicatedTensor = ReplicatedTensor
    ReplicatedTensor + torch.Tensor = torch.Tensor
    ReplicatedTensor + ShardedTensor = ShardedTensor

We also added a `validate()` API to help users check whether a replicated tensor on a certain process_group is truly replicated or not.

TODO: the next PR will add ShardedTensor/PartialTensor logic to handle ReplicatedTensor.

Differential Revision: [D34529374](https://our.internmc.facebook.com/intern/diff/D34529374/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D34529374/)!
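
To make the rules above concrete, here is a minimal usage sketch (illustrative only; the import path and the already-initialized process group are assumptions, not verbatim from this PR or its tests):

```python
import torch
import torch.distributed as dist
# assumed import path for illustration; adjust to wherever ReplicatedTensor lands
from torch.distributed._shard.replicated_tensor import ReplicatedTensor

assert dist.is_initialized()  # the sketch assumes init_process_group() ran on every rank

local = torch.ones(3, 3) * 4
replica = ReplicatedTensor(local)

# ReplicatedTensor + ReplicatedTensor -> ReplicatedTensor
out1 = replica + ReplicatedTensor(torch.ones(3, 3))
assert isinstance(out1, ReplicatedTensor)

# ReplicatedTensor + torch.Tensor -> plain torch.Tensor
out2 = replica + torch.ones(3, 3)
assert isinstance(out2, torch.Tensor) and not isinstance(out2, ReplicatedTensor)

# check that the value really is identical on every rank
assert replica.validate()
```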


facebook-github-bot commented Feb 28, 2022


💊 CI failures summary and remediations

As of commit 64f70b1 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build pull / pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-03-24T00:21:40.0074595Z RuntimeError: /var...or_util.cpp:1109 : Type not supported: ComplexHalf
2022-03-24T00:21:40.0069735Z ----------------------------------------------------------------------
2022-03-24T00:21:40.0070114Z Traceback (most recent call last):
2022-03-24T00:21:40.0070759Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 389, in instantiated_test
2022-03-24T00:21:40.0071275Z     raise rte
2022-03-24T00:21:40.0071867Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 376, in instantiated_test
2022-03-24T00:21:40.0072302Z     result = test(self, **param_kwargs)
2022-03-24T00:21:40.0072686Z   File "/var/lib/jenkins/workspace/xla/test/../../test/test_torch.py", line 5198, in test_copy_
2022-03-24T00:21:40.0073073Z     src = make_tensor_wrapper((50,), dtype=src_dtype)
2022-03-24T00:21:40.0073621Z   File "/var/lib/jenkins/workspace/xla/test/../../test/test_torch.py", line 5193, in make_tensor_wrapper
2022-03-24T00:21:40.0074066Z     return torch.randn(shape, device=device, dtype=dtype)
2022-03-24T00:21:40.0074595Z RuntimeError: /var/lib/jenkins/workspace/xla/torch_xla/csrc/tensor_util.cpp:1109 : Type not supported: ComplexHalf
2022-03-24T00:21:40.0074957Z 
2022-03-24T00:21:40.1030718Z ----------------------------------------------------------------------
2022-03-24T00:21:40.1031172Z Ran 594 tests in 558.694s
2022-03-24T00:21:40.1031293Z 
2022-03-24T00:21:40.1031416Z FAILED (errors=9, skipped=367, expected failures=27)
2022-03-24T00:21:40.1031565Z 
2022-03-24T00:21:40.1031651Z Generating XML reports...
2022-03-24T00:21:40.1032121Z Generated XML report: test-reports/python-unittest/test.......test.test_torch/TEST-TestTorchDeviceTypeXLA-20220324001221.xml
2022-03-24T00:21:40.4660003Z + cleanup
2022-03-24T00:21:40.4660338Z + retcode=1

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.



pytorch-bot bot commented Feb 28, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/89bbf1737460e72ceaf04c13f6d958d3244863e8/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-manywheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped

@facebook-github-bot facebook-github-bot added the cla signed and oncall: distributed (Add this issue/PR to distributed oncall triage queue) labels Feb 28, 2022
wanchaol pushed a commit that referenced this pull request Feb 28, 2022
@wanchaol wanchaol requested a review from fduwjj February 28, 2022 20:04

@fduwjj fduwjj left a comment

It looks great and thanks for working on this!


class ReplicatedTensor(torch.Tensor):
    """
    ReplicatedTensor represents a tensor which is replicated across the world_size and
Contributor

Nit: world_size?

inter-op rules defined as (using torch.add as an example op):
ReplicatedTensor + ReplicatedTensor = ReplicatedTensor
ReplicatedTensor + torch.Tensor = torch.Tensor
ReplicatedTensor + ShardedTensor = ShardedTensor
Contributor

Can we also add one for _PartialTensor?

Collaborator Author

I thought about it when adding the comment, but I decided to leave _PartialTensor out because:

  1. It's not a public API yet.
  2. Ideally PartialTensor is not something users need to be aware of or worry about; it's more like an intermediate result handled by our internal system. So I'm worried that if we add the comment here, users might be a bit confused and feel they need to learn what a PartialTensor is.

Let me know if that makes sense or not :)

@fduwjj fduwjj Mar 16, 2022

Got it, makes sense. We can always change it later on.

return f"ReplicatedTensor({super(ReplicatedTensor, self).__repr__()})"

@classmethod
def __torch_function__(cls, func, types, args=(), kwargs=None):
Contributor

Just one n00b question here: with this override, we can enable the comparison in the tests, right? (Like self.assertEqual?)

Collaborator Author

Do you mean comparison between ReplicatedTensor and Tensor, or ShardedTensor? Since ReplicatedTensor is a subclass of Tensor, it should work well with Tensor; I have to check whether it works with ShardedTensor and will try adding some tests there.

Collaborator Author

Just checked: there's already assertEqual support between Tensor and ReplicatedTensor and it works, but ReplicatedTensor vs. ShardedTensor is not yet supported; we'd need to add handling logic to binary_cmp. In theory, I guess a ShardedTensor will never equal a ReplicatedTensor since they have different topologies? We might need to define the rule for this here. cc @pritamdamania

    # based on the inter-op rules we defined.
    with torch._C.DisableTorchFunction():
        rs = func(*new_args, **new_kwargs)
    if func in get_default_nowrap_functions():
Contributor

For learning purposes: does this typically mean situations like field access, e.g. t.grad?

Collaborator Author

Yeah, it's more related to field access. This was copied from Tensor.__torch_function__; we don't want to go into that default handling because we manage our own output types through the rules we defined.
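
For context, a condensed sketch of the __torch_function__ pattern being discussed (a hypothetical minimal subclass, not this PR's exact code; the real ReplicatedTensor applies its inter-op rules where the comment indicates):

```python
import torch
from torch.overrides import get_default_nowrap_functions

class _SketchTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # run the op with torch_function handling disabled so we don't recurse
        with torch._C.DisableTorchFunction():
            rs = func(*args, **kwargs)
        if func in get_default_nowrap_functions():
            # attribute-style accesses such as Tensor.grad: return the raw
            # result instead of rewrapping it into the subclass
            return rs
        # ...the real ReplicatedTensor decides the output type here based on
        # its inter-op rules (Replicated/Sharded/plain Tensor)...
        return rs
```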


def _replicate_tensor(tensor: torch.Tensor) -> ReplicatedTensor:
    """
    Given a :class:`torch.Tensor`, mark it as a ReplicatedTensore where all
Contributor

nit: s/ReplicatedTensore/ReplicatedTensor

setattr(module, param_name, st)


def _replicate_tensor(tensor: torch.Tensor) -> ReplicatedTensor:
Contributor

Do we really need an API like this? Users can just call ReplicatedTensor(tensor)?

@wanchaol wanchaol Mar 16, 2022

My original intention in introducing this API was to make sure we provide consistent APIs to the user. We now have:

  1. shard_module(module,plan)
  2. shard_parameter(module, param_name, spec)
  3. _shard_tensor(tensor, spec)

All three of these APIs are used to mark a param or tensor as a ShardedTensor, so I feel we should have a similar API to mark a tensor as a ReplicatedTensor; it makes the API surface more consistent from the user's perspective. Let me know if this makes sense, or whether we could just use ReplicatedTensor(tensor).

Comment on lines +21 to +25
NOTE: We do not guarantee equal content of ReplicatedTensor across nodes after its
construction. Although we defined proper inter-op rules to make sure ReplicatedTensor
stays the same, there's no enforcement on it (i.e. if you manually modify content on
some ranks, the modified value will not automatically get synced to other nodes). If
you wish to manually validate tensors are the same across ranks, use `validate()`.
Contributor

Not for this PR, but I think we need two modes for ReplicatedTensor. The first mode is what we have here where it is just a tag to help sharded computations, but probably this should not be the default mode.

I think the default mode should be similar to DDP, where ReplicatedTensor broadcasts the torch.Tensor from rank 0 (the source rank could probably also be specified optionally). Then, in the backward pass for this mode, we always allreduce the gradients for the ReplicatedTensor. This means ReplicatedTensor can stand on its own.
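
As a rough illustration of that suggested default mode (not part of this PR; the helper name and signature are made up), the construction-time broadcast could look like:

```python
import torch
import torch.distributed as dist

def _broadcast_then_replicate(tensor: torch.Tensor, src: int = 0, process_group=None) -> torch.Tensor:
    # make every rank hold rank `src`'s value before treating the tensor as replicated
    dist.broadcast(tensor, src=src, group=process_group)
    return tensor  # a real implementation would wrap this in ReplicatedTensor
```

The gradient allreduce in the backward pass would be handled separately, e.g. via autograd hooks, as DDP does.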

    you wish to manually validate tensors are the same across ranks, use `validate()`.
    """
    def __new__(cls, data=None):
Contributor

ReplicatedTensor should take an optional process_group indicating the replication environment.

Collaborator Author

Yeah, I agree we should tie ReplicatedTensor to a replication env (a pg), but I didn't do this initially because it requires the Tensor to hold metadata as a member, and torch.Tensor._make_subclass doesn't work that way. Let me see if I can change it to _make_wrapper_subclass and define __torch_dispatch__ together with __torch_function__.

Collaborator Author

I actually found a way to propagate the field even with _make_subclass, so there's no need to define __torch_dispatch__ yet; just updated the PR.
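
A minimal sketch of that approach (illustrative; class name, signature, and defaulting are assumptions, not this PR's exact code):

```python
import torch

class _ReplicatedTensorSketch(torch.Tensor):
    def __new__(cls, data=None, process_group=None):
        if data is None:
            data = torch.empty(0)
        # _make_subclass reuses `data`'s storage and only changes the Python type
        r = torch.Tensor._make_subclass(cls, data, data.requires_grad)
        # a plain instance attribute is enough to carry the replication environment
        r.process_group = process_group
        return r
```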

    # back to tensor subclasses, where in our case, we need to control the output type
    # based on the inter-op rules we defined.
    with torch._C.DisableTorchFunction():
        rs = func(*new_args, **new_kwargs)
Contributor

Why do we need new_args and new_kwargs here? Can't we just pass in args and kwargs?


return rs

def validate(self, process_group=None) -> bool:
Contributor

process_group should be passed in during construction of ReplicatedTensor.
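
For context, a rough sketch of what such a `validate()` check can do under the hood (assumed, not necessarily this PR's implementation): gather every rank's copy and compare it against the local one.

```python
import torch
import torch.distributed as dist

def validate_replication(tensor: torch.Tensor, process_group=None) -> bool:
    world_size = dist.get_world_size(process_group)
    gathered = [torch.empty_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor, group=process_group)
    # truly replicated means every rank's copy matches the local value
    return all(torch.equal(tensor, other) for other in gathered)
```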

        rs = func(*new_args, **new_kwargs)
    if func in get_default_nowrap_functions():
        return rs
    if not has_tensor and isinstance(rs, torch.Tensor) and not isinstance(rs, cls):
Contributor

If we are adding two ReplicatedTensors here, shouldn't we validate they are on the same PG? I feel this check might be more clear if we assert all args are ReplicatedTensor and only in that case we return a ReplicatedTensor using rs.as_subclass. In all other cases, we return rs.

# validate it's a replicated tensor by checking values on all rank
validated = replica_tensor.validate()
self.assertEqual(validated, True)
self.assertEqual(replica_tensor + 2, torch.ones(3, 3) * 6)
Contributor

nit: validate type of replica_tensor + 2

Collaborator Author

I guess the type of replica_tensor + 2 should become a plain Tensor instead of a ReplicatedTensor.

Comment on lines 40 to 46
replica_tensor1 = ReplicatedTensor(local_tensor * 4)
replica_tensor2 = ReplicatedTensor(local_tensor * 6)

new_tensor = replica_tensor1 * replica_tensor2
self.assertTrue(isinstance(new_tensor, ReplicatedTensor))
self.assertEqual(new_tensor, torch.ones(3, 3) * 24)

Contributor

This should work only if the PGs are the same.

Comment on lines +50 to +53
def test_replicated_tensor_inter_op_tensor(self):
local_tensor = torch.ones(3, 3, device=f"cuda:{self.rank}") * 4
replica_tensor = ReplicatedTensor(local_tensor)

Contributor

Are these two interops coming in follow up PRs?

  1. ShardedTensor + ReplicatedTensor
  2. PartialTensor + ReplicatedTensor

Collaborator Author

Yep, ShardedTensor + ReplicatedTensor is coming in the next PR; PartialTensor should be a follow-up PR as well.

@pritamdamania87 pritamdamania87 left a comment

Looks good! Just a minor comment regarding some refactoring.

Comment on lines 75 to 88
    if isinstance(v, ShardedTensor):
        # redispatch to ShardedTensor
        # TODO: handle ShardedTensor/PartialTensor inter-op with ReplicatedTensor
        return v.__torch_function__(func, types, args, kwargs)
    if isinstance(v, ReplicatedTensor):
        if replicated_pg is None:
            replicated_pg = v.process_group
        elif replicated_pg != v.process_group:
            raise RuntimeError(
                f"ReplicatedTensor operands must be in the same process group "
                f"in torch function '{func.__name__}', but found at least two "
                f"ReplicatedTensor operands in different process groups! ")
    else:
        new_kwargs[k] = v
        all_replicated = False
Contributor

nit: repeated code between args and kwargs, maybe create a simple inline helper function and dedup this.
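
A loose sketch of the suggested helper (not this PR's code; the import path is an assumption), so positional and keyword operands can share one code path for the process-group check:

```python
import torch
# assumed import path for illustration
from torch.distributed._shard.replicated_tensor import ReplicatedTensor

def _common_replicated_pg(operands, replicated_pg=None):
    # walk positional and keyword operands once, enforcing that every
    # ReplicatedTensor operand belongs to the same process group
    for v in operands:
        if isinstance(v, ReplicatedTensor):
            if replicated_pg is None:
                replicated_pg = v.process_group
            elif replicated_pg != v.process_group:
                raise RuntimeError(
                    "ReplicatedTensor operands must be in the same process group")
    return replicated_pg

# inside __torch_function__ it could then be called once:
#   replicated_pg = _common_replicated_pg(list(args) + list(kwargs.values()))
```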

wanchaol added 2 commits March 23, 2022 12:14
facebook-github-bot pushed a commit that referenced this pull request Mar 24, 2022
Summary:
Pull Request resolved: #73529

ghstack-source-id: 152064781

Test Plan: test_replicated_tensor

Reviewed By: pritamdamania87, fduwjj

Differential Revision: D34529374

fbshipit-source-id: 16ccb300e9f9c47ac29a17eb6d46d029ab7d60b8
@github-actions

Hey @wanchaol.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

shahofblah pushed a commit that referenced this pull request Mar 25, 2022
(cherry picked from commit 44f4e11)
@facebook-github-bot facebook-github-bot deleted the gh/wanchaol/204/head branch March 27, 2022 14:17

Labels

cla signed, oncall: distributed (Add this issue/PR to distributed oncall triage queue), release notes: distributed (sharded) (release notes category), sharded_tensor

6 participants