|
1 | 1 | .. role:: hidden |
2 | 2 | :class: hidden-section |
3 | 3 |
|
4 | | -Tensor Parallelism |
5 | | -======================== |
6 | | -.. py:module:: torch.distributed.tensor.parallel |
| 4 | +Tensor Parallelism - torch.distributed.tensor.parallel |
| 5 | +====================================================== |
| 6 | + |
| 7 | +We built Tensor Parallelism (TP) on top of DistributedTensor (DTensor) and
| 8 | +provide several parallelism styles: Rowwise, Colwise, and Pairwise Parallelism.
| 9 | + |
| 10 | +.. warning::
| 11 | + Tensor Parallelism is experimental and subject to change. |
| 12 | +
|
| 13 | +The entrypoint to parallelize your module using Tensor Parallelism is:
| 14 | + |
| 15 | +.. automodule:: torch.distributed.tensor.parallel |
| 16 | + |
7 | 17 | .. currentmodule:: torch.distributed.tensor.parallel |
| 18 | + |
| 19 | +.. autofunction:: parallelize_module |
| 20 | + |
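
As a rough orientation for the entrypoint above, here is a minimal sketch that parallelizes a toy two-layer MLP with ``PairwiseParallel``. The module name ``MLP`` and its ``net1``/``net2`` layers are made up for illustration; the sketch assumes the experimental ``DeviceMesh`` lives in ``torch.distributed._tensor``, that the default process group is already initialized (e.g. via ``torchrun``) with the right CUDA device selected per rank, and that feature dimensions are divisible by the number of TP ranks::

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed._tensor import DeviceMesh  # assumed location of the experimental DeviceMesh
    from torch.distributed.tensor.parallel import parallelize_module
    from torch.distributed.tensor.parallel.style import PairwiseParallel


    class MLP(nn.Module):
        """Toy block: PairwiseParallel shards net1 column-wise and net2 row-wise."""

        def __init__(self):
            super().__init__()
            self.net1 = nn.Linear(16, 32)
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(32, 16)

        def forward(self, x):
            return self.net2(self.relu(self.net1(x)))


    # A 1-D device mesh spanning every rank that participates in tensor parallelism.
    mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))

    model = parallelize_module(MLP().cuda(), mesh, PairwiseParallel())
    out = model(torch.rand(8, 16, device="cuda"))
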
| 21 | +Tensor Parallelism supports the following parallel styles: |
| 22 | + |
| 23 | +.. autoclass:: torch.distributed.tensor.parallel.style.RowwiseParallel |
| 24 | + :members: |
| 25 | + |
| 26 | +.. autoclass:: torch.distributed.tensor.parallel.style.ColwiseParallel |
| 27 | + :members: |
| 28 | + |
| 29 | +.. autoclass:: torch.distributed.tensor.parallel.style.PairwiseParallel |
| 30 | + :members: |
| 31 | + |
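
The Rowwise and Colwise styles can also be assigned per submodule by passing a dict plan to ``parallelize_module``. The sketch below continues the toy ``MLP`` and ``mesh`` from the previous example; the ``device_mesh`` and ``parallelize_plan`` keyword names follow the entrypoint above, and the sharding comments describe commonly assumed default placements rather than guaranteed behavior::

    from torch.distributed.tensor.parallel import parallelize_module
    from torch.distributed.tensor.parallel.style import ColwiseParallel, RowwiseParallel

    # Map fully qualified child-module names to a parallel style each,
    # instead of applying PairwiseParallel to the whole block.
    plan = {
        "net1": ColwiseParallel(),  # shard net1's weight along its output dimension
        "net2": RowwiseParallel(),  # shard net2's weight along its input dimension
    }
    sharded_mlp = parallelize_module(MLP().cuda(), device_mesh=mesh, parallelize_plan=plan)
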
| 32 | +Because we use DTensor within Tensor Parallelism, we need to specify the
| 33 | +input and output placements of the module with DTensors so that it interacts
| 34 | +as expected with the modules before and after it. The following functions are
| 35 | +used for input/output preparation:
| 36 | + |
| 37 | + |
| 38 | +.. currentmodule:: torch.distributed.tensor.parallel.style |
| 39 | + |
| 40 | +.. autofunction:: make_input_replicate_1d |
| 41 | +.. autofunction:: make_input_shard_1d |
| 42 | +.. autofunction:: make_input_shard_1d_last_dim |
| 43 | +.. autofunction:: make_output_replicate_1d |
| 44 | +.. autofunction:: make_output_tensor |
| 45 | +.. autofunction:: make_output_shard_1d |
| 46 | + |
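
As a hedged sketch of what these helpers do: the ``make_input_*`` functions turn a regular ``torch.Tensor`` (or an existing DTensor) into a DTensor with the desired placement on a 1-D device mesh, and the ``make_output_*`` functions convert a module's DTensor output back into the placement (or plain tensor) that downstream code expects. The keyword names (``device_mesh``, ``dim``) are assumptions about this experimental API, and ``mesh`` is the device mesh from the first sketch::

    import torch
    from torch.distributed.tensor.parallel.style import (
        make_input_replicate_1d,
        make_input_shard_1d,
        make_output_tensor,
    )

    x = torch.rand(8, 16, device="cuda")

    # Turn a plain local tensor into a DTensor replicated across the 1-D TP mesh.
    replicated = make_input_replicate_1d(x, device_mesh=mesh)

    # Or build a DTensor that is sharded along dim 0 over the mesh instead.
    sharded = make_input_shard_1d(x, device_mesh=mesh, dim=0)

    # Convert a DTensor coming out of a parallelized module back into a
    # regular torch.Tensor for downstream code.
    plain = make_output_tensor(replicated, device_mesh=mesh)

In practice these helpers are usually supplied to a parallel style as its input/output preparation hooks rather than called by hand; the constructor parameters for doing so are version dependent and not shown here.
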
| 47 | +Currently, there are some constraints that make it hard for ``nn.MultiheadAttention``
| 48 | +to work out of the box with Tensor Parallelism, so we provide a custom multihead
| 49 | +attention module for Tensor Parallelism users. Also, ``parallelize_module`` automatically
| 50 | +swaps ``nn.MultiheadAttention`` for this custom module when ``PairwiseParallel`` is specified.
| 51 | + |
| 52 | +.. autoclass:: torch.distributed.tensor.parallel.multihead_attention_tp.TensorParallelMultiheadAttention |
| 53 | + :members: |
| 54 | + |
| 55 | +We also support 2D parallelism, which composes Tensor Parallelism with
| 56 | +``FullyShardedDataParallel``. Users just need to call the following API explicitly:
| 57 | + |
| 58 | + |
| 59 | +.. currentmodule:: torch.distributed.tensor.parallel.fsdp |
| 60 | +.. autofunction:: is_available |
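
A hedged sketch of the intended composition, assuming ``model`` is the tensor-parallelized module from the sketches above and that ``is_available`` reports (and enables) the DTensor-aware FSDP support; a real 2-D setup would additionally hand FSDP the data-parallel process group via its ``process_group`` argument, which is elided here::

    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.tensor.parallel.fsdp import is_available

    # Call the 2D-parallelism hook before wrapping the TP-parallelized model
    # with FSDP for the data-parallel dimension.
    if is_available():
        model = FSDP(model)
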