
Commit 752c129

zasdfgbnmgchanan authored and committed
Update docs about DP and DDP for CUDA (#35063)
Summary: We should recommend DDP instead of DP. Hope we can also cherry-pick this for 1.5.

Pull Request resolved: #35063
Differential Revision: D20549621
Pulled By: ngimel
fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
1 parent fb59a9c commit 752c129

File tree: 5 files changed, +27 −12 lines

docs/source/distributed.rst
Lines changed: 2 additions & 0 deletions

@@ -395,6 +395,8 @@ of 16
 .. autofunction:: all_gather_multigpu
 
 
+.. _distributed-launch:
+
 Launch utility
 --------------

docs/source/notes/cuda.rst
Lines changed: 19 additions & 9 deletions

@@ -306,20 +306,30 @@ to overlap data transfers with computation.
 You can make the :class:`~torch.utils.data.DataLoader` return batches placed in
 pinned memory by passing ``pin_memory=True`` to its constructor.
 
-.. _cuda-nn-dataparallel-instead:
+.. _cuda-nn-ddp-instead:
 
-Use nn.DataParallel instead of multiprocessing
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Most use cases involving batched inputs and multiple GPUs should default to
-using :class:`~torch.nn.DataParallel` to utilize more than one GPU. Even with
-the GIL, a single Python process can saturate multiple GPUs.
-
-As of version 0.1.9, large numbers of GPUs (8+) might not be fully utilized.
-However, this is a known issue that is under active development. As always,
-test your use case.
+using :class:`~torch.nn.parallel.DistributedDataParallel` to utilize more
+than one GPU.
 
 There are significant caveats to using CUDA models with
 :mod:`~torch.multiprocessing`; unless care is taken to meet the data handling
 requirements exactly, it is likely that your program will have incorrect or
 undefined behavior.
+
+It is recommended to use :class:`~torch.nn.parallel.DistributedDataParallel`,
+instead of :class:`~torch.nn.DataParallel`, to do multi-GPU training, even if
+there is only a single node.
+
+The difference between :class:`~torch.nn.parallel.DistributedDataParallel` and
+:class:`~torch.nn.DataParallel` is that :class:`~torch.nn.parallel.DistributedDataParallel`
+uses multiprocessing, creating one process per GPU, while
+:class:`~torch.nn.DataParallel` uses multithreading. With multiprocessing,
+each GPU has its own dedicated process, which avoids the performance overhead
+caused by the GIL of the Python interpreter.
+
+If you use :class:`~torch.nn.parallel.DistributedDataParallel`, you can use the
+`torch.distributed.launch` utility to launch your program; see :ref:`distributed-launch`.

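For context, here is a minimal sketch (not part of this commit) of the pattern the new paragraph describes: a script, hypothetically named main.py, that every process spawned by `torch.distributed.launch` runs, with one process driving one GPU. The script name and the toy model are illustrative assumptions.

    # Intended to be launched as, e.g.:
    #   python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main.py
    import argparse

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every process it spawns.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Bind this process to its own GPU, then join the process group using the
    # environment variables set by the launcher.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    model = nn.Linear(10, 10).cuda(args.local_rank)  # toy model for illustration
    ddp_model = DDP(model, device_ids=[args.local_rank],
                    output_device=args.local_rank)

Each process then typically trains on its own shard of the data, for example by selecting samples with torch.utils.data.distributed.DistributedSampler.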

docs/source/notes/multiprocessing.rst
Lines changed: 1 addition & 1 deletion

@@ -45,7 +45,7 @@ the consumer process has references to the tensor, and the refcounting can not
 save you if the consumer process exits abnormally via a fatal signal. See
 :ref:`this section <multiprocessing-cuda-sharing-details>`.
 
-See also: :ref:`cuda-nn-dataparallel-instead`
+See also: :ref:`cuda-nn-ddp-instead`
 
 
 Best practices and tips

torch/nn/parallel/data_parallel.py
Lines changed: 4 additions & 1 deletion

@@ -45,7 +45,10 @@ class DataParallel(Module):
 
     The batch size should be larger than the number of GPUs used.
 
-    See also: :ref:`cuda-nn-dataparallel-instead`
+    .. warning::
+        It is recommended to use :class:`~torch.nn.parallel.DistributedDataParallel`,
+        instead of this class, to do multi-GPU training, even if there is only a single
+        node. See: :ref:`cuda-nn-ddp-instead` and :ref:`ddp`.
 
     Arbitrary positional and keyword inputs are allowed to be passed into
     DataParallel but some types are specially handled. tensors will be

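For reference, a minimal sketch (not taken from this diff) of the single-process, multi-threaded usage the DataParallel docstring describes; as the new warning above says, DistributedDataParallel remains the recommended choice. The toy module and batch size are placeholder assumptions.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 10)
    if torch.cuda.device_count() > 1:
        # DataParallel replicates the module onto each visible GPU within one
        # process and scatters the input batch along dim 0.
        model = nn.DataParallel(model)
    model = model.cuda()

    inputs = torch.randn(64, 10).cuda()  # batch size (64) > number of GPUs
    outputs = model(inputs)              # outputs gathered on the default device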

torch/nn/parallel/distributed.py
Lines changed: 1 addition & 1 deletion

@@ -42,7 +42,7 @@ class DistributedDataParallel(Module):
 
     The batch size should be larger than the number of GPUs used locally.
 
-    See also: :ref:`distributed-basics` and :ref:`cuda-nn-dataparallel-instead`.
+    See also: :ref:`distributed-basics` and :ref:`cuda-nn-ddp-instead`.
     The same constraints on input as in :class:`torch.nn.DataParallel` apply.
 
     Creation of this class requires that ``torch.distributed`` to be already

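A minimal sketch (illustrative, not from this commit) of the requirement in the context line above: ``torch.distributed`` must be initialized in each process before DistributedDataParallel is constructed, here with one spawned process per GPU. The master address, port, and toy model are assumptions.

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP


    def worker(rank, world_size):
        # torch.distributed must already be initialized before DDP is created.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")  # assumed free port
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = nn.Linear(10, 10).cuda(rank)       # toy model for illustration
        ddp_model = DDP(model, device_ids=[rank])  # one process drives one GPU

        dist.destroy_process_group()


    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(worker, args=(world_size,), nprocs=world_size)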
