
Commit 752c129

zasdfgbnmgchanan authored and committed
Update docs about DP and DDP for CUDA (#35063)
Summary: We should recommend DDP instead of DP. Hope we can also cherry-pick this for 1.5.

Pull Request resolved: #35063
Differential Revision: D20549621
Pulled By: ngimel
fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
1 parent fb59a9c commit 752c129

File tree: 5 files changed, +27 −12 lines

docs/source/distributed.rst
Lines changed: 2 additions & 0 deletions

@@ -395,6 +395,8 @@ of 16
 .. autofunction:: all_gather_multigpu
 
 
+.. _distributed-launch:
+
 Launch utility
 --------------

docs/source/notes/cuda.rst
Lines changed: 19 additions & 9 deletions

@@ -306,20 +306,30 @@ to overlap data transfers with computation.
 You can make the :class:`~torch.utils.data.DataLoader` return batches placed in
 pinned memory by passing ``pin_memory=True`` to its constructor.
 
-.. _cuda-nn-dataparallel-instead:
+.. _cuda-nn-ddp-instead:
 
-Use nn.DataParallel instead of multiprocessing
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Most use cases involving batched inputs and multiple GPUs should default to
-using :class:`~torch.nn.DataParallel` to utilize more than one GPU. Even with
-the GIL, a single Python process can saturate multiple GPUs.
-
-As of version 0.1.9, large numbers of GPUs (8+) might not be fully utilized.
-However, this is a known issue that is under active development. As always,
-test your use case.
+using :class:`~torch.nn.parallel.DistributedDataParallel` to utilize more
+than one GPU.
 
 There are significant caveats to using CUDA models with
 :mod:`~torch.multiprocessing`; unless care is taken to meet the data handling
 requirements exactly, it is likely that your program will have incorrect or
 undefined behavior.
+
+It is recommended to use :class:`~torch.nn.parallel.DistributedDataParallel`,
+instead of :class:`~torch.nn.DataParallel`, to do multi-GPU training, even if
+there is only a single node.
+
+The difference between :class:`~torch.nn.parallel.DistributedDataParallel` and
+:class:`~torch.nn.DataParallel` is that :class:`~torch.nn.parallel.DistributedDataParallel`
+uses multiprocessing, creating one process per GPU, while
+:class:`~torch.nn.DataParallel` uses multithreading. With multiprocessing,
+each GPU has its own dedicated process, which avoids the performance overhead
+caused by the GIL of the Python interpreter.
+
+If you use :class:`~torch.nn.parallel.DistributedDataParallel`, you can use the
+`torch.distributed.launch` utility to launch your program; see :ref:`distributed-launch`.

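For context, here is a minimal sketch (not part of this commit) of the pattern the new paragraph describes: a script, hypothetically named main.py, that every process spawned by `torch.distributed.launch` runs, with one process driving one GPU. The script name and the toy model are illustrative assumptions.

    # Intended to be launched as, e.g.:
    #   python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main.py
    import argparse

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every process it spawns.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Bind this process to its own GPU, then join the process group using the
    # environment variables set by the launcher.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    model = nn.Linear(10, 10).cuda(args.local_rank)  # toy model for illustration
    ddp_model = DDP(model, device_ids=[args.local_rank],
                    output_device=args.local_rank)

Each process then typically trains on its own shard of the data, for example by selecting samples with torch.utils.data.distributed.DistributedSampler.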

docs/source/notes/multiprocessing.rst
Lines changed: 1 addition & 1 deletion

@@ -45,7 +45,7 @@ the consumer process has references to the tensor, and the refcounting can not
 save you if the consumer process exits abnormally via a fatal signal. See
 :ref:`this section <multiprocessing-cuda-sharing-details>`.
 
-See also: :ref:`cuda-nn-dataparallel-instead`
+See also: :ref:`cuda-nn-ddp-instead`
 
 
 Best practices and tips

torch/nn/parallel/data_parallel.py
Lines changed: 4 additions & 1 deletion

@@ -45,7 +45,10 @@ class DataParallel(Module):
 
     The batch size should be larger than the number of GPUs used.
 
-    See also: :ref:`cuda-nn-dataparallel-instead`
+    .. warning::
+        It is recommended to use :class:`~torch.nn.parallel.DistributedDataParallel`,
+        instead of this class, to do multi-GPU training, even if there is only a single
+        node. See: :ref:`cuda-nn-ddp-instead` and :ref:`ddp`.
 
     Arbitrary positional and keyword inputs are allowed to be passed into
     DataParallel but some types are specially handled. tensors will be

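For reference, a minimal sketch (not taken from this diff) of the single-process, multi-threaded usage the DataParallel docstring describes; as the new warning above says, DistributedDataParallel remains the recommended choice. The toy module and batch size are placeholder assumptions.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 10)
    if torch.cuda.device_count() > 1:
        # DataParallel replicates the module onto each visible GPU within one
        # process and scatters the input batch along dim 0.
        model = nn.DataParallel(model)
    model = model.cuda()

    inputs = torch.randn(64, 10).cuda()  # batch size (64) > number of GPUs
    outputs = model(inputs)              # outputs gathered on the default device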

torch/nn/parallel/distributed.py
Lines changed: 1 addition & 1 deletion

@@ -42,7 +42,7 @@ class DistributedDataParallel(Module):
 
     The batch size should be larger than the number of GPUs used locally.
 
-    See also: :ref:`distributed-basics` and :ref:`cuda-nn-dataparallel-instead`.
+    See also: :ref:`distributed-basics` and :ref:`cuda-nn-ddp-instead`.
     The same constraints on input as in :class:`torch.nn.DataParallel` apply.
 
     Creation of this class requires that ``torch.distributed`` to be already

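A minimal sketch (illustrative, not from this commit) of the requirement in the context line above: ``torch.distributed`` must be initialized in each process before DistributedDataParallel is constructed, here with one spawned process per GPU. The master address, port, and toy model are assumptions.

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP


    def worker(rank, world_size):
        # torch.distributed must already be initialized before DDP is created.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")  # assumed free port
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = nn.Linear(10, 10).cuda(rank)       # toy model for illustration
        ddp_model = DDP(model, device_ids=[rank])  # one process drives one GPU

        dist.destroy_process_group()


    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(worker, args=(world_size,), nprocs=world_size)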
