Commit fa4ca4e

mrshenli authored and facebook-github-bot committed
Emphasize all DDP forward() outputs must participate in computing loss (#20586)
Summary: CC borguz chenyangyu1988
Pull Request resolved: #20586
Reviewed By: ezyang
Differential Revision: D15373674
Pulled By: mrshenli
fbshipit-source-id: b986918b3592616a9bcc88fba1b8fd53016f68d7
1 parent c941abb commit fa4ca4e

2 files changed: +19 −9 lines changed

torch/csrc/distributed/c10d/reducer.cpp

Lines changed: 9 additions & 7 deletions
@@ -395,17 +395,19 @@ void Reducer::prepare_for_backward(
       "starting a new one. ",
       "",
       "This error indicates that your module has parameters that were ",
-      "not used in producing its output (the return value of `forward`). ",
+      "not used in producing loss. ",
       "",
-      "You can enable unused parameter detection by passing the keyword "
+      "You can enable unused parameter detection by (1) passing the keyword "
       "argument `find_unused_parameters=True` to ",
-      "`torch.nn.parallel.DistributedDataParallel`. ",
+      "`torch.nn.parallel.DistributedDataParallel`; (2) making sure all ",
+      "`forward` function outputs participate in calculating loss. "
       "",
-      "If you already have this argument set, then the distributed data ",
-      "parallel module wasn't able to locate the output tensors in the ",
+      "If you already have done the above two steps, then the distributed ",
+      "data parallel module wasn't able to locate the output tensors in the ",
       "return value of your module's `forward` function. ",
-      "Please include the structure of the return value of `forward` of ",
-      "your module when reporting this issue (e.g. list, dict, iterable).");
+      "Please include the loss function and the structure of the return ",
+      "value of `forward` of your module when reporting this issue (e.g. ",
+      "list, dict, iterable).");
   }

   // Reset accounting.
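For context, a minimal sketch of the situation this message describes, using a hypothetical TwoHeadNet module: one forward output is derived from module parameters but never contributes to the loss, so those parameters never receive gradients.

    import torch
    import torch.nn as nn

    class TwoHeadNet(nn.Module):
        """Hypothetical module whose second output never feeds the loss."""

        def __init__(self):
            super().__init__()
            self.main_head = nn.Linear(10, 10)
            self.aux_head = nn.Linear(10, 10)

        def forward(self, x):
            # Both outputs are derived from module parameters.
            return self.main_head(x), self.aux_head(x)

    # With the usual process-group and device setup (omitted here):
    #   ddp = nn.parallel.DistributedDataParallel(TwoHeadNet().to(rank), device_ids=[rank])
    #   main_out, aux_out = ddp(torch.randn(8, 10).to(rank))
    #   loss = main_out.sum()   # aux_out never participates in the loss, so
    #   loss.backward()         # `aux_head` never receives gradients and the
    #                           # reducer reports the error above (or hangs).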

torch/nn/parallel/distributed.py

Lines changed: 10 additions & 2 deletions
@@ -197,8 +197,16 @@ class DistributedDataParallel(Module):
                                       module's ``forward`` function.
                                       Parameters that don't receive gradients as
                                       part of this graph are preemptively marked
-                                      as being ready to be reduced.
-                                      (default: ``False``)
+                                      as being ready to be reduced. Note that all
+                                      ``forward`` outputs that are derived from
+                                      module parameters must participate in
+                                      calculating loss and later the gradient
+                                      computation. If they don't, this wrapper will
+                                      hang waiting for autograd to produce gradients
+                                      for those parameters. Any outputs derived from
+                                      module parameters that are otherwise unused can
+                                      be detached from the autograd graph using
+                                      ``torch.Tensor.detach``. (default: ``False``)
         check_reduction: when setting to ``True``, it enables DistributedDataParallel
                          to automatically check if the previous iteration's
                          backward reductions were successfully issued at the
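A minimal sketch of the remedies this note describes, continuing the hypothetical TwoHeadNet from the reducer.cpp section: detach the output that never enters the loss, and enable unused parameter detection when wrapping the module.

    import torch.nn as nn

    class TwoHeadNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.main_head = nn.Linear(10, 10)
            self.aux_head = nn.Linear(10, 10)

        def forward(self, x):
            main_out = self.main_head(x)
            # `aux_out` is returned for logging only and never enters the loss;
            # detaching it (``torch.Tensor.detach``) removes it from the autograd
            # graph, so unused parameter detection can pre-mark `aux_head` as
            # ready instead of waiting for gradients that never arrive.
            aux_out = self.aux_head(x).detach()
            return main_out, aux_out

    # ddp = nn.parallel.DistributedDataParallel(
    #     TwoHeadNet().to(rank), device_ids=[rank], find_unused_parameters=True)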
