@@ -169,7 +169,9 @@ Here's an ordinary example of an L2 penalty without gradient scaling or autocast
         loss = loss_fn(output, target)
 
         # Creates gradients
-        grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)
+        grad_params = torch.autograd.grad(outputs=loss,
+                                          inputs=model.parameters(),
+                                          create_graph=True)
 
         # Computes the penalty term and adds it to the loss
         grad_norm = 0
@@ -184,8 +186,8 @@ Here's an ordinary example of an L2 penalty without gradient scaling or autocast
 
         optimizer.step()
 
-To implement a gradient penalty *with* gradient scaling, the loss passed to
-:func:`torch.autograd.grad` should be scaled. The resulting gradients
+To implement a gradient penalty *with* gradient scaling, the ``outputs`` Tensor(s)
+passed to :func:`torch.autograd.grad` should be scaled. The resulting gradients
 will therefore be scaled, and should be unscaled before being combined to create the
 penalty value.
@@ -203,8 +205,10 @@ Here's how that looks for the same L2 penalty::
             output = model(input)
             loss = loss_fn(output, target)
 
-        # Scales the loss for autograd.grad's backward pass, resulting in scaled grad_params
-        scaled_grad_params = torch.autograd.grad(scaler.scale(loss), model.parameters(), create_graph=True)
+        # Scales the loss for autograd.grad's backward pass, producing scaled_grad_params
+        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
+                                                 inputs=model.parameters(),
+                                                 create_graph=True)
 
         # Creates unscaled grad_params before computing the penalty. scaled_grad_params are
         # not owned by any optimizer, so ordinary division is used instead of scaler.unscale_:
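
For context, here is a minimal end-to-end sketch of the scaled gradient-penalty pattern this diff documents, combining the scaled ``outputs`` passed to :func:`torch.autograd.grad` with the unscale-by-division step described in the second hunk. The toy model, data, and hyperparameters are illustrative placeholders, not part of the patch::

    import torch
    from torch.cuda.amp import GradScaler, autocast

    # Placeholder setup (assumed for illustration; not from the patch)
    model = torch.nn.Linear(16, 16).cuda()
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    data = [(torch.randn(8, 16, device="cuda"), torch.randn(8, 16, device="cuda"))
            for _ in range(10)]

    scaler = GradScaler()

    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales the loss for autograd.grad's backward pass, producing scaled gradients
        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                                 inputs=model.parameters(),
                                                 create_graph=True)

        # Unscales by ordinary division: these gradients are not owned by any
        # optimizer, so scaler.unscale_ does not apply here
        inv_scale = 1. / scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]

        # Computes the penalty term under autocast and adds it to the loss
        with autocast():
            grad_norm = 0
            for grad in grad_params:
                grad_norm += grad.pow(2).sum()
            grad_norm = grad_norm.sqrt()
            loss = loss + grad_norm

        # Scaled backward pass and the usual scaler-driven step
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

Plain division by ``scaler.get_scale()`` is used because ``scaler.unscale_`` only operates on gradients owned by an optimizer.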