
Commit 89c2b50

Stonesjtu authored and facebook-github-bot committed
Grad clip for parameters on different devices (#9302)
Summary: I'm trying to write a multi-GPU network by pipelining some layers onto different GPUs. However, the current gradient clipping requires all parameters to live on the same device. With this change the scalar calculation is performed on the CPU, which reduces CUDA launch overhead but introduces extra device-to-host transfers. No performance regression is observed when running the following snippet:

```python
import time
import torch

module = torch.nn.Sequential(
    torch.nn.LSTM(1024, 1024),
    torch.nn.LSTM(256, 256),
    torch.nn.Linear(100, 10000),
).cuda()

# Warm up once, then time 1000 clipping calls.
torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()

start = time.time()
for _ in range(1000):
    torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()
# Total seconds over 1000 iterations equals milliseconds per clip.
time_elapse = time.time() - start
print('{} ms per clip'.format(time_elapse))
```

Pull Request resolved: #9302

Differential Revision: D8781551

Pulled By: soumith

fbshipit-source-id: 9d76d01fe0531927f770a16b9523872a7e08e927
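To make the pipelining use case from the summary concrete, here is a minimal sketch of clipping gradients that live on different devices (illustrative code, not part of the commit; it assumes a machine with at least two CUDA devices, and the layer sizes and `cuda:0`/`cuda:1` placement are made up):

```python
import torch
import torch.nn as nn

# Two pipeline stages pinned to different GPUs (hypothetical placement).
stage1 = nn.Linear(32, 64).to('cuda:0')
stage2 = nn.Linear(64, 8).to('cuda:1')

x = torch.randn(16, 32, device='cuda:0')
out = stage2(stage1(x).to('cuda:1'))
out.sum().backward()

# With this change, a single clip call can cover parameters whose gradients
# sit on different devices, because the norm is accumulated as a Python float.
params = list(stage1.parameters()) + list(stage2.parameters())
total_norm = torch.nn.utils.clip_grad_norm_(params, 1)
print(total_norm)
```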
1 parent 1597fc5 commit 89c2b50

File tree

1 file changed: +2, -2 lines


torch/nn/utils/clip_grad.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -29,12 +29,12 @@ def clip_grad_norm_(parameters, max_norm, norm_type=2):
         total_norm = 0
         for p in parameters:
             param_norm = p.grad.data.norm(norm_type)
-            total_norm += param_norm ** norm_type
+            total_norm += param_norm.item() ** norm_type
         total_norm = total_norm ** (1. / norm_type)
     clip_coef = max_norm / (total_norm + 1e-6)
     if clip_coef < 1:
         for p in parameters:
-            p.grad.data.mul_(clip_coef.item())
+            p.grad.data.mul_(clip_coef)
     return total_norm

```
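The root cause addressed by the hunk above, shown as a standalone sketch (illustrative code, not from the commit; assumes two CUDA devices): accumulating 0-dim norm tensors ties the running total to the first gradient's device, while reducing each norm to a Python float with `.item()` keeps the accumulation device-agnostic at the cost of one host transfer per parameter.

```python
import torch

grads = [torch.randn(10, device='cuda:0'), torch.randn(10, device='cuda:1')]

# Old behaviour: `total` becomes a 0-dim tensor on cuda:0 after the first
# addition, so adding a norm computed on cuda:1 raises a device-mismatch error.
# total = 0
# for g in grads:
#     total += g.norm(2) ** 2

# New behaviour: .item() copies each scalar norm to the host, so the running
# sum stays a plain Python float no matter where each gradient lives.
total = 0.0
for g in grads:
    total += g.norm(2).item() ** 2
total_norm = total ** 0.5
print(total_norm)
```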
