
Commit 89c2b50

Stonesjtu authored and facebook-github-bot committed
Grad clip for parameters on different devices (#9302)
Summary: I'm trying to write a multi-GPU network by pipelining some layers onto different GPUs. However, the current gradient clipping requires all parameters to live on the same device. With this change the scalar calculation is performed on the CPU, which reduces CUDA launch overhead but introduces extra device-to-host transfers. No performance regression is observed when running the following snippet:

```python
import time
import torch

module = torch.nn.Sequential(
    torch.nn.LSTM(1024, 1024),
    torch.nn.LSTM(256, 256),
    torch.nn.Linear(100, 10000),
).cuda()

# Warm up once, then time 1000 clipping calls.
torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()

start = time.time()
for _ in range(1000):
    torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()
# Total seconds over 1000 iterations equals milliseconds per clip.
time_elapse = time.time() - start
print('{} ms per clip'.format(time_elapse))
```

Pull Request resolved: #9302

Differential Revision: D8781551

Pulled By: soumith

fbshipit-source-id: 9d76d01fe0531927f770a16b9523872a7e08e927
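To make the pipelining use case from the summary concrete, here is a minimal sketch of clipping gradients that live on different devices (illustrative code, not part of the commit; it assumes a machine with at least two CUDA devices, and the layer sizes and `cuda:0`/`cuda:1` placement are made up):

```python
import torch
import torch.nn as nn

# Two pipeline stages pinned to different GPUs (hypothetical placement).
stage1 = nn.Linear(32, 64).to('cuda:0')
stage2 = nn.Linear(64, 8).to('cuda:1')

x = torch.randn(16, 32, device='cuda:0')
out = stage2(stage1(x).to('cuda:1'))
out.sum().backward()

# With this change, a single clip call can cover parameters whose gradients
# sit on different devices, because the norm is accumulated as a Python float.
params = list(stage1.parameters()) + list(stage2.parameters())
total_norm = torch.nn.utils.clip_grad_norm_(params, 1)
print(total_norm)
```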
1 parent 1597fc5 commit 89c2b50

File tree

1 file changed: +2, -2 lines


torch/nn/utils/clip_grad.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -29,12 +29,12 @@ def clip_grad_norm_(parameters, max_norm, norm_type=2):
         total_norm = 0
         for p in parameters:
             param_norm = p.grad.data.norm(norm_type)
-            total_norm += param_norm ** norm_type
+            total_norm += param_norm.item() ** norm_type
         total_norm = total_norm ** (1. / norm_type)
     clip_coef = max_norm / (total_norm + 1e-6)
     if clip_coef < 1:
         for p in parameters:
-            p.grad.data.mul_(clip_coef.item())
+            p.grad.data.mul_(clip_coef)
     return total_norm

```
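The root cause addressed by the hunk above, shown as a standalone sketch (illustrative code, not from the commit; assumes two CUDA devices): accumulating 0-dim norm tensors ties the running total to the first gradient's device, while reducing each norm to a Python float with `.item()` keeps the accumulation device-agnostic at the cost of one host transfer per parameter.

```python
import torch

grads = [torch.randn(10, device='cuda:0'), torch.randn(10, device='cuda:1')]

# Old behaviour: `total` becomes a 0-dim tensor on cuda:0 after the first
# addition, so adding a norm computed on cuda:1 raises a device-mismatch error.
# total = 0
# for g in grads:
#     total += g.norm(2) ** 2

# New behaviour: .item() copies each scalar norm to the host, so the running
# sum stays a plain Python float no matter where each gradient lives.
total = 0.0
for g in grads:
    total += g.norm(2).item() ** 2
total_norm = total ** 0.5
print(total_norm)
```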
