Commit 89c2b50
Grad clip for parameters on different devices (#9302)
Summary:
I'm trying to write a multi-GPU network by pipelining some layers onto different GPUs. However, the current gradient clipping requires all the parameters to be located on the same device.
With this change, the scalar accumulation of the total norm is performed on the CPU, which reduces CUDA launch overhead but introduces extra data transfers.
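For illustration, here is a minimal sketch of that approach (the function name is hypothetical and this is not the actual patch): each gradient's norm is pulled to the CPU as a Python float via `.item()`, so the accumulation no longer cares which device each parameter lives on.
```python
import torch

def clip_grad_norm_any_device(parameters, max_norm, norm_type=2.0):
    """Sketch: device-agnostic gradient clipping via CPU scalar accumulation."""
    parameters = [p for p in parameters if p.grad is not None]
    total_norm = 0.0
    for p in parameters:
        # .item() copies the scalar norm to the CPU -- this is the extra
        # data transfer mentioned above, but it makes the sum device-agnostic.
        total_norm += p.grad.detach().norm(norm_type).item() ** norm_type
    total_norm = total_norm ** (1.0 / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in parameters:
            # The in-place scaling happens on each gradient's own device.
            p.grad.detach().mul_(clip_coef)
    return total_norm
```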
No performance regression was observed when running the following benchmark snippet:
```python
import time

import torch

module = torch.nn.Sequential(
    torch.nn.LSTM(1024, 1024),
    torch.nn.LSTM(256, 256),
    torch.nn.Linear(100, 10000),
).cuda()

# Warm-up call before timing.
torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()

start = time.time()
for _ in range(1000):
    torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()

# Total seconds for 1000 iterations is numerically equal to
# milliseconds per single clip call.
time_elapse = time.time() - start
print('{} ms per clip'.format(time_elapse))
```
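As a usage illustration (not part of the original snippet), parameters pipelined across two GPUs can now be clipped with a single call; the layer sizes and device ids below are made up and assume at least two CUDA devices are available:
```python
import torch

# Hypothetical pipelined model with layers on different devices.
encoder = torch.nn.Linear(1024, 1024).to('cuda:0')
decoder = torch.nn.Linear(1024, 10).to('cuda:1')

x = torch.randn(8, 1024, device='cuda:0')
loss = decoder(encoder(x).to('cuda:1')).sum()
loss.backward()

# One call clips gradients that live on cuda:0 and cuda:1 together.
params = list(encoder.parameters()) + list(decoder.parameters())
total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
print(total_norm)
```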
Pull Request resolved: #9302
Differential Revision: D8781551
Pulled By: soumith
fbshipit-source-id: 9d76d01fe0531927f770a16b9523872a7e08e9271
1 file changed: 2 additions and 2 deletions (file lines 32 and 37).