
Conversation

@Stonesjtu
Contributor

I'm trying to write a multi-GPU network by pipelining some layers onto different GPUs. However, the current gradient clipping requires all the parameters to be on the same device.

The CUDA launch overhead is reduced since the scalar calculation is performed on the CPU, but this introduces extra data transfers.
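For illustration, the approach boils down to something like the following sketch (the function name and details are illustrative, not the exact diff): each gradient's norm is moved to the CPU before the reduction, so parameters spread across several GPUs can be clipped together.

```python
import torch


def clip_grads_across_devices(parameters, max_norm, norm_type=2.0):
    # Illustrative sketch only, not the actual patch: reduce per-parameter
    # gradient norms on the CPU so gradients may live on different devices.
    parameters = [p for p in parameters if p.grad is not None]
    if not parameters:
        return torch.tensor(0.0)
    # One small device-to-host copy per parameter, but no assumption that
    # all gradients share a device.
    norms = torch.stack([p.grad.detach().norm(norm_type).cpu() for p in parameters])
    total_norm = norms.norm(norm_type)
    clip_coef = float(max_norm) / (float(total_norm) + 1e-6)
    if clip_coef < 1:
        for p in parameters:
            p.grad.detach().mul_(clip_coef)
    return total_norm
```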

No performance regression is observed when running the following snippet:

import time

import torch

module = torch.nn.Sequential(
    torch.nn.LSTM(1024, 1024),
    torch.nn.LSTM(256, 256),
    torch.nn.Linear(100, 10000),
).cuda()

# warming-up
torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()
start = time.time()
for _ in range(1000):
    torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()
time_elapse = time.time() - start
# total elapsed seconds over 1000 iterations is numerically equal to ms per clip
print('{} ms per clip'.format(time_elapse))


@facebook-github-bot left a comment


@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@soumith
Contributor

soumith commented Jul 10, 2018

thank you, this looks good!

@apaszke
Contributor

apaszke commented Jul 10, 2018

Wouldn’t it be much faster to just use a defaultdict to accumulate norms on different devices and only then transfer them all to CPU? That would at least cover the common case of all params on a single GPU.
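Roughly along these lines (just a sketch, the names are illustrative):

```python
from collections import defaultdict

import torch


def clip_grads_per_device(parameters, max_norm):
    # Sketch of the suggestion: accumulate squared norms per device first,
    # then move only one scalar per device to the CPU.
    parameters = [p for p in parameters if p.grad is not None]
    per_device = defaultdict(float)
    for p in parameters:
        g = p.grad.detach()
        per_device[g.device] = per_device[g.device] + g.norm(2) ** 2
    # One device-to-host copy per device instead of one per parameter.
    total_norm = sum(float(n) for n in per_device.values()) ** 0.5
    clip_coef = float(max_norm) / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in parameters:
            p.grad.detach().mul_(clip_coef)
    return total_norm
```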

@Stonesjtu
Contributor Author

@apaszke I've thought about that, but counter-intuitively, in my environment the scalar addition on a single device does not run as fast as expected.

My environment:
1080 + Xeon E5-2620 v4.
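A rough micro-benchmark sketch of that comparison (sizes and iteration counts are arbitrary, not measurements from this thread):

```python
import time

import torch

# Compare accumulating squared norms on the GPU against pulling each
# scalar to the CPU first.
grads = [torch.randn(1024, 1024, device='cuda') for _ in range(50)]


def accumulate_on_gpu():
    total = torch.zeros((), device='cuda')
    for g in grads:
        total = total + g.norm(2) ** 2
    return float(total)


def accumulate_on_cpu():
    total = 0.0
    for g in grads:
        total += float(g.norm(2) ** 2)
    return total


for fn in (accumulate_on_gpu, accumulate_on_cpu):
    fn()  # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        fn()
    torch.cuda.synchronize()
    print('{}: {:.3f} s for 100 runs'.format(fn.__name__, time.time() - start))
```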

goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Pull Request resolved: pytorch#9302

Differential Revision: D8781551

Pulled By: soumith

fbshipit-source-id: 9d76d01fe0531927f770a16b9523872a7e08e927
