
torch.nn.DataParallel causes incorrect gradients #23938

@xiaoguai0992

Bug Report

Issue description

I have a model with two nn.Conv2d modules, and only the first one is used in forward().

After calling loss.backward(), the gradients of all parameters of the second Conv2d (the unused one) should be None.

Without nn.DataParallel, I got the correct result (conv2.weight.grad is None).

However, with nn.DataParallel, conv2.weight.grad is a zero tensor instead of None. As a result, if I run optimizer.step() after the backward pass, weight_decay and momentum are applied to the unused parameters, which causes unexpected results. I would expect the gradients of unused parameters to stay None instead of becoming zero tensors.

I have a temporary workaround, but it may cause other problems when a gradient really is all zeros:

loss.backward()
for p in model.parameters():
    # Treat an all-zero gradient as "parameter not used in forward"
    # and reset it to None so the optimizer skips it entirely.
    if p.grad is not None and torch.sum(torch.abs(p.grad)) == 0.0:
        p.grad = None
optimizer.step()

So why does this problem occur, and what is the correct way to fix it?
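
A less fragile interim alternative than the zero-gradient check above would be to hand only the parameters that are actually used in forward() to the optimizer, so weight_decay and momentum never touch conv2 at all. A sketch (the 'module.conv2' prefix and the SGD settings are illustrative and assume the DataParallel-wrapped two-conv model):

# Register only the parameters that are actually used in forward();
# the unused conv2 parameters are then never seen by the optimizer.
used_params = [p for name, p in model.named_parameters()
               if not name.startswith('module.conv2')]
optimizer = torch.optim.SGD(used_params, lr=0.1, momentum=0.9,
                            weight_decay=1e-4)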

Code example

See https://gist.github.com/xiaoguai0992/db8742c3fa7a5e02be36e64180693752

The output of the code is:

Testing non-dataparallel.
conv1.weight, p.grad is None = False
conv1.bias, p.grad is None = False
conv2.weight, p.grad is None = True
conv2.bias, p.grad is None = True
Testing dataparallel
module.conv1.weight, p.grad is None = False
module.conv1.bias, p.grad is None = False
module.conv2.weight, p.grad is None = False
module.conv2.bias, p.grad is None = False
Testing repaired version
module.conv1.weight, p.grad is None = False
module.conv1.bias, p.grad is None = False
module.conv2.weight, p.grad is None = True
module.conv2.bias, p.grad is None = True
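
For reference, a minimal reproducer along the same lines as the gist (not its exact script; the layer sizes, input shape, and multi-GPU assumption are illustrative) looks roughly like this:

import torch
import torch.nn as nn

class TwoConv(nn.Module):
    def __init__(self):
        super(TwoConv, self).__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(3, 8, 3, padding=1)  # never used in forward

    def forward(self, x):
        # conv2 is intentionally skipped, so its gradients should stay None
        return self.conv1(x)

def report(model, x):
    for p in model.parameters():
        p.grad = None  # start from a clean slate
    model(x).sum().backward()
    for name, p in model.named_parameters():
        print('{}, p.grad is None = {}'.format(name, p.grad is None))

x = torch.randn(8, 3, 32, 32).cuda()

print('Testing non-dataparallel.')
report(TwoConv().cuda(), x)

print('Testing dataparallel')
report(nn.DataParallel(TwoConv().cuda()), x)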

System Info

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti

Nvidia driver version: 418.40.04
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.16.2
[pip3] torch==1.1.0
[pip3] torchvision==0.2.2.post3
[conda] Could not collect


Labels

module: autograd (Related to torch.autograd and the autograd engine in general)
oncall: distributed (Add this issue/PR to the distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
