
@ssnl (Collaborator) commented Jan 23, 2018

output_nr is not incremented properly in rebase_history and set_history when some tensors are undefined. This causes the autograd engine to put input tensors at the wrong indices in the InputBuffer. See the following extremely simple double backward example that reproduces the bug:

import torch
import torch.nn as nn
from torch.autograd import Variable, grad

conv = nn.Conv2d(1, 10, 5)
input = Variable(torch.randn(1,1,32,32))
loss1 = conv(input).sum()
grad_bias, = grad(loss1, conv.bias, create_graph=True)
loss2 = grad_bias.sum()
loss2.backward()

Because input doesn't require gradient, the ggW and ggb terms are assigned to incorrect indices, and the backward pass fails with this confusing error message:

RuntimeError: Expected 1-dimensional input for 1-dimensional weight [10], but got 
input of size [1, 1, 32, 32] instead

Why this is not detected by our tests: our double backward tests usually set all parameters to requires_grad=True, so the case where a backward function call returns an undefined tensor is never exercised.

An issue has been submitted on testing with more diverse configurations: #4813.

Thanks @ezyang for helping me find the cause.

Relevant forum post: https://discuss.pytorch.org/t/autograd-grad-dimension-error/12083/8

@ezyang (Contributor) commented Jan 23, 2018

Sorry about miscommunicating; can we get a hard-coded test for this particular regression? (@apaszke, you think this would be good to have, right?)
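
For reference, a minimal sketch of what such a hard-coded regression test could look like, reusing the repro from the PR description; the test name is an assumption, not the actual test added in this PR:

import torch
import torch.nn as nn
from torch.autograd import Variable, grad

def test_conv_double_backward_no_input_grad():
    conv = nn.Conv2d(1, 10, 5)
    # input deliberately does not require grad, so the conv backward
    # produces an undefined grad_input tensor
    input = Variable(torch.randn(1, 1, 32, 32))
    loss1 = conv(input).sum()
    grad_bias, = grad(loss1, conv.bias, create_graph=True)
    # the regression: this double backward used to raise the RuntimeError above
    grad_bias.sum().backward()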

@apaszke (Contributor) commented Jan 23, 2018

Ouch. It would be good, but testing all configurations can make the tests a lot slower (2^{number of inputs} is quite a lot) 😕
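
For scale, a rough sketch of what exhaustively sweeping the requires_grad configurations would mean; check_double_backward is a hypothetical callback standing in for whatever double-backward check the test suite runs:

import itertools
import torch
from torch.autograd import Variable

def sweep_requires_grad_configs(shapes, check_double_backward):
    base = [torch.randn(*shape) for shape in shapes]
    # every requires_grad assignment over the inputs: 2^len(shapes) runs
    for flags in itertools.product([False, True], repeat=len(base)):
        if not any(flags):
            continue  # autograd needs at least one input requiring grad
        inputs = [Variable(t.clone(), requires_grad=flag)
                  for t, flag in zip(base, flags)]
        check_double_backward(*inputs)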

@ssnl (Collaborator, Author) commented Jan 23, 2018

While updating an existing test to cover this case, I found another bug:

>>> x = Variable(torch.randn(1, 1, 2, 2), requires_grad=True)
>>> w = Variable(torch.randn(1, 1, 2, 2), requires_grad=False)
>>> b = Variable(torch.randn(1), requires_grad=True)
>>> F.conv2d(x, w, b).backward()
>>> b.grad
Variable containing:
 0
[torch.FloatTensor of size 1]

>>> w.requires_grad = True
>>> F.conv2d(x, w, b).backward()
>>> b.grad
Variable containing:
 1
[torch.FloatTensor of size 1]

This is caused by the check here covering only grad_weight_: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/nn_parse.py#L331

I attempted a fix by including all parameters in the check, but grad_weight is assumed to exist in THNN. I will find a proper fix tomorrow.
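
To illustrate the logic (hypothetical pseudocode only; the names below are invented, not the actual nn_parse.py source or its generated output): the old guard consults only grad_weight, so a frozen weight silently skips the bias accumulation too, while the intended guard fires whenever any parameter gradient is requested:

def accumulate_parameter_grads(grads, acc_grad_parameters):
    # grads: dict mapping "grad_weight"/"grad_bias" to a tensor or None.
    # acc_grad_parameters: stand-in for the generated THNN call.

    # Old behaviour (the bug): only grad_weight is consulted.
    # if grads["grad_weight"] is not None:
    #     acc_grad_parameters(grads)

    # Intended behaviour: run if ANY parameter gradient is requested;
    # the THNN kernels must then tolerate grad_weight being None/NULL.
    if any(g is not None for g in grads.values()):
        acc_grad_parameters(grads)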

@ezyang (Contributor) commented Jan 24, 2018

Ick. THNN bindings: the gift that keeps giving X(

…meters if any param (not just weight) requires grad in parse_nn.py
@ssnl force-pushed the fix_output_nr branch 7 times, most recently from ee63332 to 5e81cdd on January 24, 2018 23:37
@ssnl force-pushed the fix_output_nr branch 2 times, most recently from 4adf61b to 6f30587 on January 24, 2018 23:56
@ssnl (Collaborator, Author) commented Jan 25, 2018

This is ready for review now! There are changes to all THNN functions that update more than one parameter in accGradParameters; they are all convolution functions. The changes to each file are mainly the following (a rough end-to-end check follows the list):

  1. add an additional indicator argument int weight_nullable to shapeCheck, which is only 1 (true) in accGradParameters.
  2. change accGradParameters so it works when gradWeight is NULL.
  3. rename the confusing int batch variable, which is really a boolean indicating whether the input is in batch mode, to int is_batch.
  4. rename some helpers that should follow the new* naming convention but do not.
  5. move the contiguity checks out of shapeCheck and avoid a few unnecessary isContiguous and newContiguous calls.
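
A rough end-to-end check of the behaviour these changes enable, reusing the example from the earlier comment and assuming that a frozen weight is what routes a NULL gradWeight into accGradParameters:

import torch
import torch.nn.functional as F
from torch.autograd import Variable

x = Variable(torch.randn(1, 1, 2, 2), requires_grad=True)
w = Variable(torch.randn(1, 1, 2, 2), requires_grad=False)  # frozen weight
b = Variable(torch.randn(1), requires_grad=True)

F.conv2d(x, w, b).backward()
print(b.grad)  # expected: 1, matching the w.requires_grad=True case above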

"bias tensor has to be contiguous");
input = THCTensor_(newContiguous)(state, input);
weight = THCTensor_(newContiguous)(state, weight);
bias = bias ? THCTensor_(newContiguous)(state, bias) : bias;


} else if (gradBias != NULL) {
nOutputPlane = THCTensor_(size)(state, gradBias, 0);
} else {
return;


"expected for weight, but got: %s");
THArgCheck(THCTensor_(isContiguous)(state, weight), 4,
"weight tensor has to be contiguous");
THArgCheck(!bias || THCTensor_(isContiguous)(state, bias), 5,


@apaszke (Contributor) left a comment


Not a full review. Just noticed a few things

} else if (dim == 5) {
return at::thnn_conv_transpose3d(
-    input, weight, bias,
+    input, weight, kernel_size, bias,


"2D or 4D weight tensor expected, but got: %s");
if (bias != NULL) {
THCUNN_check_dim_size(state, bias, 1, 0, weight->size[0]);
}


THCUNN_assertSameGPU(state, 5, input, gradOutput, gradWeight, gradBias, columns, ones);
if (gradWeight) {
THArgCheck(THCTensor_(isContiguous)(state, gradWeight), 4, "gradWeight needs to be contiguous");
}



int freeWeight = 0;
- if (weight->nDimension == 4) {
+ if (weight && weight->nDimension == 4) {


@ssnl (Collaborator, Author) commented Jan 31, 2018

@pytorchbot retest this please.

@houseroad (Member) commented

@onnxbot add to whitelist

@ssnl (Collaborator, Author) commented Feb 1, 2018

@pytorchbot retest this please

2 similar comments
@ssnl (Collaborator, Author) commented Feb 1, 2018

@pytorchbot retest this please

@ssnl (Collaborator, Author) commented Feb 1, 2018

@pytorchbot retest this please

@ssnl (Collaborator, Author) commented Feb 2, 2018

I would really appreciate it if someone could review this. There are a lot of lines changed, but many of them are similar changes repeated across THNN files. I also split the work into multiple commits to make it clearer. If needed, let me know how I can make this easier to review. :)

@soumith (Contributor) commented Feb 2, 2018

looking

@soumith soumith merged commit f23feca into pytorch:master Feb 2, 2018
@ssnl ssnl deleted the fix_output_nr branch February 2, 2018 18:04
facebook-github-bot pushed a commit that referenced this pull request Aug 19, 2020