
@ssnl (Collaborator) commented Jan 23, 2018

output_nr is not incremented properly in rebase_history and set_history when some tensors are undefined. This causes the autograd engine to put input tensors at the wrong indices in the InputBuffer. See the following extremely simple double backward example that reproduces the bug:

import torch
import torch.nn as nn
from torch.autograd import Variable, grad

conv = nn.Conv2d(1, 10, 5)
input = Variable(torch.randn(1,1,32,32))
loss1 = conv(input).sum()
grad_bias, = grad(loss1, conv.bias, create_graph=True)
loss2 = grad_bias.sum()
loss2.backward()

Because input doesn't require gradient, the ggW and ggb terms are assigned to incorrect indices, and the backward pass fails with this confusing error message:

RuntimeError: Expected 1-dimensional input for 1-dimensional weight [10], but got 
input of size [1, 1, 32, 32] instead

Why this is not detected by our tests: our double backward tests usually set all parameters to requires_grad=True, so the case where a backward function call returns an undefined tensor is never exercised.

An issue has been submitted on testing with more diverse configurations: #4813.

Thanks @ezyang for helping me find the cause.

Relevant forum post: https://discuss.pytorch.org/t/autograd-grad-dimension-error/12083/8

@ezyang (Contributor) commented Jan 23, 2018

Sorry about miscommunicating; can we get a hard-coded test for this particular regression? (@apaszke, you think this would be good to have, right?)
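
For reference, a minimal sketch of what such a hard-coded regression test could look like, reusing the repro from the PR description; the test name is an assumption, not the actual test added in this PR:

import torch
import torch.nn as nn
from torch.autograd import Variable, grad

def test_conv_double_backward_no_input_grad():
    conv = nn.Conv2d(1, 10, 5)
    # input deliberately does not require grad, so the conv backward
    # produces an undefined grad_input tensor
    input = Variable(torch.randn(1, 1, 32, 32))
    loss1 = conv(input).sum()
    grad_bias, = grad(loss1, conv.bias, create_graph=True)
    # the regression: this double backward used to raise the RuntimeError above
    grad_bias.sum().backward()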

@apaszke (Contributor) commented Jan 23, 2018

Ouch. It would be good, but testing all configurations can make the tests a lot slower (2^{number of inputs} is quite a lot) 😕
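
For scale, a rough sketch of what exhaustively sweeping the requires_grad configurations would mean; check_double_backward is a hypothetical callback standing in for whatever double-backward check the test suite runs:

import itertools
import torch
from torch.autograd import Variable

def sweep_requires_grad_configs(shapes, check_double_backward):
    base = [torch.randn(*shape) for shape in shapes]
    # every requires_grad assignment over the inputs: 2^len(shapes) runs
    for flags in itertools.product([False, True], repeat=len(base)):
        if not any(flags):
            continue  # autograd needs at least one input requiring grad
        inputs = [Variable(t.clone(), requires_grad=flag)
                  for t, flag in zip(base, flags)]
        check_double_backward(*inputs)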

@ssnl (Collaborator, Author) commented Jan 23, 2018

While updating an existing test to cover this case, I found another bug:

>>> x = Variable(torch.randn(1, 1, 2, 2), requires_grad=True)
>>> w = Variable(torch.randn(1, 1, 2, 2), requires_grad=False)
>>> b = Variable(torch.randn(1), requires_grad=True)
>>> F.conv2d(x, w, b).backward()
>>> b.grad
Variable containing:
 0
[torch.FloatTensor of size 1]

>>> w.requires_grad = True
>>> F.conv2d(x, w, b).backward()
>>> b.grad
Variable containing:
 1
[torch.FloatTensor of size 1]

This is caused by the check here covering only grad_weight_: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/nn_parse.py#L331

I attempted a fix by including all parameters in the check, but grad_weight is assumed to exist in THNN. I will find a proper fix tomorrow.
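
To illustrate the logic (hypothetical pseudocode only; the names below are invented, not the actual nn_parse.py source or its generated output): the old guard consults only grad_weight, so a frozen weight silently skips the bias accumulation too, while the intended guard fires whenever any parameter gradient is requested:

def accumulate_parameter_grads(grads, acc_grad_parameters):
    # grads: dict mapping "grad_weight"/"grad_bias" to a tensor or None.
    # acc_grad_parameters: stand-in for the generated THNN call.

    # Old behaviour (the bug): only grad_weight is consulted.
    # if grads["grad_weight"] is not None:
    #     acc_grad_parameters(grads)

    # Intended behaviour: run if ANY parameter gradient is requested;
    # the THNN kernels must then tolerate grad_weight being None/NULL.
    if any(g is not None for g in grads.values()):
        acc_grad_parameters(grads)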

@ezyang (Contributor) commented Jan 24, 2018

Ick. THNN bindings: the gift that keeps giving X(

…meters if any param (not just weight) requires grad in parse_nn.py
@ssnl force-pushed the fix_output_nr branch 7 times, most recently from ee63332 to 5e81cdd on January 24, 2018 23:37
@ssnl force-pushed the fix_output_nr branch 2 times, most recently from 4adf61b to 6f30587 on January 24, 2018 23:56
@ssnl (Collaborator, Author) commented Jan 25, 2018

This is ready for review now! There are changes to all THNN functions that update more than one parameter in accGradParameters; they are all convolution functions. The changes to each file are mainly the following (a rough end-to-end check follows the list):

  1. add an additional indicator argument int weight_nullable to shapeCheck, which is only 1 (true) in accGradParameters.
  2. change accGradParameters so it works when gradWeight is NULL.
  3. rename the confusing int batch variable, which is really a boolean indicating whether the input is in batch mode, to int is_batch.
  4. rename some helpers that should follow the new* naming convention but do not.
  5. move the contiguity checks out of shapeCheck and avoid a few unnecessary isContiguous and newContiguous calls.
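
A rough end-to-end check of the behaviour these changes enable, reusing the example from the earlier comment and assuming that a frozen weight is what routes a NULL gradWeight into accGradParameters:

import torch
import torch.nn.functional as F
from torch.autograd import Variable

x = Variable(torch.randn(1, 1, 2, 2), requires_grad=True)
w = Variable(torch.randn(1, 1, 2, 2), requires_grad=False)  # frozen weight
b = Variable(torch.randn(1), requires_grad=True)

F.conv2d(x, w, b).backward()
print(b.grad)  # expected: 1, matching the w.requires_grad=True case above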

"bias tensor has to be contiguous");
input = THCTensor_(newContiguous)(state, input);
weight = THCTensor_(newContiguous)(state, weight);
bias = bias ? THCTensor_(newContiguous)(state, bias) : bias;


} else if (gradBias != NULL) {
nOutputPlane = THCTensor_(size)(state, gradBias, 0);
} else {
return;


"expected for weight, but got: %s");
THArgCheck(THCTensor_(isContiguous)(state, weight), 4,
"weight tensor has to be contiguous");
THArgCheck(!bias || THCTensor_(isContiguous)(state, bias), 5,


@apaszke (Contributor) left a comment


Not a full review. Just noticed a few things

} else if (dim == 5) {
return at::thnn_conv_transpose3d(
-    input, weight, bias,
+    input, weight, kernel_size, bias,


"2D or 4D weight tensor expected, but got: %s");
if (bias != NULL) {
THCUNN_check_dim_size(state, bias, 1, 0, weight->size[0]);
}


THCUNN_assertSameGPU(state, 5, input, gradOutput, gradWeight, gradBias, columns, ones);
if (gradWeight) {
THArgCheck(THCTensor_(isContiguous)(state, gradWeight), 4, "gradWeight needs to be contiguous");
}



int freeWeight = 0;
- if (weight->nDimension == 4) {
+ if (weight && weight->nDimension == 4) {


@ssnl (Collaborator, Author) commented Jan 31, 2018

@pytorchbot retest this please.

@houseroad (Member) commented

@onnxbot add to whitelist

@ssnl (Collaborator, Author) commented Feb 1, 2018

@pytorchbot retest this please

2 similar comments
@ssnl (Collaborator, Author) commented Feb 1, 2018

@pytorchbot retest this please

@ssnl (Collaborator, Author) commented Feb 1, 2018

@pytorchbot retest this please

@ssnl (Collaborator, Author) commented Feb 2, 2018

I would really appreciate it if someone could review this. There are a lot of lines changed, but many of them are similar changes repeated across THNN files. I also split the work into multiple commits to make it clearer. If needed, let me know how I can make this easier to review. :)

@soumith (Contributor) commented Feb 2, 2018

looking

@soumith soumith merged commit f23feca into pytorch:master Feb 2, 2018
@ssnl ssnl deleted the fix_output_nr branch February 2, 2018 18:04
facebook-github-bot pushed a commit that referenced this pull request Aug 19, 2020