
Second order gradient cuda error #20465

@michaelklachko

Description

🐛 Bug

Training a convnet on the GPU: when penalizing gradient growth (i.e. backpropagating a gradient of a gradient), the following error occurs:

/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 3]: block: [40,0,0], thread: [31,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/generic/THCTensorScatterGather.cu line=71 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "cudafail.py", line 47, in <module>
    grad_sum.backward(retain_graph=False)
  File "/home/michael/miniconda2/envs/pt/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/michael/miniconda2/envs/pt/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/generic/THCTensorScatterGather.cu:71
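
Because device-side asserts are raised asynchronously, the Python line shown in the traceback is not guaranteed to be the launch site of the failing kernel. One way to get a synchronous, more precise trace (not part of the original script) is to force blocking kernel launches before any CUDA work is done:

import os
# Forces every CUDA kernel launch to synchronize, so the Python traceback
# points at the actual failing call (slow; for debugging only).
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
import torch  # set the variable before any CUDA work, e.g. before importing torch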

To Reproduce

import torch
from torch import nn
from torchvision import datasets, transforms

class Net(nn.Module):
	def __init__(self):
		super(Net, self).__init__()
		self.conv1 = nn.Conv2d(3, 20, kernel_size=5, bias=False)
		self.conv2 = nn.Conv2d(20, 40, kernel_size=5, bias=False)
		self.linear1 = nn.Linear(40 * 5 * 5, 300, bias=False)
		self.linear2 = nn.Linear(300, 10, bias=False)
		self.pool = nn.MaxPool2d(2, 2)
		self.relu = nn.ReLU()

	def forward(self, input):
		x = self.relu(self.pool(self.conv1(input)))
		x = self.relu(self.pool(self.conv2(x)))
		x = x.view(x.size(0), -1)
		x = self.relu(self.linear1(x))
		return self.linear2(x)

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True, num_workers=8)

model = Net().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1, momentum=0, nesterov=False)

for epoch in range(100):
	print(epoch)
	model.train()
	for i, (images, labels) in enumerate(trainloader, 0):
		images = images.cuda()
		labels = labels.cuda()
		outputs = model(images)
		loss = nn.CrossEntropyLoss()(outputs, labels)
		optimizer.zero_grad()
		loss.backward(retain_graph=True)

		grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
		grad_sum = 0
		for grad in grads:
			grad_sum += 0.1 * grad.pow(2).sum()
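		# grad_sum depends on gradients computed with create_graph=True, so the
		# backward() call below is a second (double) backward pass; per the
		# traceback above, this is where the device-side assert fires.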
		grad_sum.backward(retain_graph=False)

		optimizer.step()
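
For reference, the same double-backward pattern can be exercised without downloading CIFAR-10 by feeding random tensors to the Net class defined above (a sketch only; since the failure seems value-dependent, random data may not reproduce the assert):

import torch
from torch import nn

model = Net().cuda()  # Net as defined in the script above
optimizer = torch.optim.SGD(model.parameters(), lr=1, momentum=0, nesterov=False)

for step in range(1000):
	# Random stand-ins for CIFAR-10 batches: 64 RGB images of 32x32, 10 classes.
	images = torch.randn(64, 3, 32, 32, device='cuda')
	labels = torch.randint(0, 10, (64,), device='cuda')

	loss = nn.CrossEntropyLoss()(model(images), labels)
	optimizer.zero_grad()
	loss.backward(retain_graph=True)

	# Same gradient penalty and double backward as in the repro.
	grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
	grad_sum = sum(0.1 * g.pow(2).sum() for g in grads)
	grad_sum.backward()
	optimizer.step()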

Environment

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.6.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: TITAN X (Pascal)
GPU 1: TITAN X (Pascal)
GPU 2: TITAN X (Pascal)
GPU 3: TITAN X (Pascal)

Nvidia driver version: 418.56
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.3
/usr/local/cuda-9.0/lib64/libcudnn.so.7.2.1

Versions of relevant libraries:
[pip3] numpy==1.15.0
[pip3] torch==1.1.0
[pip3] torchfile==0.1.0
[pip3] torchvision==0.2.2
[conda] blas 1.0 mkl
[conda] cuda92 1.0 0 pytorch
[conda] mkl 2018.0.3 1
[conda] mkl_fft 1.0.4 py36h4414c95_1
[conda] mkl_random 1.0.1 py36h4414c95_1
[conda] pytorch 1.1.0 py3.6_cuda10.0.130_cudnn7.5.1_0 pytorch
[conda] torchfile 0.1.0 py_0 conda-forge
[conda] torchvision 0.2.2 py_3 pytorch

Additional context

If I change the coefficient in grad_sum += 0.1 * grad.pow(2).sum() from 0.1 to 10, the error does NOT happen (at least not within ~40 epochs); the same holds if I reduce the learning rate from 1 to 0.001.
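
The sensitivity to the penalty coefficient and to the learning rate suggests the values feeding the second backward pass may be overflowing. A possible diagnostic (not in the original script; loss, grads, i, and model refer to the variables in the training loop above) would be to check for non-finite values just before grad_sum.backward():

# Hypothetical check, placed just before grad_sum.backward() in the training loop.
if not torch.isfinite(loss):
	print('non-finite loss at step', i)
for (name, _), g in zip(model.named_parameters(), grads):
	if not torch.isfinite(g).all():
		print('non-finite grad for', name, 'at step', i)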

Labels

high priority
module: autograd - Related to torch.autograd, and the autograd engine in general
module: cuda - Related to torch.cuda, and CUDA support in general
module: double backwards - Problem is related to double backwards definition on an operator
module: nn - Related to torch.nn
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
