Description
🐛 Bug
When training a convnet on the GPU with a penalty on gradient growth (that is, backpropagating through a gradient of a gradient), the following error occurs:
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 3]: block: [40,0,0], thread: [31,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/generic/THCTensorScatterGather.cu line=71 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "cudafail.py", line 47, in <module>
    grad_sum.backward(retain_graph=False)
  File "/home/michael/miniconda2/envs/pt/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/michael/miniconda2/envs/pt/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/generic/THCTensorScatterGather.cu:71
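CUDA kernels run asynchronously, so a device-side assert can surface at a later Python call than the one that launched the failing kernel. A minimal sketch for getting a more reliable stack trace, assuming the script is free to set environment variables before CUDA is initialized, is to force synchronous kernel launches with `CUDA_LAUNCH_BLOCKING`:

```python
import os

# Must be set before the first CUDA call (safest: before importing torch),
# otherwise the setting has no effect on the already-initialized context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Import torch only after this point, then run the training loop as usual;
# the Python traceback should then point at the call that launched the
# failing kernel rather than at a later synchronization point.
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python cudafail.py`. Synchronous launches slow training down, so this is only meant for debugging.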
To Reproduce
import torch
from torch import nn
from torchvision import datasets, transforms


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 20, kernel_size=5, bias=False)
        self.conv2 = nn.Conv2d(20, 40, kernel_size=5, bias=False)
        self.linear1 = nn.Linear(40 * 5 * 5, 300, bias=False)
        self.linear2 = nn.Linear(300, 10, bias=False)
        self.pool = nn.MaxPool2d(2, 2)
        self.relu = nn.ReLU()

    def forward(self, input):
        x = self.relu(self.pool(self.conv1(input)))
        x = self.relu(self.pool(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = self.relu(self.linear1(x))
        return self.linear2(x)


transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True, num_workers=8)

model = Net().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1, momentum=0, nesterov=False)

for epoch in range(100):
    print(epoch)
    model.train()
    for i, (images, labels) in enumerate(trainloader, 0):
        images = images.cuda()
        labels = labels.cuda()
        outputs = model(images)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        optimizer.zero_grad()
        # First-order gradients of the loss are accumulated into .grad.
        loss.backward(retain_graph=True)
        # Recompute the gradients with a graph attached so they can be differentiated again.
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        # Penalty: 0.1 * sum of squared gradient entries over all parameters.
        grad_sum = 0
        for grad in grads:
            grad_sum += 0.1 * grad.pow(2).sum()
        # Double backward; this is the call that fails with the device-side assert.
        grad_sum.backward(retain_graph=False)
        optimizer.step()
Environment
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.6.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: TITAN X (Pascal)
GPU 1: TITAN X (Pascal)
GPU 2: TITAN X (Pascal)
GPU 3: TITAN X (Pascal)
Nvidia driver version: 418.56
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.3
/usr/local/cuda-9.0/lib64/libcudnn.so.7.2.1
Versions of relevant libraries:
[pip3] numpy==1.15.0
[pip3] torch==1.1.0
[pip3] torchfile==0.1.0
[pip3] torchvision==0.2.2
[conda] blas 1.0 mkl
[conda] cuda92 1.0 0 pytorch
[conda] mkl 2018.0.3 1
[conda] mkl_fft 1.0.4 py36h4414c95_1
[conda] mkl_random 1.0.1 py36h4414c95_1
[conda] pytorch 1.1.0 py3.6_cuda10.0.130_cudnn7.5.1_0 pytorch
[conda] torchfile 0.1.0 py_0 conda-forge
[conda] torchvision 0.2.2 py_3 pytorch
Additional context
If I change the coefficient in grad_sum += 0.1 * grad.pow(2).sum() from 0.1 to 10, the error does NOT happen (at least not within the first ~40 epochs). The same holds if I reduce the learning rate from 1 to 0.001.
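Because the failure depends on the learning rate and the penalty coefficient, one thing worth ruling out is that the loss or the first-order gradients become NaN or Inf before the double backward runs. This is an unconfirmed assumption, not a diagnosis. The sketch below shows such a check on a tiny CPU stand-in for the model; `first_nonfinite` and `TinyNet` are hypothetical names introduced only for illustration.

```python
import torch
from torch import nn


def first_nonfinite(named_tensors):
    """Return the name of the first tensor that contains NaN or Inf, else None."""
    for name, tensor in named_tensors:
        if not bool(torch.isfinite(tensor).all()):
            return name
    return None


class TinyNet(nn.Module):
    """Hypothetical, much smaller stand-in for Net (conv -> max-pool -> linear)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, bias=False)
        self.pool = nn.MaxPool2d(2, 2)
        self.linear = nn.Linear(4 * 3 * 3, 10, bias=False)

    def forward(self, x):
        x = torch.relu(self.pool(self.conv(x)))  # 8x8 -> 6x6 -> 3x3
        return self.linear(x.view(x.size(0), -1))


torch.manual_seed(0)
model = TinyNet()
images = torch.randn(2, 3, 8, 8)  # random data in place of CIFAR-10
labels = torch.tensor([1, 7])

loss = nn.CrossEntropyLoss()(model(images), labels)
names = [name for name, _ in model.named_parameters()]
grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)

# Check the loss and every first-order gradient before the double backward.
bad = first_nonfinite([("loss", loss)] + list(zip(names, grads)))
if bad is not None:
    raise RuntimeError("non-finite values in " + bad)
```

In the repro script, the same check could be placed just before grad_sum.backward(...), with the CUDA tensors passed in unchanged. If it fires in the iteration where the assert would otherwise trigger, that would suggest the training has diverged rather than that the indexing itself is wrong.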