Conversation

@weiyangfb
Contributor

@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb
Contributor Author

@pytorchbot retest this please

1 similar comment
@weiyangfb
Contributor Author

@pytorchbot retest this please

@weiyangfb weiyangfb added the ready for review (this tag is deprecated) All PRs are ready for review unless they are draft, WIP, or have undismissed requested changes label Aug 14, 2018
@fmassa left a comment (Member)

LGTM, but I wonder if those checks shouldn't be in the functional interface instead?

@ssnl
Collaborator

ssnl commented Aug 28, 2018

Yeah, I agree with @fmassa. I'm also wondering whether this would be better placed in ATen, now that @apaszke has moved RNNs into C++.

@weiyangfb weiyangfb removed the ready for review (this tag is deprecated) All PRs are ready for review unless they are draft, WIP, or have undismissed requested changes label Aug 29, 2018
@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch from e5d097e to 2658720 Compare August 29, 2018 20:33
@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb
Contributor Author

@pytorchbot retest this please

@weiyangfb
Contributor Author

@fmassa @ssnl I've moved the checks into ATen. Let me know if it looks reasonable.

@apaszke
Contributor

apaszke commented Sep 4, 2018

Also, it might be a better idea to push the checks to the cuDNN path, because otherwise we'll end up repeating them later anyway in the autograd code.

@weiyangfb
Contributor Author

@apaszke I moved the checks to the cuDNN path, and also kept them in the non-cuDNN path.


@apaszke left a comment

I'm confused now. What I meant is that the device checks are strictly necessary only in the cuDNN path, but what you did here is add them in both paths, which makes the cuDNN path go through the checks twice.


@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch 4 times, most recently from e617a89 to e6ce1a0 Compare September 16, 2018 07:02
@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl
Collaborator

ssnl commented Sep 17, 2018

@apaszke I think it is reasonable to check device consistency in both the cuDNN and non-cuDNN code, though.

@apaszke
Contributor

apaszke commented Sep 17, 2018

@ssnl but the devices will be checked in the native path anyway, since every single function we call will verify them.


@ngimel
Collaborator

ngimel commented Sep 17, 2018

#11680 is also related.

@ssnl
Collaborator

ssnl commented Sep 17, 2018

@apaszke Yes, I agree that the non-cuDNN path check is redundant. But it would be nice to give users a better error message. I'm fine with either, actually.
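
The trade-off being debated can be summarized as rough pseudocode (a sketch of the control flow under discussion, not the actual ATen code):

```
lstm(input, hx, params, ...):
  if use_cudnn(input):
    check_device(input, hx, params)  # needed here: cuDNN itself fails unhelpfully
    return cudnn_lstm(input, hx, params, ...)
  else:
    # redundant in principle: each native op re-checks devices on its own,
    # but a single up-front check yields a clearer error message
    check_device(input, hx, params)
    return native_lstm(input, hx, params, ...)
```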

@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch 3 times, most recently from 6ee12bb to 04491ff Compare September 17, 2018 20:44
@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@apaszke
Contributor

apaszke commented Sep 18, 2018

@ssnl note that the error message we can give from the C++ API is not very helpful anyway. The weights have very complex names in Python, and I don't think we'll be able to reproduce them easily.

@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch from 723b058 to 26e166e Compare September 18, 2018 18:15
2. add check_device() function
@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch from 26e166e to e8fff55 Compare September 18, 2018 20:03
@apaszke left a comment

Would be good to clean up check_tensors to use at::Device instead of unnecessarily comparing everything manually.

auto check_tensors = [&](const std::string& name, const Tensor& t) {
  if (!t.defined()) return;
  auto t_device = t.device();
  bool t_device_is_cuda = t_device.is_cuda();
  ...
};

for (auto p : params) {
  // if (!p.defined()) continue;
  ...
}
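
Concretely, the cleanup could look roughly like the sketch below, which compares whole device values with operator== instead of unpacking is_cuda() and the index by hand. Note that Device, Tensor, and check_tensor_device here are minimal stand-ins invented for illustration, not the actual ATen types or names:

```cpp
#include <stdexcept>
#include <string>

// Minimal stand-in for at::Device (illustration only, not the ATen type).
struct Device {
  bool cuda;   // true = CUDA, false = CPU
  int index;   // CUDA ordinal; ignored for CPU
  bool operator==(const Device& o) const {
    return cuda == o.cuda && (!cuda || index == o.index);
  }
  bool operator!=(const Device& o) const { return !(*this == o); }
};

// Minimal stand-in tensor carrying only what the check needs.
struct Tensor {
  bool is_defined;
  Device dev;
  bool defined() const { return is_defined; }
  Device device() const { return dev; }
};

// One whole-device comparison replaces the manual bookkeeping above.
void check_tensor_device(const Device& input_device, const std::string& name,
                         const Tensor& t) {
  if (!t.defined()) return;  // undefined tensors (e.g. absent bias) are skipped
  if (t.device() != input_device) {
    throw std::runtime_error(
        "Input and " + name + " tensors are not at the same device");
  }
}
```

With this shape, the loop over params reduces to calling check_tensor_device(input_device, "parameter", p) for each p.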

@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 19, 2018
Summary:
- fixes #9534
Pull Request resolved: pytorch/pytorch#10185

Differential Revision: D9141222

Pulled By: weiyangfb

fbshipit-source-id: bb652e42cc15917019df080d6bce2926b18f3476
@ezyang ezyang added the merged label Jun 26, 2019
Successfully merging this pull request may close these issues:

CPU hidden state tensor in GPU lstm layer causes CUDA corruption

7 participants