Conversation

@weiyangfb
Contributor

@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb
Contributor Author

@pytorchbot retest this please

1 similar comment
@weiyangfb
Contributor Author

@pytorchbot retest this please

@weiyangfb weiyangfb added the ready for review (this tag is deprecated) All PRs are ready for review unless they are draft, WIP, or have undismissed requested changes label Aug 14, 2018
@fmassa left a comment (Member)

LGTM, but I wonder if those checks shouldn't be in the functional interface instead?

@ssnl
Collaborator

ssnl commented Aug 28, 2018

Yeah, I agree with @fmassa. I'm also wondering whether this would be better placed in ATen, now that @apaszke has moved RNNs into C++.

@weiyangfb weiyangfb removed the ready for review (this tag is deprecated) All PRs are ready for review unless they are draft, WIP, or have undismissed requested changes label Aug 29, 2018
@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch from e5d097e to 2658720 Compare August 29, 2018 20:33
@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@weiyangfb
Contributor Author

@pytorchbot retest this please

@weiyangfb
Contributor Author

@fmassa @ssnl I've moved the checks into ATen. Let me know if it looks reasonable.

@apaszke
Contributor

apaszke commented Sep 4, 2018

Also, it might be a better idea to push the checks to the cuDNN path, because otherwise we'll end up repeating them later anyway in the autograd code.

@weiyangfb
Contributor Author

@apaszke I moved the checks to the cuDNN path, and also kept them in the non-cuDNN path.


@apaszke left a comment

I'm confused now. What I meant is that the device checks are strictly necessary only in the cuDNN path, but what you did here is add them in both paths, which makes the cuDNN path go through the checks twice.


@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch 4 times, most recently from e617a89 to e6ce1a0 Compare September 16, 2018 07:02
@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl
Collaborator

ssnl commented Sep 17, 2018

@apaszke I think it is reasonable to check device consistency in both the cuDNN and non-cuDNN code, though.

@apaszke
Contributor

apaszke commented Sep 17, 2018

@ssnl but the devices will be checked in the native path anyway, since every single function we call will verify them.


@ngimel
Collaborator

ngimel commented Sep 17, 2018

#11680 is also related.

@ssnl
Collaborator

ssnl commented Sep 17, 2018

@apaszke Yes, I agree that the non-cuDNN path check is redundant. But it would be nice to give users a better error message. I'm fine with either, actually.
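
The trade-off being debated can be summarized as rough pseudocode (a sketch of the control flow under discussion, not the actual ATen code):

```
lstm(input, hx, params, ...):
  if use_cudnn(input):
    check_device(input, hx, params)  # needed here: cuDNN itself fails unhelpfully
    return cudnn_lstm(input, hx, params, ...)
  else:
    # redundant in principle: each native op re-checks devices on its own,
    # but a single up-front check yields a clearer error message
    check_device(input, hx, params)
    return native_lstm(input, hx, params, ...)
```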

@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch 3 times, most recently from 6ee12bb to 04491ff Compare September 17, 2018 20:44
@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@apaszke
Contributor

apaszke commented Sep 18, 2018

@ssnl note that the error message we can give from the C++ API is not very helpful anyway. The weights have very complex names in Python, and I don't think we'll be able to reproduce them easily.

@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch from 723b058 to 26e166e Compare September 18, 2018 18:15
2. add check_device() function
@weiyangfb weiyangfb force-pushed the lstm_input_device_type branch from 26e166e to e8fff55 Compare September 18, 2018 20:03
@apaszke left a comment

Would be good to clean up check_tensors to use at::Device instead of unnecessarily comparing everything manually.

auto check_tensors = [&](const std::string& name, const Tensor& t) {
  if (!t.defined()) return;
  auto t_device = t.device();
  bool t_device_is_cuda = t_device.is_cuda();
  ...
};

for (auto p : params) {
  // if (!p.defined()) continue;
  ...
}
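
Concretely, the cleanup could look roughly like the sketch below, which compares whole device values with operator== instead of unpacking is_cuda() and the index by hand. Note that Device, Tensor, and check_tensor_device here are minimal stand-ins invented for illustration, not the actual ATen types or names:

```cpp
#include <stdexcept>
#include <string>

// Minimal stand-in for at::Device (illustration only, not the ATen type).
struct Device {
  bool cuda;   // true = CUDA, false = CPU
  int index;   // CUDA ordinal; ignored for CPU
  bool operator==(const Device& o) const {
    return cuda == o.cuda && (!cuda || index == o.index);
  }
  bool operator!=(const Device& o) const { return !(*this == o); }
};

// Minimal stand-in tensor carrying only what the check needs.
struct Tensor {
  bool is_defined;
  Device dev;
  bool defined() const { return is_defined; }
  Device device() const { return dev; }
};

// One whole-device comparison replaces the manual bookkeeping above.
void check_tensor_device(const Device& input_device, const std::string& name,
                         const Tensor& t) {
  if (!t.defined()) return;  // undefined tensors (e.g. absent bias) are skipped
  if (t.device() != input_device) {
    throw std::runtime_error(
        "Input and " + name + " tensors are not at the same device");
  }
}
```

With this shape, the loop over params reduces to calling check_tensor_device(input_device, "parameter", p) for each p.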

@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot left a comment

weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 19, 2018
Summary:
- fixes #9534
Pull Request resolved: pytorch/pytorch#10185

Differential Revision: D9141222

Pulled By: weiyangfb

fbshipit-source-id: bb652e42cc15917019df080d6bce2926b18f3476
@ezyang ezyang added the merged label Jun 26, 2019
Successfully merging this pull request may close these issues:

CPU hidden state tensor in GPU lstm layer causes CUDA corruption

7 participants