Print tensor shapes and convolution parameters when cuDNN exception is thrown #45023
Conversation
There is this known cuDNN error in 8.0.0 and 8.0.1. It was fixed in 8.0.2, which is the version we are currently running on CI. It would be hard to test this feature, but you can take this diff and test it on cuDNN 8.0.0 or 8.0.1:

diff --git a/torch/testing/_internal/common_nn.py b/torch/testing/_internal/common_nn.py
index 85faea1957634..70c49c185ed6a 100644
--- a/torch/testing/_internal/common_nn.py
+++ b/torch/testing/_internal/common_nn.py
@@ -2114,6 +2114,14 @@ def fractional_max_pool3d_test(test_case):
cudnn=True,
check_with_long_tensor=True,
),
+ dict(
+ fullname='Conv3d_groups_40999',
+ constructor=lambda: nn.Conv3d(2, 4, kernel_size=3, groups=2),
+ cpp_constructor_args='torch::nn::Conv3dOptions(2, 4, 3).groups(2)',
+ input_size=(1, 2, 3, 3, 3),
+ cudnn=True,
+ check_with_long_tensor=True,
+ ),
dict(
fullname='Conv3d_dilated',
constructor=lambda: nn.Conv3d(3, 4, kernel_size=2, dilation=2),

Example stack trace after this PR:
cc @ptrblck, @zasdfgbnm to check if there are any extra parameters that need to be printed.
💊 CI failures summary and remediations, as of commit c3a6719 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
My suggestion is to format the error message so that it can easily be used to reproduce the error in a standalone example. Unfortunately, you cannot always guarantee that the algo will be the same, but if deterministic/benchmark/tf32 are the same, chances are good.
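For illustration, a generated repro in that style might look like the following. This is only a sketch, reusing the Conv3d shapes from the test diff above; the exact lines the PR emits are shown in the review snippets further down.

# Sketch of a standalone repro the error message could emit.
# Shapes, dtype, and backend flags here are illustrative assumptions.
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = False      # match the failing run's benchmark setting
torch.backends.cudnn.deterministic = False  # match its deterministic setting
torch.backends.cudnn.allow_tf32 = True      # match its TF32 setting

data = torch.randn(1, 2, 3, 3, 3, device="cuda", requires_grad=True)
net = nn.Conv3d(2, 4, kernel_size=3, groups=2)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))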
Hmm, the descriptor pointer address print has already been there for 3 years; I'm not sure it is needed. The data pointer address would definitely be needed. In the example I showed above, you can see the weight pointer is misaligned to a 16-byte position, which caused the original cuDNN problem.
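As an aside, alignment like this can be checked from Python; a minimal sketch (the slicing below is just one easy way to obtain a misaligned pointer):

# CUDA allocations are typically well aligned; a view with a small
# storage offset is an easy way to land off a 16-byte boundary.
import torch

base = torch.randn(100, device="cuda")
view = base[1:]                  # offset by one float32 (4 bytes)
print(base.data_ptr() % 16)      # usually 0
print(view.data_ptr() % 16)      # usually 4, i.e. misaligned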
@ngimel, generation of the Python repro code snippet has been added. Is this ready to go?
This reverts commit 6134510.
ss << "groups=" << args.params.groups << ")" << to_channels_last << "\n";
ss << "net = net.cuda()." << partial_dtype << "()\n";
ss << "out = net(data)\n";
ss << "out.backward(torch.randn_like(out))\n";
Should we include this only when the error message is for backward?
Sure, that is a good point. However, if we only wanted to include this in the backward pass, the codegen logic would become much more complicated. Besides, letting users run an extra backward pass does basically no harm.
Related: #37160
}
};

std::ostream& operator<<(std::ostream & out, const FilterDescriptor& d);
You might want to mark this with TORCH_CUDA_API
Adding TORCH_CUDA_API here introduces failing tests (not sure if they are relevant), like:
Sep 25 19:10:06 Exception occurred:
Sep 25 19:10:06 File "/opt/conda/lib/python3.6/site-packages/sphinx/domains/cpp.py", line 6099, in _parse_type
Sep 25 19:10:06 raise self._make_multi_error(prevErrors, header)
Sep 25 19:10:06 sphinx.util.cfamily.DefinitionError: Error when parsing function declaration.
Sep 25 19:10:06 If the function has no return type:
Sep 25 19:10:06 Error in declarator or parameters-and-qualifiers
Sep 25 19:10:06 Invalid C++ declaration: Expecting "(" in parameters-and-qualifiers. [error at 15]
Sep 25 19:10:06 TORCH_CUDA_API std::ostream & operator<< (std::ostream &out, const FilterDescriptor &d)
Sep 25 19:10:06 ---------------^
Sep 25 19:10:06 If the function has a return type:
Sep 25 19:10:06 Error in declarator or parameters-and-qualifiers
Sep 25 19:10:06 If pointer to member declarator:
Sep 25 19:10:06 Invalid C++ declaration: Expected '::' in pointer to member (function). [error at 28]
Sep 25 19:10:06 TORCH_CUDA_API std::ostream & operator<< (std::ostream &out, const FilterDescriptor &d)
Sep 25 19:10:06 ----------------------------^
Sep 25 19:10:06 If declarator-id:
Sep 25 19:10:06 Invalid C++ declaration: Expecting "(" in parameters-and-qualifiers. [error at 28]
Sep 25 19:10:06 TORCH_CUDA_API std::ostream & operator<< (std::ostream &out, const FilterDescriptor &d)
Sep 25 19:10:06 ----------------------------^
https://dr.pytorch.org/api/view-log-full?build_id=137561284
https://dr.pytorch.org/api/view-log-full?build_id=137564136
The operator<< for TensorDescriptor next to it doesn't have TORCH_CUDA_API, so I guess we don't need it?
}
};

std::string repro_from_args(const ConvolutionArgs& args) {
I'm a little ambivalent about this. On the one hand, it is cool to produce repros in this style. But on the other hand, ensuring that the sample code here doesn't go stale is not going to be too easy. A compromise might be to produce the data in question here into a generic machine readable format (e.g., JSON) and then have a little Python script on the other side that takes that data and reinterprets it into an actual sequence of torch calls.
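A minimal sketch of that compromise, assuming a hypothetical JSON payload (the field names below are invented for illustration, not the PR's actual format):

# Reader side: turn a machine-readable conv-params dump back into torch calls.
import json
import torch
import torch.nn as nn

def repro_from_json(payload):
    p = json.loads(payload)
    data = torch.randn(*p["input_size"], device="cuda")
    net = nn.Conv3d(p["in_channels"], p["out_channels"],
                    kernel_size=p["kernel_size"], stride=p["stride"],
                    padding=p["padding"], dilation=p["dilation"],
                    groups=p["groups"]).cuda()
    out = net(data)
    out.backward(torch.randn_like(out))

repro_from_json('{"input_size": [1, 2, 3, 3, 3], "in_channels": 2, '
                '"out_channels": 4, "kernel_size": 3, "stride": 1, '
                '"padding": 0, "dilation": 1, "groups": 2}')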
I'm OK with adding this in as a one off, but if we start doing this pattern in other places in our codebase, we should think about some general infrastructure for doing this sort of thing.
We could also expose this as a function and test that the produced Python code is runnable.
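Something along these lines, for instance (get_conv_repro is a hypothetical binding of repro_from_args; nothing like it is exposed today):

# Hypothetical test: generate the repro source and assert it actually runs.
def test_repro_code_is_runnable(get_conv_repro):
    code = get_conv_repro()                      # string of Python source
    exec(compile(code, "<repro>", "exec"), {})   # raises if the snippet is broken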
Thanks. I checked your proposal at #37160, which is very nice and comprehensive. Since the convolution interface is mostly guaranteed to stay backward compatible and stable, the sample code here should work for a long time. I'm glad to refactor this in the way you mentioned once a better "repro code generation" mechanism is available, and I'm looking forward to it.
facebook-github-bot left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Originally introduced in #45023. When I tested the original PR, I used a Conv3d, so this problem was not discovered. Arrays in `ConvolutionParams` have a fixed length of 3 or 5, because `max_dim` is set as a constexpr of 3 regardless of Conv2d or Conv3d. The current code makes some error messages look weird; see the comments below. https://github.com/pytorch/pytorch/blob/9201c37d020007979e144693d86c8e8599e2fd8f/aten/src/ATen/native/cudnn/Conv.cpp#L212-L226
Pull Request resolved: #45729
Reviewed By: mruberry
Differential Revision: D24081542
Pulled By: ngimel
fbshipit-source-id: 141f8946f4d0db63a723131775731272abeaa6ab
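In Python terms, the bug and the fix look roughly like this (a sketch; the field names are assumed for illustration):

# ConvolutionParams keeps arrays sized for max_dim == 3 spatial dims,
# so printing every slot for a Conv2d leaks the unused trailing entry.
params = {"dim": 4, "stride": [2, 2, 0], "padding": [1, 1, 0]}  # a 2d conv: (N, C, H, W)
spatial_dims = params["dim"] - 2
print("stride =", params["stride"])                 # before: [2, 2, 0], stale slot included
print("stride =", params["stride"][:spatial_dims])  # after: [2, 2]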
Originally proposed at #44473 (comment) by @colesbury.
This PR adds the functionality to print relevant tensor shapes and convolution parameters along with the stack trace once a cuDNN exception is thrown.
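For a sense of the user-facing behavior, a hedged usage sketch, reusing the groups=2 Conv3d from the test diff above; on an affected cuDNN build (8.0.0/8.0.1), the RuntimeError raised here would now carry the shapes, convolution parameters, and a repro snippet:

import torch
import torch.nn as nn

net = nn.Conv3d(2, 4, kernel_size=3, groups=2).cuda()
data = torch.randn(1, 2, 3, 3, 3, device="cuda", requires_grad=True)
try:
    out = net(data)
    out.backward(torch.randn_like(out))
except RuntimeError as err:
    print(err)  # message now includes tensor shapes and convolution parameters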