Preemptively test for out-of-order length. #13933
Conversation
It would be good to understand where the segfault comes from, because I can’t see how this solution is safer than the previous algorithm.
"It would be good to understand where the segfault comes from": I'm not an expert in C++ memory, but it seems to be an issue in how the stack gets unwound when the exception is thrown. The error message suggests a double free. Stepping through in the debugger, I see that the original code reaches the expected AT_CHECK, satisfies the condition, and throws the error as expected, but then segfaults when returning. I think failing faster is safer because nothing has been allocated yet that would need to be freed when the exception is thrown.
Summary:
torch.nn.utils.rnn.pack_padded_sequence segfaults if lengths are not in decreasing order (pytorch#13324). We were seeing this segfault on throw; pre-emptively checking avoids it:
*** Error in `/home/bvaughan/anaconda3/bin/python': double free or corruption (!prev): 0x00005555566e7510 ***
Test Plan:
Added unit test based on the example provided in the issue.
Force-pushed from 4effde1 to a14623f.
Did ASAN report anything on this test? It's possible it's a stack-unwinding problem, but I don't see anything on the stack in this function that should cause it. ASAN might be able to say something more precise.
Here's the full output running with ASAN; I haven't read through it in detail yet:

In [6]: pack_padded_sequence(b_a, [22, 25])
==213533==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60f000027c90 at pc 0x7f510d37d464 bp 0x7fff530ffab0 sp 0x7fff530ffaa8
0x60f000027c90 is located 0 bytes to the right of 176-byte region [0x60f000027be0,0x60f000027c90)
SUMMARY: AddressSanitizer: heap-buffer-overflow caffe2/aten/src/ATen/native/PackedSequence.cpp:67 in at::native::_pack_padded_sequence(at::Tensor const&, at::Tensor const&, bool)
I'm a bit confused that the original error was "double free or corruption," since when I stepped through the code it got to line 71 to throw before segfaulting.
Hi @nairbv and @apaszke, I think I know what might be causing the exception. On line 33, we have:

at::Tensor batch_sizes_t = at::empty(lengths[0], _lengths.options());

If the lengths parameter is not sorted in decreasing order, then lengths[0] is not the maximum length, so batch_sizes_t is allocated too small and the fill loop writes past its end. So this code should be something like (I have no skill using the ATen library :( ):

at::Tensor batch_sizes_t = at::empty(at::max(lengths), _lengths.options());

I don't know if I'm right? Does it make sense to you guys? I didn't test anything, just passed my eyes over the code, so the probability that my analysis is wrong is really high. Hope I could help.
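The failure mode @igormq describes can be modeled in plain Python (a hypothetical sketch of the buggy pattern, not the actual ATen code): the buffer is sized by lengths[0] on the assumption that it is the maximum, but the fill loop runs for max(lengths) timesteps, so out-of-order input indexes past the end.

```python
def buggy_batch_sizes(lengths):
    # Sketch of the bug: buffer sized by lengths[0] (assumed maximum),
    # but one entry is written per timestep up to max(lengths).
    batch_sizes = [0] * lengths[0]  # too small if lengths[0] < max(lengths)
    for t in range(max(lengths)):
        # number of sequences still active at timestep t
        batch_sizes[t] = sum(1 for l in lengths if l > t)
    return batch_sizes

buggy_batch_sizes([25, 22])    # sorted input: fine, buffer has 25 slots
# buggy_batch_sizes([22, 25]) # unsorted: IndexError past the 22-slot buffer
```

In Python the out-of-bounds write raises IndexError; in C++ the same write silently corrupts the heap, which is exactly what ASAN flagged as a heap-buffer-overflow.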
@igormq Ah, yes, that does make sense, and much better explains why this PR avoided the segfault. Thanks!
The issue, as @igormq discovered and @nairbv explained to me offline, was that we were performing the check for sortedness too late and, as a result, indexing past the end of the batch_sizes_t tensor, causing a buffer overflow.
This looks good to me; I had two minor comments in the code (please read them!). I think it would be nice in the future to remove the sortedness requirement, but that needs discussion.
      (*batch_sizes++) = current_batch_size;
    }
    prev_l = l;
  } else if (prev_l > l) {
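For context on the fragment above: the loop is filling batch_sizes, where entry t is the number of sequences still active at timestep t. A rough Python equivalent of that computation (my sketch, assuming lengths is sorted in decreasing order so lengths[0] is the maximum):

```python
def batch_sizes_from_sorted(lengths):
    # lengths must be sorted in decreasing order; batch_sizes[t] is the
    # number of sequences whose length exceeds timestep t.
    max_len = lengths[0]
    return [sum(1 for l in lengths if l > t) for t in range(max_len)]

batch_sizes_from_sorted([25, 22])  # 22 entries of 2, then 3 entries of 1
```

The real C++ loop walks the sorted lengths and emits runs of a constant batch size, but the result is the same count per timestep.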
  AT_CHECK(lengths[batch_size - 1] > 0,
           "Length of all samples has to be greater than 0, but found an element "
           "in 'lengths' that is <= 0");
  for (auto i = 0; i < batch_size - 1; i++) {
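The validation above can be mirrored in Python (a hypothetical analogue of the patch's checks, run before anything is allocated, which is what makes the failure safe):

```python
def validate_lengths(lengths):
    # Since 'lengths' must be sorted in decreasing order, checking the
    # last (smallest) element suffices for positivity.
    if lengths[-1] <= 0:
        raise ValueError(
            "Length of all samples has to be greater than 0, but found "
            "an element in 'lengths' that is <= 0")
    # Fail fast on out-of-order input, before any buffer is allocated.
    for i in range(len(lengths) - 1):
        if lengths[i] < lengths[i + 1]:
            raise ValueError(
                "'lengths' array has to be sorted in decreasing order")
```

Because the check precedes allocation, a bad input produces a clean exception instead of an out-of-bounds write discovered only when the heap is torn down.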
found reason for segfault (see other comments)
facebook-github-bot
left a comment
@nairbv is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Pull Request resolved: pytorch/pytorch#13933
Differential Revision: D13090389
Pulled By: nairbv
fbshipit-source-id: 6f6b319e74cb55830be799e9c46bc33aa59256d8