Fix 64-bit indexing in GridSampler #41923
Conversation
💊 Dr. CI failures summary (as of commit 29a8131): ci.pytorch.org reported 1 failure. See the Dr. CI page for details.
Force-pushed from 22d239e to a78e8c3.
Force-pushed from 8561870 to bb2d0a6.
@zou3519 do you think you'd be able to review this? Reassign it back to me if you cannot.
Yeah, I can take a look
What is the decision to be made here?
The decision is: (a) include a 64-bit indexed kernel (increasing binary size), or (b) raise an error when 32-bit indexing would overflow. Option (b) is sketched below.
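For concreteness, here is a minimal Python-level sketch of what option (b) would mean, assuming a check in the spirit of the C++ helper canUse32BitIndexMath; the function name and the error message below are hypothetical:

```python
import torch

INT32_MAX = 2**31 - 1

def check_32bit_indexable(*tensors):
    # Hypothetical mirror of the C++-side canUse32BitIndexMath check:
    # refuse any input whose element count cannot be addressed with a
    # signed 32-bit index.
    for t in tensors:
        if t.numel() > INT32_MAX:
            raise RuntimeError(
                "grid_sampler: tensor too large for 32-bit indexing")

check_32bit_indexable(torch.empty(8))  # small tensor: passes silently
```

Option (a) avoids this error at the cost of compiling a second, 64-bit-indexed instantiation of each kernel.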
Force-pushed from fa9fda3 to c61b5ff.
zou3519 left a comment:
This is going to take me a while to read through, but here are some initial comments on the testing.
Force-pushed from fe423f7 to fa74869.
I read through the CUDA kernel and the logic looks good to me. I think we should try to add a test for the backward case of 64-bit indexing to make sure that works as expected. I still haven't gone through the CPU kernels yet (will do that tomorrow).
Force-pushed from 452dab6 to 34e0a10.
Force-pushed from b18aea4 to 60d031c.
test/test_nn.py (outdated)
Where does this number come from? Looking at the test, I would have expected:
- for the im tensor: 32769 * 65536 * element_size
- for the small_image tensor: 32769 * element_size
Everything else looks pretty small.
A comment for how we got 32769 * (65536 + 3 * 65536 / 128) * torch.tensor([], dtype=dtype).element_size() would be nice for future devs looking to modify the test. (A possible breakdown is sketched below.)
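For future reference, here is a hedged guess at the breakdown being requested; the shapes and the meaning of each term are assumptions, not taken from the actual test:

```python
import torch

dtype = torch.float32
element_size = torch.tensor([], dtype=dtype).element_size()  # 4 for float32

# Assumed breakdown (hypothetical -- verify against the test itself):
#   im tensor:     32769 * 65536 elements
#   other tensors: assumed to add 3 * 65536 / 128 elements per row
im_elems = 32769 * 65536
extra_elems = 32769 * 3 * 65536 // 128

required_bytes = (im_elems + extra_elems) * element_size
print(required_bytes / 2**30, "GiB")  # rough memory needed by the test
```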
zou3519 left a comment:
Both the CPU and CUDA implementations look correct to me. I had some comments and questions around the testing; I think after we beef up the testing, this should be good to go.
aten/src/ATen/native/GridSampler.cpp (outdated)
Note: std::round had to be changed to std::nearbyint, which matches the rounding behaviour of Vec256<float>::round. This incompatibility will also be an issue for CUDA and the 3D version.
On the bright side... no one has complained about this, despite the vectorized path's behaviour being around for ~2 years. We should file an issue sometime about the inconsistency. (The two rounding conventions are illustrated below.)
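For illustration: the two conventions differ only on exact .5 halfway cases. std::nearbyint (under the default rounding mode, and likewise the vectorized round) rounds halves to the nearest even integer, while std::round rounds them away from zero. torch.round follows the half-to-even convention, so a quick Python demonstration:

```python
import math
import torch

x = torch.tensor([0.5, 1.5, 2.5, -0.5])

# Half-to-even, as in std::nearbyint / Vec256<float>::round:
print(torch.round(x))  # tensor([0., 2., 2., -0.])

# Half-away-from-zero, as in std::round:
print([math.floor(v + 0.5) if v >= 0 else math.ceil(v - 0.5)
       for v in x.tolist()])  # [1, 2, 3, -1]
```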
test/test_nn.py (outdated)
The 1e-10 was bumped to be above float epsilon, so that it rounds differently from 0.0.
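One way to see the problem (a sketch of the scenario; the actual test values differ): below float32 epsilon, an offset added to a coordinate is absorbed entirely, making it indistinguishable from 0.0:

```python
import torch

eps = torch.finfo(torch.float32).eps  # ~1.1921e-07

x = torch.tensor([1.0], dtype=torch.float32)
print((x + 1e-10 == x).item())    # True  -- offset absorbed, same as +0.0
print((x + 2 * eps == x).item())  # False -- offset large enough to survive
```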
Force-pushed from 24e1d66 to 68cf358.
zou3519 left a comment:
The new tests for the CPU fallback look great, so here are some last comments. I'm going to make another pass through the PR, but I think it looks good.
test/test_nn.py (outdated)
nit: Maybe provide a little more context here by writing a note? The reason we support the unvectorized CPU fallback is that it is used for 64-bit indexing with fp32 inputs, so this is a parity test that the unvectorized fallback works as advertised.
Furthermore, it would be nice to refer back to this note in the other changes to test_grid_sample. For example, as a code reader, I would be wondering why we're converting input and grid to float32 in places below (the answer is that the fallback only really gets used for float32). A sketch of such a parity check follows.
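A minimal sketch of the kind of parity check being discussed, assuming the float32 path is compared against a float64 reference; the shapes and tolerances below are illustrative, not taken from the actual test:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
input64 = torch.randn(1, 1, 8, 8, dtype=torch.float64)
grid64 = torch.rand(1, 4, 4, 2, dtype=torch.float64) * 2 - 1  # in [-1, 1]

# float64 reference vs the float32 path (the dtype for which the
# unvectorized fallback actually gets used):
out64 = F.grid_sample(input64, grid64, mode='bilinear',
                      padding_mode='zeros', align_corners=False)
out32 = F.grid_sample(input64.float(), grid64.float(), mode='bilinear',
                      padding_mode='zeros', align_corners=False)

torch.testing.assert_allclose(out32.double(), out64, rtol=1e-4, atol=1e-5)
```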
test/test_nn.py (outdated)
We already compare the gradients in the test; I don't think there is a need to do that again here. My guess is that gradcheck adds some extra tests, but I don't know what, concretely.
My understanding is that gradcheck compares against a numerical jacobian calculated from the change in the outputs after taking small "steps" in the input space. It doesn't work with the fallback here because it needs double precision to make those small steps. (See the sketch below.)
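As an illustration of that mechanism (shapes and tolerances are made up; note the float64 inputs, which is exactly why gradcheck cannot exercise the float32-only fallback):

```python
import torch
import torch.nn.functional as F
from torch.autograd import gradcheck

inp = torch.randn(1, 1, 4, 4, dtype=torch.float64, requires_grad=True)
grid = (torch.rand(1, 2, 2, 2, dtype=torch.float64) * 2 - 1).requires_grad_(True)

def fn(inp, grid):
    return F.grid_sample(inp, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=False)

# gradcheck perturbs each input element by eps, builds a numerical
# jacobian from the change in the outputs, and compares it against the
# analytical jacobian from autograd.
assert gradcheck(fn, (inp, grid), eps=1e-6, atol=1e-4)
```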
nit: remove empty line at top of file?
zou3519 left a comment:
This LGTM. @peterbell10, thank you for going through all of the comments and putting this PR into order; there were a lot of moving parts.
Could you rebase the PR onto master so that we can get some signal from the CI? There was a commit from yesterday that broke CI, but that should have been reverted.
Force-pushed from 913a753 to 29a8131.
facebook-github-bot left a comment:
@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
(FYI - I am still trying to land this PR. Will report back if anything goes wrong.)
Fixes #41656

For the CPU version, this is a regression introduced in #10980, which vectorized the grid_sampler_2d implementation. It uses the AVX2 gather intrinsic, which for float requires 32-bit indexing to match the number of floats in the AVX register. There is an i64gather_ps variant, but it only utilizes half of the vector width, so it would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.

For the CUDA version, this operation has never supported 64-bit indexing, so this isn't a regression. I've templated the kernels on index type and added 64-bit variants, although I gather that in some places a simple TORCH_CHECK(canUse32BitIndexMath(...)) is used instead. So, there is a decision to be made here.
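To make the half-width point concrete: a 256-bit AVX register holds eight 32-bit indices but only four 64-bit indices, so a gather driven by 64-bit indices (the i64gather_ps variant) can fetch at most four floats per instruction, versus eight for the 32-bit variant:

```python
REG_BITS = 256  # width of an AVX/AVX2 register

print(REG_BITS // 32)  # 8 floats per gather with 32-bit indices
print(REG_BITS // 64)  # 4 floats per gather with 64-bit indices
```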