Conversation

@peterbell10
Collaborator

Fixes #41656

For the CPU version, this is a regression introduced in #10980, which vectorized the grid_sampler_2d implementation. That implementation uses the AVX2 gather intrinsic, which for float requires 32-bit indices to match the number of floats in the AVX register. There is also an i64gather_ps variant, but it only utilizes half of the vector width, so it would be expected to give worse performance in the more likely case where 32-bit indexing is sufficient. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.
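A rough sketch of the dispatch shape this describes, in C++ (the function and kernel names below are placeholders, not the actual ATen symbols):

    #include <ATen/ATen.h>
    #include <cstdint>
    #include <limits>

    // Placeholder declarations standing in for the vectorized AVX2 kernel and the
    // reinstated scalar fallback; the real ATen functions have different names.
    void grid_sample_2d_vectorized(const at::Tensor&, const at::Tensor&, at::Tensor&);
    void grid_sample_2d_fallback(const at::Tensor&, const at::Tensor&, at::Tensor&);

    void grid_sample_2d_cpu(const at::Tensor& input, const at::Tensor& grid,
                            at::Tensor& output) {
      // The AVX2 path gathers with _mm256_i32gather_ps, so every flattened
      // element offset must fit in a signed 32-bit integer.
      const auto limit = static_cast<int64_t>(std::numeric_limits<int32_t>::max());
      const bool fits_in_int32 = input.numel() <= limit &&
                                 grid.numel() <= limit &&
                                 output.numel() <= limit;
      if (fits_in_int32) {
        grid_sample_2d_vectorized(input, grid, output);  // fast vectorized path
      } else {
        grid_sample_2d_fallback(input, grid, output);    // scalar 64-bit-safe fallback
      }
    }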

For the CUDA version, this operation has never supported 32-bit indexing so this isn't a regression. I've templated the kernel on index type and added 64-bit variants. Although I gather in some places a simple TORCH_CHECK(canUse32BitIndexMath(...)) is used instead. So, there is a decision to be made here.
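Roughly, the CUDA side described above looks like the following sketch (the launcher name is a placeholder; canUse32BitIndexMath is the existing ATen helper, though its header location varies by version):

    #include <ATen/ATen.h>
    #include <ATen/cuda/detail/IndexUtils.cuh>  // at::cuda::detail::canUse32BitIndexMath

    // Placeholder launcher templated on the index type, standing in for the
    // real kernel launch in the grid sampler CUDA code.
    template <typename index_t>
    void launch_grid_sampler_2d_kernel(const at::Tensor& input,
                                       const at::Tensor& grid,
                                       at::Tensor& output);

    void grid_sampler_2d_cuda(const at::Tensor& input, const at::Tensor& grid,
                              at::Tensor& output) {
      using at::cuda::detail::canUse32BitIndexMath;
      if (canUse32BitIndexMath(input) && canUse32BitIndexMath(grid) &&
          canUse32BitIndexMath(output)) {
        launch_grid_sampler_2d_kernel<int>(input, grid, output);      // 32-bit index variant
      } else {
        launch_grid_sampler_2d_kernel<int64_t>(input, grid, output);  // new 64-bit index variant
      }
    }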

@dr-ci

dr-ci bot commented Jul 23, 2020

💊 CI failures summary and remediations

As of commit 29a8131 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@peterbell10 force-pushed the gridsampler-64bit branch 4 times, most recently from 8561870 to bb2d0a6, on July 23, 2020 16:16
@mrshenli added the module: cuda (Related to torch.cuda, and CUDA support in general), module: nn (Related to torch.nn), and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels on Jul 23, 2020
@ezyang requested a review from zou3519 on July 27, 2020 14:29
@ezyang
Contributor

ezyang commented Jul 27, 2020

@zou3519 do you think you'd be able to review this? Reassign it back to me if you cannot.

@zou3519
Contributor

zou3519 commented Jul 27, 2020

Yeah, I can take a look

@zou3519
Contributor

zou3519 commented Jul 28, 2020

For the CUDA version, this operation has never supported 32-bit indexing so this isn't a regression. I've templated the kernel on index type and added 64-bit variants. Although I gather in some places a simple TORCH_CHECK(canUse32BitIndexMath(...)) is used instead. So, there is a decision to be made here.

What is the decision to be made here?

@peterbell10
Collaborator Author

The decision is: a) Include a 64-bit indexed kernel (increasing binary size), or b) raise an error when 32-bit indexing would overflow.
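For reference, option (b) is roughly the guard pattern mentioned above, rather than a second kernel instantiation (a fragment-level sketch, not the PR's actual change):

    // Option (b), sketched: refuse inputs that need 64-bit indexing instead of
    // compiling a 64-bit kernel variant; smaller binary, but very large inputs
    // become a hard error instead of working.
    TORCH_CHECK(at::cuda::detail::canUse32BitIndexMath(input) &&
                at::cuda::detail::canUse32BitIndexMath(grid) &&
                at::cuda::detail::canUse32BitIndexMath(output),
                "grid_sampler: input tensors are too large to index with 32-bit math");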

@zou3519 left a comment
Contributor

This is going to take me a while to read through but here are some initial comments for the testing

@zou3519 left a comment
Contributor

I read through the CUDA kernel and the logic looks good to me. I think we should try to add a test for the backward case of 64-bit indexing to make sure that works as expected. I still haven't gone through the CPU kernels yet (will do that tomorrow)

test/test_nn.py Outdated
Contributor

Where does this number come from? Looking at the test, I would have expected:

  • for the im tensor: 32769 * 65536 * element_size
  • for the small_image tensor: 32769 * element_size

Everything else looks pretty small

Contributor

A comment for how we got 32769 * (65536 + 3 * 65536 / 128) * torch.tensor([], dtype=dtype).element_size() would be nice for future devs looking to modify the test

@zou3519 left a comment
Contributor

Both the CPU and CUDA implementations look correct to me. I had some comments and questions around the testing. I think after we beef up the testing then this should be good to go.

Collaborator Author

Note: std::round had to be changed to std::nearbyint, which matches the rounding behaviour of Vec256<float>::round. This incompatibility will also be an issue for CUDA and the 3D version.
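For context, std::round rounds halfway cases away from zero, whereas std::nearbyint honours the current floating-point rounding mode, which defaults to round-half-to-even and matches the SIMD rounding behaviour. A minimal illustration:

    #include <cmath>
    #include <cstdio>

    int main() {
      // std::round: halfway cases go away from zero.
      std::printf("round(0.5) = %.1f, round(2.5) = %.1f\n",
                  std::round(0.5), std::round(2.5));          // 1.0, 3.0
      // std::nearbyint: default rounding mode is round-half-to-even.
      std::printf("nearbyint(0.5) = %.1f, nearbyint(2.5) = %.1f\n",
                  std::nearbyint(0.5), std::nearbyint(2.5));  // 0.0, 2.0
      return 0;
    }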

Contributor

On the bright side... no one has complained about this, despite the vectorized path behavior being around for ~2 years. We should file an issue sometime about the inconsistency.

test/test_nn.py Outdated
Collaborator Author

1e-10 was bumped to be above float epsilon because it rounds differently from 0.0.
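For context, FLT_EPSILON is about 1.19e-7, so an offset of 1e-10 is absorbed entirely when added to values of order one in float precision; a coordinate sitting at a rounding boundary then rounds as if the offset were 0.0, while a double-precision reference still sees it. A small standalone illustration (hypothetical values, not the actual test inputs):

    #include <cfloat>
    #include <cmath>
    #include <cstdio>

    int main() {
      std::printf("FLT_EPSILON = %g\n", FLT_EPSILON);
      // In float, 0.5f + 1e-10f is exactly 0.5f again, so round-to-even gives 0.
      float f = 0.5f + 1e-10f;
      // In double, the offset survives and tips the halfway case up to 1.
      double d = 0.5 + 1e-10;
      std::printf("float: %.1f  double: %.1f\n",
                  std::nearbyint(f), std::nearbyint(d));  // 0.0  1.0
      return 0;
    }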

@zou3519 left a comment
Contributor

The new tests for the CPU fallback look great, so here are some last comments. I'm going to make another pass through the PR, but I think it looks good.

test/test_nn.py Outdated
Contributor

nit: Maybe provide a little more context here by writing a note? The reason we keep the unvectorized CPU fallback is that it is used for 64-bit indexing with fp32 inputs, so this is a parity test that the unvectorized fallback works as advertised.

Furthermore, it would be nice to refer back to this note in the other changes to test_grid_sample. For example, as a code reader, I would be wondering why we're converting input and grid to float32 in places below (the answer is because the fallback only really gets used for float32)

test/test_nn.py Outdated
Comment on lines 6968 to 6980
Contributor

We already compare the gradients in the test, so I don't think there is a need to do that again here. My guess is that gradcheck adds some extra checks, but I don't know what concretely.

Collaborator Author

My understanding is that gradcheck compares against a numerical jacobian calculated from the change in the outputs after taking small "steps" in the input space. It doesn't work with the fallback here because it requires double precision to make those small steps.

test/test_nn.py Outdated
Contributor

A comment for how we got 32769 * (65536 + 3 * 65536 / 128) * torch.tensor([], dtype=dtype).element_size() would be nice for future devs looking to modify the test

Contributor

nit: remove empty line at top of file?

@zou3519 left a comment
Contributor

This LGTM. @peterbell10, thank you for going through all of the comments and putting this PR into order, there were a lot of moving parts.

Could you rebase the PR onto master so that we can get some signal from CI? There was a commit from yesterday that broke CI, but that should have been reverted.

@facebook-github-bot left a comment
Contributor

@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zou3519
Contributor

zou3519 commented Aug 6, 2020

(FYI - I am still trying to land this PR. Will report back if anything goes wrong)

@facebook-github-bot
Contributor

@zou3519 merged this pull request in 33519e1.


Labels

Merged, module: cuda (Related to torch.cuda, and CUDA support in general), module: nn (Related to torch.nn), open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Development

Successfully merging this pull request may close these issues.

torch.nn.functional.grid_sample segfaults on large inputs
