Fix 64-bit indexing in GridSampler #41923
Conversation
💊 Dr. CI failures summary (as of commit 29a8131): ci.pytorch.org reported 1 failure. See the Dr. CI page for details.
Force-pushed from 22d239e to a78e8c3.
Force-pushed from 8561870 to bb2d0a6.
@zou3519 do you think you'd be able to review this? Reassign it back to me if you cannot.
Yeah, I can take a look
What is the decision to be made here?
The decision is: (a) include a 64-bit indexed kernel (increasing binary size), or (b) raise an error when 32-bit indexing would overflow. Option (b) is sketched below.
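For concreteness, here is a minimal Python-level sketch of what option (b) would mean, assuming a check in the spirit of the C++ helper canUse32BitIndexMath; the function name and the error message below are hypothetical:

```python
import torch

INT32_MAX = 2**31 - 1

def check_32bit_indexable(*tensors):
    # Hypothetical mirror of the C++-side canUse32BitIndexMath check:
    # refuse any input whose element count cannot be addressed with a
    # signed 32-bit index.
    for t in tensors:
        if t.numel() > INT32_MAX:
            raise RuntimeError(
                "grid_sampler: tensor too large for 32-bit indexing")

check_32bit_indexable(torch.empty(8))  # small tensor: passes silently
```

Option (a) avoids this error at the cost of compiling a second, 64-bit-indexed instantiation of each kernel.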
Force-pushed from fa9fda3 to c61b5ff.
zou3519 left a comment:
This is going to take me a while to read through, but here are some initial comments on the testing.
Force-pushed from fe423f7 to fa74869.
I read through the CUDA kernel and the logic looks good to me. I think we should try to add a test for the backward case of 64-bit indexing to make sure that works as expected. I still haven't gone through the CPU kernels yet (will do that tomorrow).
Force-pushed from 452dab6 to 34e0a10.
Force-pushed from b18aea4 to 60d031c.
test/test_nn.py (outdated)
Where does this number come from? Looking at the test, I would have expected:
- for the im tensor: 32769 * 65536 * element_size
- for the small_image tensor: 32769 * element_size
Everything else looks pretty small.
A comment for how we got 32769 * (65536 + 3 * 65536 / 128) * torch.tensor([], dtype=dtype).element_size() would be nice for future devs looking to modify the test. (A possible breakdown is sketched below.)
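For future reference, here is a hedged guess at the breakdown being requested; the shapes and the meaning of each term are assumptions, not taken from the actual test:

```python
import torch

dtype = torch.float32
element_size = torch.tensor([], dtype=dtype).element_size()  # 4 for float32

# Assumed breakdown (hypothetical -- verify against the test itself):
#   im tensor:     32769 * 65536 elements
#   other tensors: assumed to add 3 * 65536 / 128 elements per row
im_elems = 32769 * 65536
extra_elems = 32769 * 3 * 65536 // 128

required_bytes = (im_elems + extra_elems) * element_size
print(required_bytes / 2**30, "GiB")  # rough memory needed by the test
```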
zou3519 left a comment:
Both the CPU and CUDA implementations look correct to me. I had some comments and questions around the testing; I think after we beef up the testing, this should be good to go.
aten/src/ATen/native/GridSampler.cpp (outdated)
Note: std::round had to be changed to std::nearbyint, which matches the rounding behaviour of Vec256<float>::round. This incompatibility will also be an issue for CUDA and the 3D version.
On the bright side... no one has complained about this, despite the vectorized path's behaviour being around for ~2 years. We should file an issue sometime about the inconsistency. (The two rounding conventions are illustrated below.)
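For illustration: the two conventions differ only on exact .5 halfway cases. std::nearbyint (under the default rounding mode, and likewise the vectorized round) rounds halves to the nearest even integer, while std::round rounds them away from zero. torch.round follows the half-to-even convention, so a quick Python demonstration:

```python
import math
import torch

x = torch.tensor([0.5, 1.5, 2.5, -0.5])

# Half-to-even, as in std::nearbyint / Vec256<float>::round:
print(torch.round(x))  # tensor([0., 2., 2., -0.])

# Half-away-from-zero, as in std::round:
print([math.floor(v + 0.5) if v >= 0 else math.ceil(v - 0.5)
       for v in x.tolist()])  # [1, 2, 3, -1]
```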
test/test_nn.py (outdated)
The 1e-10 was bumped to be above float epsilon, so that it rounds differently from 0.0.
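One way to see the problem (a sketch of the scenario; the actual test values differ): below float32 epsilon, an offset added to a coordinate is absorbed entirely, making it indistinguishable from 0.0:

```python
import torch

eps = torch.finfo(torch.float32).eps  # ~1.1921e-07

x = torch.tensor([1.0], dtype=torch.float32)
print((x + 1e-10 == x).item())    # True  -- offset absorbed, same as +0.0
print((x + 2 * eps == x).item())  # False -- offset large enough to survive
```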
Force-pushed from 24e1d66 to 68cf358.
zou3519 left a comment:
The new tests for the CPU fallback look great, so here are some last comments. I'm going to make another pass through the PR, but I think it looks good.
test/test_nn.py (outdated)
nit: Maybe provide a little more context here by writing a note? The reason we support the unvectorized CPU fallback is that it is used for 64-bit indexing with fp32 inputs, so this is a parity test that the unvectorized fallback works as advertised.
Furthermore, it would be nice to refer back to this note in the other changes to test_grid_sample. For example, as a code reader, I would be wondering why we're converting input and grid to float32 in places below (the answer is that the fallback only really gets used for float32). A sketch of such a parity check follows.
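A minimal sketch of the kind of parity check being discussed, assuming the float32 path is compared against a float64 reference; the shapes and tolerances below are illustrative, not taken from the actual test:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
input64 = torch.randn(1, 1, 8, 8, dtype=torch.float64)
grid64 = torch.rand(1, 4, 4, 2, dtype=torch.float64) * 2 - 1  # in [-1, 1]

# float64 reference vs the float32 path (the dtype for which the
# unvectorized fallback actually gets used):
out64 = F.grid_sample(input64, grid64, mode='bilinear',
                      padding_mode='zeros', align_corners=False)
out32 = F.grid_sample(input64.float(), grid64.float(), mode='bilinear',
                      padding_mode='zeros', align_corners=False)

torch.testing.assert_allclose(out32.double(), out64, rtol=1e-4, atol=1e-5)
```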
test/test_nn.py (outdated)
We already compare the gradients in the test; I don't think there is a need to do that again here. My guess is that gradcheck adds some extra tests, but I don't know what, concretely.
My understanding is that gradcheck compares against a numerical jacobian calculated from the change in the outputs after taking small "steps" in the input space. It doesn't work with the fallback here because it needs double precision to make those small steps. (See the sketch below.)
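As an illustration of that mechanism (shapes and tolerances are made up; note the float64 inputs, which is exactly why gradcheck cannot exercise the float32-only fallback):

```python
import torch
import torch.nn.functional as F
from torch.autograd import gradcheck

inp = torch.randn(1, 1, 4, 4, dtype=torch.float64, requires_grad=True)
grid = (torch.rand(1, 2, 2, 2, dtype=torch.float64) * 2 - 1).requires_grad_(True)

def fn(inp, grid):
    return F.grid_sample(inp, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=False)

# gradcheck perturbs each input element by eps, builds a numerical
# jacobian from the change in the outputs, and compares it against the
# analytical jacobian from autograd.
assert gradcheck(fn, (inp, grid), eps=1e-6, atol=1e-4)
```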
nit: remove empty line at top of file?
zou3519 left a comment:
This LGTM. @peterbell10, thank you for going through all of the comments and putting this PR into order; there were a lot of moving parts.
Could you rebase the PR onto master so that we can get some signal from the CI? There was a commit from yesterday that broke CI, but that should have been reverted.
Force-pushed from 913a753 to 29a8131.
facebook-github-bot left a comment:
@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
(FYI - I am still trying to land this PR. Will report back if anything goes wrong.)
Fixes #41656

For the CPU version, this is a regression introduced in #10980, which vectorized the grid_sampler_2d implementation. It uses the AVX2 gather intrinsic, which for float requires 32-bit indexing to match the number of floats in the AVX register. There is an i64gather_ps variant, but it only utilizes half of the vector width, so it would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.

For the CUDA version, this operation has never supported 64-bit indexing, so this isn't a regression. I've templated the kernels on index type and added 64-bit variants, although I gather that in some places a simple TORCH_CHECK(canUse32BitIndexMath(...)) is used instead. So, there is a decision to be made here.
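To make the half-width point concrete: a 256-bit AVX register holds eight 32-bit indices but only four 64-bit indices, so a gather driven by 64-bit indices (the i64gather_ps variant) can fetch at most four floats per instruction, versus eight for the 32-bit variant:

```python
REG_BITS = 256  # width of an AVX/AVX2 register

print(REG_BITS // 32)  # 8 floats per gather with 32-bit indices
print(REG_BITS // 64)  # 4 floats per gather with 64-bit indices
```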