
Conversation

@nkaretnikov (Collaborator) commented Mar 23, 2022

@facebook-github-bot (Contributor) commented Mar 23, 2022

💊 CI failures summary and remediations

As of commit 408b8cd (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-04-01T04:05:38.5421452Z The PR is introduc...m to confirm whether this change is wanted or not.
2022-04-01T04:05:38.5409272Z processing existing schema:  text(__torch__.torch.classes.profiling.SourceRef _0) -> (str _0)
2022-04-01T04:05:38.5410618Z processing existing schema:  count(__torch__.torch.classes.profiling.InstructionStats _0) -> (int _0)
2022-04-01T04:05:38.5411606Z processing existing schema:  duration_ns(__torch__.torch.classes.profiling.InstructionStats _0) -> (int _0)
2022-04-01T04:05:38.5412922Z processing existing schema:  source(__torch__.torch.classes.profiling.SourceStats _0) -> (__torch__.torch.classes.profiling.SourceRef _0)
2022-04-01T04:05:38.5414548Z processing existing schema:  line_map(__torch__.torch.classes.profiling.SourceStats _0) -> (Dict(int, __torch__.torch.classes.profiling.InstructionStats) _0)
2022-04-01T04:05:38.5415745Z processing existing schema:  __init__(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-04-01T04:05:38.5416875Z processing existing schema:  enable(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-04-01T04:05:38.5418092Z processing existing schema:  disable(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-04-01T04:05:38.5419779Z processing existing schema:  _dump_stats(__torch__.torch.classes.profiling._ScriptProfile _0) -> (__torch__.torch.classes.profiling.SourceStats[] _0)
2022-04-01T04:05:38.5421202Z processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (NoneType _0)
2022-04-01T04:05:38.5421452Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2022-04-01T04:05:38.5421469Z 
2022-04-01T04:05:38.5421559Z Broken ops: [
2022-04-01T04:05:38.5421865Z 	aten::nested_tensor(Tensor[] list, int? dtype=None, int? layout=None, Device? device=None, bool? pin_memory=None) -> (Tensor)
2022-04-01T04:05:38.5422039Z 	aten::special_log_ndtr(Tensor self) -> (Tensor)
2022-04-01T04:05:38.5422257Z 	aten::special_log_ndtr.out(Tensor self, *, Tensor(a!) out) -> (Tensor(a!))
2022-04-01T04:05:38.5422342Z ]
2022-04-01T04:05:38.6235201Z + cleanup
2022-04-01T04:05:38.6235285Z + retcode=1
2022-04-01T04:05:38.6235366Z + set +x
2022-04-01T04:05:38.6274099Z ##[error]Process completed with exit code 1.

See GitHub Actions build pull / win-vs2019-cuda11.3-py3 / build (2/3)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-04-01T04:04:17.6431193Z CMake Error: Error...ariable not set, cmake may not be built correctly.
2022-04-01T04:04:17.1908519Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.2017489Z CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
2022-04-01T04:04:17.2017772Z Missing variable is:
2022-04-01T04:04:17.2017981Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.2118839Z CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
2022-04-01T04:04:17.2119094Z Missing variable is:
2022-04-01T04:04:17.2119296Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.2169280Z CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
2022-04-01T04:04:17.2169544Z Missing variable is:
2022-04-01T04:04:17.2169744Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.6431193Z CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
2022-04-01T04:04:17.6431524Z Missing variable is:
2022-04-01T04:04:17.6431742Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.7866931Z -- Generating done
2022-04-01T04:04:17.9073090Z CMake Warning:
2022-04-01T04:04:17.9073424Z   Manually-specified variables were not used by the project:
2022-04-01T04:04:17.9073593Z 
2022-04-01T04:04:17.9073674Z     BUILD_ENVIRONMENT
2022-04-01T04:04:17.9073832Z     BUILD_TYPE
2022-04-01T04:04:17.9073998Z     BUILD_WHEEL
2022-04-01T04:04:17.9074097Z 

See GitHub Actions build pull / linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu) (3/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-04-01T05:28:15.3625765Z RuntimeError: test_reductions failed! Received signal: SIGIOT
2022-04-01T05:28:14.8615460Z   test_mean_dim_cuda (__main__.TestReductionsCUDA) ... skip: Only runs on cpu (0.001s)
2022-04-01T05:28:14.8759476Z   test_median_corner_cases_cuda (__main__.TestReductionsCUDA) ... ok (0.014s)
2022-04-01T05:28:15.0712443Z   test_median_nan_values_cuda_float16 (__main__.TestReductionsCUDA) ... Memory exception on virtual address 0x7f36d87d8000, node id 4 : Page not present
2022-04-01T05:28:15.0717623Z Address does not belong to a known buffer
2022-04-01T05:28:15.0719765Z Memory access fault by GPU node-4 (Agent handle: 0x55872701a210) on address 0x7f36d87d8000. Reason: Page not present or supervisor privilege.
2022-04-01T05:28:15.3615833Z Traceback (most recent call last):
2022-04-01T05:28:15.3616701Z   File "test/run_test.py", line 1054, in <module>
2022-04-01T05:28:15.3617486Z     main()
2022-04-01T05:28:15.3618170Z   File "test/run_test.py", line 1032, in main
2022-04-01T05:28:15.3624857Z     raise RuntimeError(err_message)
2022-04-01T05:28:15.3625765Z RuntimeError: test_reductions failed! Received signal: SIGIOT
2022-04-01T05:28:17.0372326Z 
2022-04-01T05:28:17.0372860Z real	50m24.842s
2022-04-01T05:28:17.0373538Z user	100m16.222s
2022-04-01T05:28:17.0374142Z sys	54m58.532s
2022-04-01T05:28:17.0374713Z + cleanup
2022-04-01T05:28:17.0375246Z + retcode=1
2022-04-01T05:28:17.0375796Z + set +x
2022-04-01T05:28:17.0508484Z ##[error]Process completed with exit code 1.
2022-04-01T05:28:17.0636253Z Prepare all required actions
2022-04-01T05:28:17.0636815Z Getting action download info

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


nkaretnikov pushed a commit that referenced this pull request Mar 23, 2022
Fixes #73187.

ghstack-source-id: 30f7a27
Pull Request resolved: #74635
nkaretnikov added the module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError), module: nn (Related to torch.nn), and high priority labels Mar 23, 2022
nkaretnikov requested a review from albanD March 23, 2022 20:30
@nkaretnikov (Collaborator Author):

(putting @albanD in reviewers for visibility, feel free to re-assign)

@nkaretnikov (Collaborator Author):

Basically, the idea here is this: move the checks from the top-level grid_sampler function into a separate header/cpp file so they can be used/linked from both CPU and CUDA code, and use said functions in all grid_sampler* kernels. Added some basic tests to make sure things no longer crash when you pass in weird shapes.
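
As a rough sketch of that idea (the function name and the exact messages below are illustrative, not necessarily the ones added by this PR), the shared check could look something like this:

// Sketch only: name and messages are hypothetical.
#include <ATen/core/TensorBase.h>
#include <c10/util/Exception.h>

namespace at {
namespace native {

// Intended to be called at the top of every grid_sampler* kernel (CPU and
// CUDA) so that malformed shapes fail with a readable TORCH_CHECK error
// instead of crashing inside the kernel.
void check_grid_sampler_shapes_sketch(
    const TensorBase& input,
    const TensorBase& grid) {
  TORCH_CHECK(
      input.dim() == 4 || input.dim() == 5,
      "grid_sampler(): expected 4D or 5D input, but got input with sizes ",
      input.sizes());
  TORCH_CHECK(
      input.dim() == grid.dim(),
      "grid_sampler(): expected grid and input to have the same number of "
      "dimensions, but got input with sizes ", input.sizes(),
      " and grid with sizes ", grid.sizes());
  TORCH_CHECK(
      input.size(0) == grid.size(0),
      "grid_sampler(): expected grid and input to have same batch size, "
      "but got input with sizes ", input.sizes(),
      " and grid with sizes ", grid.sizes());
}

} // namespace native
} // namespace at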

@nkaretnikov (Collaborator Author):

The changes to error messages in the old files are due to moving/splitting the checks for the 4D/5D cases.

@albanD (Collaborator) left a comment:

Thanks for the cleanup, it looks quite good. High-level questions:

  • Are the CI failures relevant?
  • It would be much simpler if the utils functions were only in the header (no .cpp file). That would remove the need to update the build systems. Do you think that would be a better way to go here to make this change simpler?

@nkaretnikov (Collaborator Author):

@albanD I'll probably look at the CI failures tomorrow, but to answer your second question: I think the current way is actually the easiest. I also felt uncomfortable about adding new files, but I think you just have to do it, for the following reasons:

  • To support incremental (or whatever the proper term is) builds, that is, not depending on native_functions.yaml, as mentioned here:
// See NOTE: [Tensor vs. TensorBase]
// https://github.com/pytorch/pytorch/pull/66979

you need to avoid exposing ATen/ATen.h in the header and avoid using types/headers that may pull it in, which is why I use TensorBase instead of Tensor and int64_t instead of GridSamplerInterpolation in the header. I did have to move the function bodies out of the header because the implementations need both of those types.

#define TORCH_ASSERT_NO_OPERATORS
#include <ATen/native/cuda/GridSampler.h>
#include <ATen/native/GridSamplerUtils.h>

In files like the one in the snippet above, I'd need to include the header before the assert (or it won't compile), which would undermine the work done by Peter in the PR I linked above (#66979). So I simply don't see any other way to do it.
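
To make the header/.cpp split concrete, here is a minimal sketch under the constraints described above (file and function names are hypothetical, and a stand-in enum replaces the real GridSamplerInterpolation to keep the sketch self-contained): the header only mentions TensorBase and int64_t, while the .cpp is free to use the richer types.

// --- grid_sampler_checks_sketch.h (hypothetical) ---------------------------
// Only TensorBase and int64_t appear here, so the header never drags in
// ATen/ATen.h or the header that defines the interpolation enum.
#pragma once
#include <cstdint>
#include <ATen/core/TensorBase.h>

namespace at {
namespace native {
void check_grid_sampler_sketch(
    const TensorBase& input,
    const TensorBase& grid,
    int64_t interpolation_mode);  // the enum is passed as a plain integer
} // namespace native
} // namespace at

// --- grid_sampler_checks_sketch.cpp (hypothetical) -------------------------
#include "grid_sampler_checks_sketch.h"  // hypothetical path
#include <c10/util/Exception.h>

namespace at {
namespace native {

// Stand-in for GridSamplerInterpolation, purely for this sketch.
enum class InterpolationSketch : int64_t { Bilinear = 0, Nearest = 1, Bicubic = 2 };

void check_grid_sampler_sketch(
    const TensorBase& input,
    const TensorBase& grid,
    int64_t interpolation_mode) {
  const auto mode = static_cast<InterpolationSketch>(interpolation_mode);
  TORCH_CHECK(
      mode != InterpolationSketch::Bicubic || input.dim() == 4,
      "grid_sampler(): bicubic interpolation only supports 4D input, "
      "but got input with sizes ", input.sizes());
  (void)grid;  // the remaining shape checks are omitted in this sketch
}

} // namespace native
} // namespace at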

I did try the header-only approach as well; I even have a WIP branch with that work already. I just need to double-check that it actually works, and then I should be able to share it. But given the things I mentioned above, should I?

cc @peterbell10

nkaretnikov pushed a commit that referenced this pull request Mar 26, 2022
Fixes #73187.

ghstack-source-id: cfac8ca
Pull Request resolved: #74635
@nkaretnikov (Collaborator Author):

Pushed a small update to fix the linter issue and the ROCm builds. Not sure if it fixes everything because I can only do CUDA builds locally at the moment. Let's see what CI thinks.

nkaretnikov pushed a commit that referenced this pull request Mar 28, 2022
Fixes #73187.

ghstack-source-id: 3c7609a
Pull Request resolved: #74635
@nkaretnikov (Collaborator Author) commented Mar 28, 2022

Pushed an update implementing the header-only approach -- no build system changes. Let's see what CI thinks.
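
For reference, a minimal sketch of what "header-only" means here (same hypothetical names as in the earlier sketch, signature simplified): the definition is marked inline and lives entirely in the header, so no new .cpp file has to be wired into the build system, and the header can still be included from files that define TORCH_ASSERT_NO_OPERATORS.

// --- grid_sampler_checks_sketch.h, header-only variant (hypothetical) ------
#pragma once
#include <ATen/core/TensorBase.h>
#include <c10/util/Exception.h>

namespace at {
namespace native {

inline void check_grid_sampler_sketch(
    const TensorBase& input,
    const TensorBase& grid) {
  TORCH_CHECK(
      input.size(0) == grid.size(0),
      "grid_sampler(): expected grid and input to have same batch size, "
      "but got input with sizes ", input.sizes(),
      " and grid with sizes ", grid.sizes());
}

} // namespace native
} // namespace at

// --- usage from a kernel file (sketch) --------------------------------------
// #define TORCH_ASSERT_NO_OPERATORS
// #include "grid_sampler_checks_sketch.h"
// ...
// at::native::check_grid_sampler_sketch(input, grid);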

@nkaretnikov (Collaborator Author):

I wonder if I should write more test cases so that every TORCH_CHECK is triggered, but that might be tricky and I'm not sure it's worth it. Thoughts?

@peterbell10 (Collaborator) left a comment:

The TensorBase-related stuff looks good to me. I didn't check the grid_sampler changes too closely, but the tests are broken at the moment.

@albanD (Collaborator) left a comment:

Looks good!
Thanks for taking the time to update the code to work well with TensorMeta!

@albanD (Collaborator) commented Mar 29, 2022

The newly added test is failing in CI, though.

@nkaretnikov (Collaborator Author):

I'll take a look at the failures and follow up later.

nkaretnikov requested a review from bdhirsh as a code owner March 29, 2022 21:13
nkaretnikov pushed a commit that referenced this pull request Mar 29, 2022
Fixes #73187.

ghstack-source-id: 3c88d89
Pull Request resolved: #74635
@nkaretnikov (Collaborator Author):

Couldn't make sense of the CI errors, so I rebased and pushed. Let's wait for CI again and see if this changes anything.

@nkaretnikov (Collaborator Author):

Will fix this ROCm failure once CI is done:

======================================================================
FAIL [0.023s]: test_invalid_shapes_grid_sampler_cuda (__main__.TestTorchDeviceTypeCUDA)
----------------------------------------------------------------------
RuntimeError: cudnn_grid_sampler_forward: ATen not compiled with cuDNN support

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1780, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 376, in instantiated_test
    result = test(self, **param_kwargs)
  File "test_torch.py", line 1448, in test_invalid_shapes_grid_sampler
    torch.cudnn_grid_sampler(input, grid)
AssertionError: "expected grid and input to have same batch size" does not match "cudnn_grid_sampler_forward: ATen not compiled with cuDNN support"

nkaretnikov pushed a commit that referenced this pull request Mar 30, 2022
Fixes #73187.

ghstack-source-id: 5da3885
Pull Request resolved: #74635
@nkaretnikov (Collaborator Author):

Pushed a fix for the ROCm failure.

@nkaretnikov (Collaborator Author):

Does CI need to be updated?

https://github.com/pytorch/pytorch/runs/5761511611

W: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64  Release' does not have a Release file.
W: The repository 'https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64  Release' does not have a Release file.
E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/Packages  404  Not Found
E: Failed to fetch https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/Packages  404  Not Found

@albanD (Collaborator) commented Mar 31, 2022

The HUD (https://hud.pytorch.org/) is a good place to check CI status. There is currently a SEV there, meaning that something is broken on their end, yes.

@albanD (Collaborator) commented Mar 31, 2022

@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@nkaretnikov (Collaborator Author) commented Apr 1, 2022

Okay, it's essential that that one test is fixed/working before this can be merged, because it's the one I'm trying to repro (see the other PR in the stack for discussion on this).

Update: never mind, I replied with the wrong stack in mind. But that test is still pretty important.

@nkaretnikov (Collaborator Author):

I've just triggered a re-run because it looks like this is fixed on the HUD(?). Not sure if I need to rebase again, though.

@albanD (Collaborator) commented Apr 1, 2022

The new CI run looks good. So I'm going to try to land as-is. Will let you know if this needs a rebase.

facebook-github-bot pushed a commit that referenced this pull request Apr 1, 2022
Summary:
Pull Request resolved: #74635

Fixes #73187.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D35284563

Pulled By: albanD

fbshipit-source-id: 1477c506b8755d864ca902ee140bee7bdb0069b0
@github-actions bot (Contributor) commented Apr 1, 2022

Hey @nkaretnikov.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@albanD (Collaborator) commented Apr 1, 2022

The CUDA test failures were actually real and this broke master: https://github.com/pytorch/pytorch/runs/5789789694?check_suite_focus=true. Reverting.

@facebook-github-bot (Contributor):

This pull request has been reverted by 0ce02ea. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).

Labels

cla signed, high priority, module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError), module: nn (Related to torch.nn), open source, Reverted
