
Conversation

@nkaretnikov (Collaborator) commented Mar 23, 2022

@facebook-github-bot (Contributor) commented Mar 23, 2022

💊 CI failures summary and remediations

As of commit 408b8cd (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-04-01T04:05:38.5421452Z The PR is introduc...m to confirm whether this change is wanted or not.
2022-04-01T04:05:38.5409272Z processing existing schema:  text(__torch__.torch.classes.profiling.SourceRef _0) -> (str _0)
2022-04-01T04:05:38.5410618Z processing existing schema:  count(__torch__.torch.classes.profiling.InstructionStats _0) -> (int _0)
2022-04-01T04:05:38.5411606Z processing existing schema:  duration_ns(__torch__.torch.classes.profiling.InstructionStats _0) -> (int _0)
2022-04-01T04:05:38.5412922Z processing existing schema:  source(__torch__.torch.classes.profiling.SourceStats _0) -> (__torch__.torch.classes.profiling.SourceRef _0)
2022-04-01T04:05:38.5414548Z processing existing schema:  line_map(__torch__.torch.classes.profiling.SourceStats _0) -> (Dict(int, __torch__.torch.classes.profiling.InstructionStats) _0)
2022-04-01T04:05:38.5415745Z processing existing schema:  __init__(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-04-01T04:05:38.5416875Z processing existing schema:  enable(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-04-01T04:05:38.5418092Z processing existing schema:  disable(__torch__.torch.classes.profiling._ScriptProfile _0) -> (NoneType _0)
2022-04-01T04:05:38.5419779Z processing existing schema:  _dump_stats(__torch__.torch.classes.profiling._ScriptProfile _0) -> (__torch__.torch.classes.profiling.SourceStats[] _0)
2022-04-01T04:05:38.5421202Z processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (NoneType _0)
2022-04-01T04:05:38.5421452Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2022-04-01T04:05:38.5421469Z 
2022-04-01T04:05:38.5421559Z Broken ops: [
2022-04-01T04:05:38.5421865Z 	aten::nested_tensor(Tensor[] list, int? dtype=None, int? layout=None, Device? device=None, bool? pin_memory=None) -> (Tensor)
2022-04-01T04:05:38.5422039Z 	aten::special_log_ndtr(Tensor self) -> (Tensor)
2022-04-01T04:05:38.5422257Z 	aten::special_log_ndtr.out(Tensor self, *, Tensor(a!) out) -> (Tensor(a!))
2022-04-01T04:05:38.5422342Z ]
2022-04-01T04:05:38.6235201Z + cleanup
2022-04-01T04:05:38.6235285Z + retcode=1
2022-04-01T04:05:38.6235366Z + set +x
2022-04-01T04:05:38.6274099Z ##[error]Process completed with exit code 1.

See GitHub Actions build pull / win-vs2019-cuda11.3-py3 / build (2/3)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-04-01T04:04:17.6431193Z CMake Error: Error...ariable not set, cmake may not be built correctly.
2022-04-01T04:04:17.1908519Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.2017489Z CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
2022-04-01T04:04:17.2017772Z Missing variable is:
2022-04-01T04:04:17.2017981Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.2118839Z CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
2022-04-01T04:04:17.2119094Z Missing variable is:
2022-04-01T04:04:17.2119296Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.2169280Z CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
2022-04-01T04:04:17.2169544Z Missing variable is:
2022-04-01T04:04:17.2169744Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.6431193Z CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
2022-04-01T04:04:17.6431524Z Missing variable is:
2022-04-01T04:04:17.6431742Z CMAKE_CUDA_COMPILE_WHOLE_COMPILATION
2022-04-01T04:04:17.7866931Z -- Generating done
2022-04-01T04:04:17.9073090Z CMake Warning:
2022-04-01T04:04:17.9073424Z   Manually-specified variables were not used by the project:
2022-04-01T04:04:17.9073593Z 
2022-04-01T04:04:17.9073674Z     BUILD_ENVIRONMENT
2022-04-01T04:04:17.9073832Z     BUILD_TYPE
2022-04-01T04:04:17.9073998Z     BUILD_WHEEL
2022-04-01T04:04:17.9074097Z 

See GitHub Actions build pull / linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu) (3/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-04-01T05:28:15.3625765Z RuntimeError: test_reductions failed! Received signal: SIGIOT
2022-04-01T05:28:14.8615460Z   test_mean_dim_cuda (__main__.TestReductionsCUDA) ... skip: Only runs on cpu (0.001s)
2022-04-01T05:28:14.8759476Z   test_median_corner_cases_cuda (__main__.TestReductionsCUDA) ... ok (0.014s)
2022-04-01T05:28:15.0712443Z   test_median_nan_values_cuda_float16 (__main__.TestReductionsCUDA) ... Memory exception on virtual address 0x7f36d87d8000, node id 4 : Page not present
2022-04-01T05:28:15.0717623Z Address does not belong to a known buffer
2022-04-01T05:28:15.0719765Z Memory access fault by GPU node-4 (Agent handle: 0x55872701a210) on address 0x7f36d87d8000. Reason: Page not present or supervisor privilege.
2022-04-01T05:28:15.3615833Z Traceback (most recent call last):
2022-04-01T05:28:15.3616701Z   File "test/run_test.py", line 1054, in <module>
2022-04-01T05:28:15.3617486Z     main()
2022-04-01T05:28:15.3618170Z   File "test/run_test.py", line 1032, in main
2022-04-01T05:28:15.3624857Z     raise RuntimeError(err_message)
2022-04-01T05:28:15.3625765Z RuntimeError: test_reductions failed! Received signal: SIGIOT
2022-04-01T05:28:17.0372326Z 
2022-04-01T05:28:17.0372860Z real	50m24.842s
2022-04-01T05:28:17.0373538Z user	100m16.222s
2022-04-01T05:28:17.0374142Z sys	54m58.532s
2022-04-01T05:28:17.0374713Z + cleanup
2022-04-01T05:28:17.0375246Z + retcode=1
2022-04-01T05:28:17.0375796Z + set +x
2022-04-01T05:28:17.0508484Z ##[error]Process completed with exit code 1.
2022-04-01T05:28:17.0636253Z Prepare all required actions
2022-04-01T05:28:17.0636815Z Getting action download info

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


nkaretnikov pushed a commit that referenced this pull request Mar 23, 2022
Fixes #73187.

ghstack-source-id: 30f7a27
Pull Request resolved: #74635
nkaretnikov added the module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError), module: nn (Related to torch.nn), and high priority labels Mar 23, 2022
nkaretnikov requested a review from albanD March 23, 2022 20:30
@nkaretnikov (Collaborator Author):

(putting @albanD in reviewers for visibility, feel free to re-assign)

@nkaretnikov (Collaborator Author):

Basically, the idea here is this: move the checks from the top-level grid_sampler function into a separate header/cpp file so they can be used/linked from both CPU and CUDA code, and use said functions in all grid_sampler* kernels. Added some basic tests to make sure things no longer crash when you pass in weird shapes.
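
As a rough sketch of that idea (the function name and the exact messages below are illustrative, not necessarily the ones added by this PR), the shared check could look something like this:

// Sketch only: name and messages are hypothetical.
#include <ATen/core/TensorBase.h>
#include <c10/util/Exception.h>

namespace at {
namespace native {

// Intended to be called at the top of every grid_sampler* kernel (CPU and
// CUDA) so that malformed shapes fail with a readable TORCH_CHECK error
// instead of crashing inside the kernel.
void check_grid_sampler_shapes_sketch(
    const TensorBase& input,
    const TensorBase& grid) {
  TORCH_CHECK(
      input.dim() == 4 || input.dim() == 5,
      "grid_sampler(): expected 4D or 5D input, but got input with sizes ",
      input.sizes());
  TORCH_CHECK(
      input.dim() == grid.dim(),
      "grid_sampler(): expected grid and input to have the same number of "
      "dimensions, but got input with sizes ", input.sizes(),
      " and grid with sizes ", grid.sizes());
  TORCH_CHECK(
      input.size(0) == grid.size(0),
      "grid_sampler(): expected grid and input to have same batch size, "
      "but got input with sizes ", input.sizes(),
      " and grid with sizes ", grid.sizes());
}

} // namespace native
} // namespace at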

@nkaretnikov (Collaborator Author):

The changes to error messages in the old files are due to moving/splitting the checks for the 4D/5D cases.

@albanD (Collaborator) left a comment:

Thanks for the cleanup, it looks quite good. High-level questions:

  • Are the CI failures relevant?
  • It would be much simpler if the utils functions were only in the header (no .cpp file). That would remove the need to update the build systems. Do you think that would be a better way to go here to make this change simpler?

@nkaretnikov (Collaborator Author):

@albanD I'll probably look at the CI failures tomorrow, but to answer your second question: I think the current way is actually the easiest. I also felt uncomfortable about adding new files, but I think you just have to do it, for the following reasons:

  • To support incremental (or whatever the proper term is) builds, that is, not depending on native_functions.yaml, as mentioned here:
// See NOTE: [Tensor vs. TensorBase]
// https://github.com/pytorch/pytorch/pull/66979

you need to avoid exposing ATen/ATen.h in the header and avoid using types/headers that may pull it in, which is why I use TensorBase instead of Tensor and int64_t instead of GridSamplerInterpolation in the header. I did have to move the function bodies out of the header because the implementations need both of those types.

#define TORCH_ASSERT_NO_OPERATORS
#include <ATen/native/cuda/GridSampler.h>
#include <ATen/native/GridSamplerUtils.h>

In files like the one in the snippet above, I'd need to include the header before the assert (or it won't compile), which would undermine the work done by Peter in the PR I linked above (#66979). So I simply don't see any other way to do it.
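
To make the header/.cpp split concrete, here is a minimal sketch under the constraints described above (file and function names are hypothetical, and a stand-in enum replaces the real GridSamplerInterpolation to keep the sketch self-contained): the header only mentions TensorBase and int64_t, while the .cpp is free to use the richer types.

// --- grid_sampler_checks_sketch.h (hypothetical) ---------------------------
// Only TensorBase and int64_t appear here, so the header never drags in
// ATen/ATen.h or the header that defines the interpolation enum.
#pragma once
#include <cstdint>
#include <ATen/core/TensorBase.h>

namespace at {
namespace native {
void check_grid_sampler_sketch(
    const TensorBase& input,
    const TensorBase& grid,
    int64_t interpolation_mode);  // the enum is passed as a plain integer
} // namespace native
} // namespace at

// --- grid_sampler_checks_sketch.cpp (hypothetical) -------------------------
#include "grid_sampler_checks_sketch.h"  // hypothetical path
#include <c10/util/Exception.h>

namespace at {
namespace native {

// Stand-in for GridSamplerInterpolation, purely for this sketch.
enum class InterpolationSketch : int64_t { Bilinear = 0, Nearest = 1, Bicubic = 2 };

void check_grid_sampler_sketch(
    const TensorBase& input,
    const TensorBase& grid,
    int64_t interpolation_mode) {
  const auto mode = static_cast<InterpolationSketch>(interpolation_mode);
  TORCH_CHECK(
      mode != InterpolationSketch::Bicubic || input.dim() == 4,
      "grid_sampler(): bicubic interpolation only supports 4D input, "
      "but got input with sizes ", input.sizes());
  (void)grid;  // the remaining shape checks are omitted in this sketch
}

} // namespace native
} // namespace at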

I did try the header-only approach as well; I even have a WIP branch with that work already. I just need to double-check that it actually works, and then I should be able to share it. But given the things I mentioned above, should I?

cc @peterbell10

nkaretnikov pushed a commit that referenced this pull request Mar 26, 2022
Fixes #73187.

ghstack-source-id: cfac8ca
Pull Request resolved: #74635
@nkaretnikov (Collaborator Author):

Pushed a small update to fix the linter issue and the ROCm builds. Not sure if it fixes everything because I can only do CUDA builds locally at the moment. Let's see what CI thinks.

nkaretnikov pushed a commit that referenced this pull request Mar 28, 2022
Fixes #73187.

ghstack-source-id: 3c7609a
Pull Request resolved: #74635
@nkaretnikov (Collaborator Author) commented Mar 28, 2022

Pushed an update implementing the header-only approach -- no build system changes. Let's see what CI thinks.
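
For reference, a minimal sketch of what "header-only" means here (same hypothetical names as in the earlier sketch, signature simplified): the definition is marked inline and lives entirely in the header, so no new .cpp file has to be wired into the build system, and the header can still be included from files that define TORCH_ASSERT_NO_OPERATORS.

// --- grid_sampler_checks_sketch.h, header-only variant (hypothetical) ------
#pragma once
#include <ATen/core/TensorBase.h>
#include <c10/util/Exception.h>

namespace at {
namespace native {

inline void check_grid_sampler_sketch(
    const TensorBase& input,
    const TensorBase& grid) {
  TORCH_CHECK(
      input.size(0) == grid.size(0),
      "grid_sampler(): expected grid and input to have same batch size, "
      "but got input with sizes ", input.sizes(),
      " and grid with sizes ", grid.sizes());
}

} // namespace native
} // namespace at

// --- usage from a kernel file (sketch) --------------------------------------
// #define TORCH_ASSERT_NO_OPERATORS
// #include "grid_sampler_checks_sketch.h"
// ...
// at::native::check_grid_sampler_sketch(input, grid);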

@nkaretnikov (Collaborator Author):

I wonder if I should write more test cases so that every TORCH_CHECK is triggered, but that might be tricky and I'm not sure it's worth it. Thoughts?

@peterbell10 (Collaborator) left a comment:

The TensorBase-related stuff looks good to me. I didn't check the grid_sampler changes too closely, but the tests are broken at the moment.

@albanD (Collaborator) left a comment:

Looks good!
Thanks for taking the time to update the code to work well with TensorMeta!

@albanD (Collaborator) commented Mar 29, 2022

The newly added test is failing in CI, though.

@nkaretnikov (Collaborator Author):

I'll take a look at the failures and follow up later.

nkaretnikov requested a review from bdhirsh as a code owner March 29, 2022 21:13
nkaretnikov pushed a commit that referenced this pull request Mar 29, 2022
Fixes #73187.

ghstack-source-id: 3c88d89
Pull Request resolved: #74635
@nkaretnikov (Collaborator Author):

Couldn't make sense of the CI errors, so I rebased and pushed. Let's wait for CI again and see if this changes anything.

@nkaretnikov (Collaborator Author):

Will fix this ROCm failure once CI is done:

======================================================================
FAIL [0.023s]: test_invalid_shapes_grid_sampler_cuda (__main__.TestTorchDeviceTypeCUDA)
----------------------------------------------------------------------
RuntimeError: cudnn_grid_sampler_forward: ATen not compiled with cuDNN support

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1780, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 376, in instantiated_test
    result = test(self, **param_kwargs)
  File "test_torch.py", line 1448, in test_invalid_shapes_grid_sampler
    torch.cudnn_grid_sampler(input, grid)
AssertionError: "expected grid and input to have same batch size" does not match "cudnn_grid_sampler_forward: ATen not compiled with cuDNN support"

nkaretnikov pushed a commit that referenced this pull request Mar 30, 2022
Fixes #73187.

ghstack-source-id: 5da3885
Pull Request resolved: #74635
@nkaretnikov (Collaborator Author):

Pushed a fix for the ROCm failure.

@nkaretnikov (Collaborator Author):

Does CI need to be updated?

https://github.com/pytorch/pytorch/runs/5761511611

W: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64  Release' does not have a Release file.
W: The repository 'https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64  Release' does not have a Release file.
E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/Packages  404  Not Found
E: Failed to fetch https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/Packages  404  Not Found

@albanD (Collaborator) commented Mar 31, 2022

The HUD (https://hud.pytorch.org/) is a good place to check CI status. There is currently a SEV there, meaning that something is broken on their end, yes.

@albanD (Collaborator) commented Mar 31, 2022

@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@nkaretnikov (Collaborator Author) commented Apr 1, 2022

Okay, it's essential that that one test is fixed/working before this can be merged, because it's the one I'm trying to repro (see the other PR in the stack for discussion on this).

Update: never mind, I replied with the wrong stack in mind. But that test is still pretty important.

@nkaretnikov (Collaborator Author):

I've just triggered a re-run because it looks like this is fixed on the HUD(?). Not sure if I need to rebase again, though.

@albanD (Collaborator) commented Apr 1, 2022

The new CI run looks good. So I'm going to try to land as-is. Will let you know if this needs a rebase.

facebook-github-bot pushed a commit that referenced this pull request Apr 1, 2022
Summary:
Pull Request resolved: #74635

Fixes #73187.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D35284563

Pulled By: albanD

fbshipit-source-id: 1477c506b8755d864ca902ee140bee7bdb0069b0
@github-actions bot (Contributor) commented Apr 1, 2022

Hey @nkaretnikov.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@albanD (Collaborator) commented Apr 1, 2022

The CUDA test failures were actually real and this broke master: https://github.com/pytorch/pytorch/runs/5789789694?check_suite_focus=true. Reverting.

@facebook-github-bot (Contributor):

This pull request has been reverted by 0ce02ea. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).

Labels

cla signed, high priority, module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError), module: nn (Related to torch.nn), open source, Reverted
