Conversation

@mcarilli (Collaborator) commented Jul 31, 2020

Should close #36428.

The cudnn RNN API expects weights to occupy a flat buffer in memory with a particular layout. This PR implements a "speed of light" fix: _cudnn_rnn_cast_reflatten (the autocast wrapper assigned to _cudnn_rnn) copies weights to the right slices of a flat FP16 buffer with a single read/write per weight (as opposed to casting them to FP16 individually then reflattening the individual FP16 weights, which would require 2 read/writes per weight).
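To make the "single read/write per weight" idea concrete, here is a minimal sketch of the pattern (illustrative ATen code, not the PR's actual implementation; the real wrapper also has to lay the slices out in the exact order and shapes cudnn's flat-buffer format expects):

```cpp
// Sketch only: cast FP32 weights into views of one flat FP16 buffer,
// touching each weight exactly once (fused cast + copy via copy_).
#include <ATen/ATen.h>
#include <vector>

std::vector<at::Tensor> cast_to_flat_fp16_views(at::TensorList weights) {
  // Assumes weights is nonempty and all weights share a device.
  int64_t total = 0;
  for (const auto& w : weights) {
    total += w.numel();
  }
  // One flat FP16 buffer on the weights' device.
  at::Tensor flat_buf =
      at::empty({total}, weights[0].options().dtype(at::kHalf));
  std::vector<at::Tensor> views;
  int64_t offset = 0;
  for (const auto& w : weights) {
    at::Tensor view = flat_buf.narrow(0, offset, w.numel()).view(w.sizes());
    view.copy_(w);  // one read of the FP32 weight, one FP16 write
    views.push_back(view);
    offset += w.numel();
  }
  return views;  // views alias flat_buf, which cudnn can consume directly
}
```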

It isn't pretty but IMO it doesn't make rnn bindings much more tortuous than they already are.

The test tries a forward pass under autocast and a backward pass for the full cross product of RNN options and input/weight/hidden dtypes. As with all FP16-list autocast tests, forward output and backward grads are checked against a control where inputs (including the RNN module weights in this case) are pre-cast to FP16 on the Python side.

Not sure who to ask for review; tagging @ezyang and @ngimel because Ed wrote this file (almost 2 years ago) and Natalia did the most recent major surgery.

Side quests discovered:

  • Should we update persistent RNN heuristics to include compute capability 8.0? Could be another PR but seems easy enough to include.
  • Many (maybe all?!) of the raw cudnn API calls in RNN.cpp are deprecated in cudnn 8. I don't mind taking the action item to update them since my mental cache is full of rnn stuff, but that would be a substantial separate PR.

@dr-ci bot commented Jul 31, 2020

💊 CI failures summary and remediations

As of commit 8cf8c98 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@mcarilli mcarilli requested review from ezyang and ngimel August 3, 2020 23:23
@ezyang (Contributor) commented Aug 4, 2020

oh god I probably should review this XD

@mcarilli (Collaborator, Author) commented Aug 4, 2020

> oh god I probably should review this XD

Thanks for volunteering 🚪🔥 Abandon hope, all ye who enter here. Given the complexity of the code and the small number of people who have safaried through it, I made the test exhaustive to keep us on track. The best thing I can say about my implementation is "if it's stupid and it works, it ain't stupid."

@ngimel (Collaborator) left a comment:
Sorry, can't continue today.

#include <ATen/cuda/CUDAConfig.h>

#if AT_CUDNN_ENABLED()
#include <ATen/native/cudnn/RNNUtils.h>
A Contributor commented:
Danger, danger, Will Robinson! ATen/autocast_mode.cpp is compiled as part of ATen_cpu and it should not access any headers in the CUDA directory. You will probably have to chuck these autocast wrappers in a separate file in ATen/cuda.

@mcarilli (Collaborator, Author) commented Aug 6, 2020:

Hmmm, but if I move my wrapper (along with the above include) to an inline function in a header in ATen/cuda, then include that header in autocast_mode.cpp, it seems like I'm no better off.

@mcarilli (Collaborator, Author) commented Aug 6, 2020:

Do you mean I should put my wrapper (_cudnn_rnn_cast_reflatten) declaration+definition in, say, ATen/cuda/AutocastRNN.h+.cpp,
have ATen/autocast_mode.cpp include AutocastRNN.h (or forward declare _cudnn_rnn_cast_reflatten),
and have ATen/cuda/AutocastRNN.cpp be the thing that includes ATen/native/cudnn/RNNUtils.h?
That would mean ATen/native/cudnn/RNNUtils.h doesn't wind up directly included in autocast_mode.cpp. Or am I misinterpreting?

@mcarilli (Collaborator, Author) commented Aug 7, 2020:

Also, are the files in ATen/cuda compiled into a separate library from ATen_cpu? If so, and I do the above, should I declare _cudnn_rnn_cast_reflatten with TORCH_CUDA_API?

Would doing the above using ATen/cudnn instead of ATen/cuda to host my files be equally valid? If so, I think cudnn rather than cuda is a more appropriate home for the RNN wrapper.

A Contributor commented:

You should do the registration inside the cpp file in cuda. Then you do not need to include the header from ATen/autocast_mode.cpp.
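A self-contained sketch of that pattern, using relu as a stand-in op (file name and function are illustrative; the PR's actual registration targets aten::_cudnn_rnn with its full signature):

```cpp
// Hypothetical ATen/cudnn/AutocastRNN.cpp: everything cudnn-related stays in
// this translation unit, so ATen/autocast_mode.cpp never needs the header.
#include <ATen/ATen.h>
#include <c10/core/impl/LocalDispatchKeySet.h>
#include <torch/library.h>

namespace {

// Stand-in for _cudnn_rnn_cast_reflatten: cast inputs, then redispatch.
at::Tensor toy_autocast_relu(const at::Tensor& self) {
  // Exclude the Autocast key so the call below doesn't loop back here.
  c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
  return at::relu(self.to(at::kHalf));
}

} // anonymous namespace

// Registration lives entirely inside this .cpp file; no declaration needs
// to be visible anywhere else.
TORCH_LIBRARY_IMPL(aten, Autocast, m) {
  m.impl("relu", toy_autocast_relu);
}
```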

A Contributor commented:

Feel free to put it in the cudnn directory; both dirs end up in the torch_cuda library in the end.

@mcarilli (Collaborator, Author) commented:

hopefully resolved

// Utilities exposed in RNNUtils.h
namespace cudnn_rnn {

TORCH_CUDA_API std::tuple<Tensor, std::vector<Tensor>> copy_weights_to_flat_buf_views(
A Contributor commented:

Any substantive change to logic here?

@mcarilli (Collaborator, Author) commented:

The added knobs are there to make sure it can service both _cudnn_rnn_flatten_weight and autocast::_cudnn_rnn_cast_reflatten.

The existing behavior of _cudnn_rnn_flatten_weight should remain unchanged.
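For context, a sketch of what such a shared helper's interface can look like. The parameter names here are invented for illustration (see ATen/native/cudnn/RNNUtils.h for the real interface), though the parameter types match the signature that appears in the build log quoted further down:

```cpp
#include <ATen/ATen.h>  // also pulls in the c10 export macros
#include <cudnn.h>      // for cudnnDataType_t

// Illustrative declaration: one helper serves both _cudnn_rnn_flatten_weight
// and the autocast wrapper by exposing knobs instead of duplicating logic.
// TORCH_CUDA_API exports the symbol from the torch_cuda library.
TORCH_CUDA_API std::tuple<at::Tensor, std::vector<at::Tensor>>
copy_weights_to_flat_buf_views(
    at::TensorList weight_arr, int64_t weight_stride0, int64_t input_size,
    int64_t mode, int64_t hidden_size, int64_t num_layers, bool batch_first,
    bool bidirectional,
    cudnnDataType_t flat_buf_datatype,   // FP16 when called from autocast
    const at::TensorOptions& flat_buf_options,
    bool set_orig_weights_to_flat_buf,   // true for _cudnn_rnn_flatten_weight
    bool allow_type_change,              // true for the autocast wrapper
    bool include_bias);
```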

@ezyang (Contributor) commented Aug 5, 2020

While I am still not sure why you had to factor out a chunk of code into a separate helper, overall the changes seem reasonable and lightweight. Happy to approve when this is out of WIP.

@smessmer added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Aug 10, 2020
# so they get a dedicated test.
# Despite the large number of RNN cases it tries, the test takes < 15 seconds on a Titan V (similar to V100).
@unittest.skipIf(not TEST_CUDNN, 'CUDNN not available')
def test_autocast_rnn(self):
A Contributor commented:
cc @mruberry for an interesting ad hoc testing example

@ezyang (Contributor) commented Aug 10, 2020


19:36:09 ======================================================================
19:36:09 FAIL: test_autocast_rnn (__main__.TestCuda)
19:36:09 ----------------------------------------------------------------------
19:36:09 Traceback (most recent call last):
19:36:09   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 815, in wrapper
19:36:09     method(*args, **kwargs)
19:36:09   File "test_cuda.py", line 2659, in test_autocast_rnn
19:36:09     self.assertEqual(out.grad_fn.name(), "CudnnRnnBackward")
19:36:09   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1126, in assertEqual
19:36:09     super().assertEqual(x, y, msg=msg)
19:36:09 AssertionError: 'CatBackward' != 'CudnnRnnBackward'
19:36:09 - CatBackward
19:36:09 + CudnnRnnBackward

The rocm failure is real.

Also a lint error:

  {
    path: 'test/test_cuda.py',
    start_line: 2610,
    end_line: 2610,
    start_column: 5,
    end_column: 5,
    annotation_level: 'failure',
    message: '[E301] expected 1 blank line, found 0'
  }

@mcarilli (Collaborator, Author) commented Aug 11, 2020

Thanks for the quick review! Users will be happy about this.

Oops, I left the lint failure deliberately as a reminder to discuss whether the test should be tried without cudnn as well. What do you think?

As for rocm, are we ok to @skipIfRocm, or should the test be tried with rocm?
Tbh I'm confused why the existing @unittest.skipIf(not TEST_CUDNN, 'CUDNN not available') allowed the test to run with rocm; is TEST_CUDNN considered true for rocm??

@ngimel (Collaborator) commented Aug 11, 2020

Yes, TEST_CUDNN = TEST_CUDA and (TEST_WITH_ROCM or something else). And yes, it's ok to @skipIfRocm.

@mcarilli (Collaborator, Author) commented Aug 11, 2020

Failures look spurious now (rocm failure is in test_nn)

@facebook-github-bot (Contributor) left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@ezyang (Contributor) commented Aug 17, 2020

OK, I need to lock changes to this PR from the OSS side, as it looks like it needs fbcode-side build system changes.

@facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in fbf274f.

malfet added a commit to malfet/pytorch that referenced this pull request Aug 19, 2020
In some versions of GCC, the std::tuple constructor from an initializer list is marked explicit, which results in the following compilation error:
```
/var/lib/jenkins/workspace/aten/src/ATen/native/cudnn/RNN.cpp: In function 'std::tuple<at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > at::native::cudnn_rnn::copy_weights_to_flat_buf_views(at::TensorList, int64_t, int64_t, int64_t, int64_t, int64_t, bool, bool, cudnnDataType_t, const c10::TensorOptions&, bool, bool, bool)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cudnn/RNN.cpp:687:35: error: converting to 'std::tuple<at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > >' from initializer list would use explicit constructor 'constexpr std::tuple<_T1, _T2>::tuple(_U1&&, _U2&&) [with _U1 = at::Tensor&; _U2 = std::vector<at::Tensor>&; <template-parameter-2-3> = void; _T1 = at::Tensor; _T2 = std::vector<at::Tensor>]'
     return {weight_buf, params_arr};
```
This regression was introduced by pytorch#42385
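The conventional fix for this class of error (and presumably what the follow-up commit does; the exact change is in #43244) is to construct the tuple explicitly rather than via an initializer list:

```cpp
// Minimal reproduction of the pattern and the portable fix.
#include <tuple>
#include <utility>
#include <vector>

std::tuple<int, std::vector<int>> make_result(int a, std::vector<int> v) {
  // return {a, v};  // ill-formed where libstdc++ marks this ctor explicit
  return std::make_tuple(a, std::move(v));  // well-formed everywhere
}
```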
facebook-github-bot pushed a commit that referenced this pull request Aug 19, 2020
Summary: same as the commit message above (in some versions of GCC, the std::tuple constructor from an initializer list is marked explicit, causing the compilation error shown above in copy_weights_to_flat_buf_views; regression introduced by #42385).

Fixes #{issue number}

Pull Request resolved: #43244

Reviewed By: pbelevich

Differential Revision: D23205656

Pulled By: malfet

fbshipit-source-id: 51470386ad95290c7c99d733fc1fe655aa27d009

Labels

Merged · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

AMP autocast error in cnn-lstm forward
