Autocast support for cudnn RNNs #42385
Conversation
… if bias=False, need to debug.
💊 CI failures summary (Dr. CI, as of commit 8cf8c98): ci.pytorch.org: 1 failed.
oh god I probably should review this XD
Thanks for volunteering 🚪🔥 abandon hope all ye who enter here. Given the complexity of the code and the small number of people who have safaried through it, I made the test exhaustive to keep us on track. The best thing I can say about my implementation is "if it's stupid and it works, it ain't stupid."
ngimel left a comment
Sorry, can't continue today.
aten/src/ATen/autocast_mode.cpp
Outdated
```
#include <ATen/cuda/CUDAConfig.h>

#if AT_CUDNN_ENABLED()
#include <ATen/native/cudnn/RNNUtils.h>
```
Danger, danger, Will Robinson: ATen/autocast_mode.cpp is compiled as part of ATen_cpu and it should not access any headers in the CUDA directory. You will probably have to chuck these autocast wrappers in a separate file in ATen/cuda.
Hmmm, but if I move my wrapper (along with the above include) to an inline function in a header in ATen/cuda, then include that header in autocast_mode.cpp, it seems like I'm no better off.
Do you mean I should put my wrapper (_cudnn_rnn_cast_reflatten) declaration+definition in, say, ATen/cuda/AutocastRNN.h+.cpp,
have ATen/autocast_mode.cpp include AutocastRNN.h (or forward declare _cudnn_rnn_cast_reflatten),
and have ATen/cuda/AutocastRNN.cpp be the thing that includes ATen/native/cudnn/RNNUtils.h?
That would mean ATen/native/cudnn/RNNUtils.h doesn't wind up directly included in autocast_mode.cpp. Or am I misinterpreting?
Also, are the files in ATen/cuda compiled into a separate library from ATen_cpu? If so, and I do the above, should I declare _cudnn_rnn_cast_reflatten with TORCH_CUDA_API?
Would doing the above using ATen/cudnn instead of ATen/cuda to host my files be equally valid? If so, I think cudnn rather than cuda is a more appropriate home for the RNN wrapper.
You should do the registration inside the cpp file in cuda. Then you do not need to include the header from ATen/autocast_mode.cpp.
Feel free to put it in cudnn directory; both dirs end up in torch_cuda library in the end.
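A minimal sketch of what that could look like, assuming the wrapper lands in a torch_cuda-compiled file such as ATen/cudnn/AutocastRNN.cpp (the file name and the elided wrapper argument list are illustrative; TORCH_LIBRARY_IMPL is the registration macro PyTorch actually uses for dispatch-key overrides):

```
// AutocastRNN.cpp (illustrative name) -- compiled into torch_cuda, so it
// can include cudnn headers that ATen_cpu must never see.
#include <ATen/native/cudnn/RNNUtils.h>
#include <torch/library.h>

namespace at {
namespace autocast {

// Wrapper defined entirely in this translation unit; its signature must
// match the aten::_cudnn_rnn schema exactly (argument list elided here).
std::tuple<Tensor, Tensor, Tensor, Tensor, Tensor>
_cudnn_rnn_cast_reflatten(/* ...same arguments as _cudnn_rnn... */);

// Registering against the Autocast dispatch key from this file means
// ATen/autocast_mode.cpp never needs to include or even declare any of
// this, so no CUDA headers leak into ATen_cpu.
TORCH_LIBRARY_IMPL(aten, Autocast, m) {
  m.impl("_cudnn_rnn", TORCH_FN(_cudnn_rnn_cast_reflatten));
}

} // namespace autocast
} // namespace at
```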
Hopefully resolved.
```
// Utilities exposed in RNNUtils.h
namespace cudnn_rnn {

TORCH_CUDA_API std::tuple<Tensor, std::vector<Tensor>> copy_weights_to_flat_buf_views(
```
Any substantive change to logic here?
The added knobs are there to make sure it can service both _cudnn_rnn_flatten_weight and autocast::_cudnn_rnn_cast_reflatten (see the signature sketch below). The existing behavior of _cudnn_rnn_flatten_weight should remain unchanged.
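For concreteness, here is the helper's signature as it can be reconstructed from the GCC error quoted later on this page; the parameter names and the per-knob comments are guesses, but the types and the trailing "knob" arguments match that error:

```
// ATen/native/cudnn/RNNUtils.h -- parameter names are guesses; the types
// are taken from the compiler error quoted further down this thread.
TORCH_CUDA_API std::tuple<Tensor, std::vector<Tensor>> copy_weights_to_flat_buf_views(
    TensorList weight_arr,
    int64_t weight_stride0,
    int64_t input_size,
    int64_t mode,
    int64_t hidden_size,
    int64_t num_layers,
    bool batch_first,
    bool bidirectional,
    cudnnDataType_t flat_buf_datatype,      // knob: dtype of the flat buffer
    const TensorOptions& flat_buf_options,  // knob: how/where to allocate it
    bool set_orig_weights_to_flat_buf,      // knob (guessed semantics): keep
                                            //   _cudnn_rnn_flatten_weight's in-place behavior
    bool allow_type_change,                 // knob (guessed semantics)
    bool include_bias);                     // knob (guessed semantics)
```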
While I am still not sure why you had to factor out a chunk of code into a separate helper, overall the changes seem reasonable and lightweight. Happy to approve when this is out of WIP.
```
# so they get a dedicated test.
# Despite the large number of RNN cases it tries, the test takes < 15 seconds on a Titan V (similar to V100).
@unittest.skipIf(not TEST_CUDNN, 'CUDNN not available')
def test_autocast_rnn(self):
```
cc @mruberry for an interesting ad hoc testing example
The rocm failure is real; there's also a lint error.
Thanks for the quick review! Users will be happy about this. Oops, I left the lint failure deliberately as a reminder to discuss whether the test should be tried without cudnn as well. What do you think? As for rocm, are we ok to …
Yes, …
Failures look spurious now (rocm failure is in test_nn).
facebook-github-bot left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
OK, I need to lock changes to this PR from the OSS side, as it looks like this PR needs fbcode-side build system changes.
Summary:
In some versions of GCC, the tuple constructor from an initializer list is marked as explicit, which results in the following compilation error:
```
/var/lib/jenkins/workspace/aten/src/ATen/native/cudnn/RNN.cpp: In function 'std::tuple<at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > at::native::cudnn_rnn::copy_weights_to_flat_buf_views(at::TensorList, int64_t, int64_t, int64_t, int64_t, int64_t, bool, bool, cudnnDataType_t, const c10::TensorOptions&, bool, bool, bool)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cudnn/RNN.cpp:687:35: error: converting to 'std::tuple<at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > >' from initializer list would use explicit constructor 'constexpr std::tuple<_T1, _T2>::tuple(_U1&&, _U2&&) [with _U1 = at::Tensor&; _U2 = std::vector<at::Tensor>&; <template-parameter-2-3> = void; _T1 = at::Tensor; _T2 = std::vector<at::Tensor>]'
return {weight_buf, params_arr};
```
This regression was introduced by #42385
Fixes #{issue number}
Pull Request resolved: #43244
Reviewed By: pbelevich
Differential Revision: D23205656
Pulled By: malfet
fbshipit-source-id: 51470386ad95290c7c99d733fc1fe655aa27d009
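Distilled into a self-contained repro (hypothetical types; the std::make_tuple spelling is one conventional workaround for those toolchains):

```
#include <tuple>
#include <vector>

// On some older GCC/libstdc++ versions, tuple's converting constructor
// is marked explicit, so copy-list-initialization of the return value is
// rejected with "would use explicit constructor".
std::tuple<int, std::vector<int>> make_result(int a, std::vector<int> v) {
  // return {a, v};               // fails to compile on those GCC versions
  return std::make_tuple(a, v);   // explicit construction works everywhere
}

int main() {
  auto r = make_result(1, {2, 3});
  return std::get<0>(r);
}
```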
Should close #36428.
The cudnn RNN API expects weights to occupy a flat buffer in memory with a particular layout. This PR implements a "speed of light" fix:
`_cudnn_rnn_cast_reflatten` (the autocast wrapper assigned to `_cudnn_rnn`) copies weights to the right slices of a flat FP16 buffer with a single read/write per weight (as opposed to casting them to FP16 individually, then reflattening the individual FP16 weights, which would require two reads/writes per weight). It isn't pretty, but IMO it doesn't make the rnn bindings much more tortuous than they already are.
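As a rough sketch of the single read/write idea (the helper name and the naive sequential layout are illustrative only; the actual wrapper must place each slice exactly where cudnn's flat-buffer layout expects it):

```
#include <ATen/ATen.h>

// Illustrative only -- not the PR's code. Allocate one flat FP16 buffer,
// then cast-and-copy each FP32 weight straight into its slice: copy_()
// converts dtype during the copy, so every weight is read once and
// written once, with no intermediate per-weight FP16 tensors.
// Assumes `weights` is non-empty; lays slices out sequentially.
at::Tensor flatten_weights_to_fp16(at::TensorList weights) {
  int64_t total = 0;
  for (const auto& w : weights) total += w.numel();
  at::Tensor flat = at::empty({total}, weights[0].options().dtype(at::kHalf));
  int64_t offset = 0;
  for (const auto& w : weights) {
    flat.narrow(0, offset, w.numel()).copy_(w.reshape({-1}));
    offset += w.numel();
  }
  return flat;
}
```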
The test tries a forward under autocast and a backward for the full cross product of RNN options and input/weight/hidden dtypes. As with all FP16-list autocast tests, forward output and backward grads are checked against a control where inputs (including RNN module weights in this case) are pre-cast to FP16 on the Python side.
Not sure who to ask for review; tagging @ezyang and @ngimel because Ed wrote this file (almost 2 years ago) and Natalia did the most recent major surgery.
Side quests discovered: