Conversation

@mcarilli (Collaborator) commented Feb 22, 2022:

#62143 was reverted (#72089) because, when running the native tests internally with cuDNN and GPUs for which `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` was selected, we hit `CUDNN_STATUS_NOT_SUPPORTED` errors.

Based on https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#features-of-rnn-functions and experiments, I strongly suspect the errors occurred because `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` doesn't support variable sequence lengths within a batch.

This PR restores #62143 and adds a bailout condition if the input is a packed batch that might have different sequence lengths per element.

Question for review: Do we also need to add a bailout condition if the input is double precision?
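
For context, a minimal sketch of the shape of that bailout. The struct and helper names are hypothetical stand-ins, not the actual algo-selection code in aten/src/ATen/native/cudnn/RNN.cpp, and it assumes a cuDNN 8.2+ header where the SMALL_H enum exists:

```cpp
// Hedged sketch of the bailout, not the actual ATen source.
#include <cudnn.h>

// Hypothetical stand-in for the tensor-descriptor params the real
// algo-selection code receives.
struct TensorsSketch {
  bool input_packed;  // true if the batch came from pack_padded_sequence
  bool is_input_packed() const { return input_packed; }
};

cudnnRNNAlgo_t select_algo_sketch(const TensorsSketch& tensors,
                                  bool otherwise_eligible_for_small_h) {
  // CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H does not appear to support
  // variable sequence lengths within a batch, so a packed batch (whose
  // elements may have different lengths) must fall back to the standard
  // algorithm instead of failing with CUDNN_STATUS_NOT_SUPPORTED.
  if (otherwise_eligible_for_small_h && !tensors.is_input_packed()) {
    return CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H;
  }
  return CUDNN_RNN_ALGO_STANDARD;
}
```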

@pytorch-bot bot commented Feb 22, 2022:

⚛️ CI Flow Status

Ruleset - Version: v1
Ruleset - File: https://github.com/mcarilli/pytorch/blob/b65833259d626b5e0e14ab0dbdb4e8f5b5b01f69/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

**Triggered Workflows**

| Workflow | Labels (bold = enabled) | Status |
|---|---|---|
| linux-binary-conda | ciflow/binaries, ciflow/binaries_conda, **ciflow/default** | ✅ triggered |
| linux-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| linux-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| linux-binary-manywheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/noarch, ciflow/trunk | ✅ triggered |
| linux-bionic-rocm4.5-py3.7 | ciflow/all, **ciflow/default**, ciflow/linux, ciflow/rocm, ciflow/trunk | ✅ triggered |
| linux-docs | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/docs, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, **ciflow/default**, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, **ciflow/default**, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/sanitizers, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/onnx, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| macos-arm64-binary-conda | ciflow/binaries, ciflow/binaries_conda, **ciflow/default** | ✅ triggered |
| macos-arm64-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| macos-binary-conda | ciflow/binaries, ciflow/binaries_conda, **ciflow/default** | ✅ triggered |
| macos-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| macos-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| macos-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/trunk, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, **ciflow/default**, ciflow/trunk, ciflow/win | ✅ triggered |
| windows-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| windows-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| windows-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |

**Skipped Workflows**

| Workflow | Labels (bold = enabled) | Status |
|---|---|---|
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-xla-linux-bionic-py3.7-clang8 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla | 🚫 skipped |

@facebook-github-bot (Contributor) commented Feb 22, 2022:

💊 CI failures summary and remediations

As of commit eeb5945 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@ngimel (Collaborator) commented Feb 22, 2022:

What about the regular persistent algorithm? Should it bail out when inputs are packed? Does it already do so (and if so, where? It's not in the algo-selection function)?
Also, for the regular persistent algo I see a bunch of checks on the sizes in use_persist_common_heuristics; should small_h also perform these checks?

@xwang233 (Collaborator):
`CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` is not supported for double precision, and we shouldn't use it there. It was causing lots of RNN_float64 unit test failures.
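
A hedged illustration of the dtype guard this implies (hypothetical helper name; it assumes half and single precision remain eligible, which the thread doesn't spell out):

```cpp
#include <cudnn.h>

// Hedged sketch: per the comment above, SMALL_H must never be selected
// for double-precision inputs (it was failing RNN_float64 unit tests).
// Assumption: every other floating type stays eligible.
bool small_h_dtype_ok_sketch(cudnnDataType_t datatype) {
  return datatype != CUDNN_DATA_DOUBLE;
}
```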

@mcarilli (Collaborator, Author) commented Feb 22, 2022:

The other persistent algo already bails out if the input is packed (and it only triggers for fp16 data).
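
For reference, a hedged sketch of the guards described here for the regular persistent path (hypothetical names; the real size heuristics live in use_persist_common_heuristics and a device-specific counterpart):

```cpp
#include <cudnn.h>

// Hedged sketch, not the ATen source: the regular
// CUDNN_RNN_ALGO_PERSIST_STATIC path is only considered for fp16 data
// and bails out on packed input before any size heuristics run.
bool regular_persist_eligible_sketch(cudnnDataType_t datatype,
                                     bool input_is_packed) {
  if (datatype != CUDNN_DATA_HALF) {
    return false;  // only triggers for fp16 data
  }
  if (input_is_packed) {
    return false;  // packed (possibly variable-length) batches fall back
  }
  // ... use_persist_common_heuristics-style size checks would follow ...
  return true;
}
```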

@mcarilli (Collaborator, Author) commented Feb 22, 2022:

I reworked the checks, hopefully good now.

@ngimel (Collaborator) commented Feb 22, 2022:

Does PERSIST_STATIC_SMALL_H support arbitrary hidden and input sizes? For the regular persist algorithm there are checks that those are multiples of 128 or 64 or something.

@mcarilli (Collaborator, Author) commented:

> Does PERSIST_STATIC_SMALL_H support arbitrary hidden and input sizes? For the regular persist algorithm there are checks that those are multiples of 128 or 64 or something.

No idea. @xwang233 did you discuss this at all with cudnn people during the original PR?

@xwang233 (Collaborator):
Only small hidden sizes are supported, and those checks are already in the PR code. All input sizes should be supported.
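
A hedged sketch of what that answer implies for the eligibility check; the cap below is an illustrative placeholder, not the PR's actual per-GPU thresholds:

```cpp
// Hedged sketch: the small_h path caps the hidden size but accepts any
// input size. kMaxSmallHiddenSize is a made-up placeholder; the real
// bounds in the PR depend on the GPU and aren't reproduced here.
constexpr int kMaxSmallHiddenSize = 128;  // illustrative only

bool small_h_size_ok_sketch(int hidden_size, int /*input_size*/) {
  // Any input size is acceptable; only the hidden size is constrained.
  return hidden_size > 0 && hidden_size <= kMaxSmallHiddenSize;
}
```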

@saketh-are added the `triaged` label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Feb 27, 2022.
@facebook-github-bot (Contributor):
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request on Mar 7, 2022:
…N hidden_size (#73211)

Summary:
#62143 was reverted (#72089) because, when running native tests internally with cudnn and GPUs such that `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` was used, we hit some `CUDNN_STATUS_NOT_SUPPORTED` errors.

Based on https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#features-of-rnn-functions and experiments, I strongly suspect the errors were because `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` doesn't support variable sequence lengths in the batch.

This PR restores #62143 and adds a bailout condition if the input is a packed batch that might have different sequence lengths per element.

Question for review: Do we also need to add a bailout condition if the input is double precision?

Pull Request resolved: #73211

Reviewed By: ejguan

Differential Revision: D34688016

Pulled By: ngimel

fbshipit-source-id: e7335c4701dabc7d0b36ebdb6414c4353a71ee91
@github-actions bot (Contributor) commented Mar 7, 2022:

Hey @mcarilli.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

cyyever pushed two commits to cyyever/pytorch_private that referenced this pull request on Mar 9, 2022 (…N hidden_size (#73211), cherry picked from commit b9023bfd1c31eb9a38bf0552a20412e9a4e60b91), with the same summary and metadata as the merge commit above.
Labels: cla signed, open source, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


6 participants