Conversation

@mcarilli (Collaborator) commented Feb 22, 2022:

#62143 was reverted (#72089) because, when running the native tests internally with cuDNN and GPUs for which `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` was selected, we hit `CUDNN_STATUS_NOT_SUPPORTED` errors.

Based on https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#features-of-rnn-functions and experiments, I strongly suspect the errors occurred because `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` doesn't support variable sequence lengths within a batch.

This PR restores #62143 and adds a bailout condition if the input is a packed batch that might have different sequence lengths per element.

Question for review: Do we also need to add a bailout condition if the input is double precision?
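
For context, a minimal sketch of the shape of that bailout. The struct and helper names are hypothetical stand-ins, not the actual algo-selection code in aten/src/ATen/native/cudnn/RNN.cpp, and it assumes a cuDNN 8.2+ header where the SMALL_H enum exists:

```cpp
// Hedged sketch of the bailout, not the actual ATen source.
#include <cudnn.h>

// Hypothetical stand-in for the tensor-descriptor params the real
// algo-selection code receives.
struct TensorsSketch {
  bool input_packed;  // true if the batch came from pack_padded_sequence
  bool is_input_packed() const { return input_packed; }
};

cudnnRNNAlgo_t select_algo_sketch(const TensorsSketch& tensors,
                                  bool otherwise_eligible_for_small_h) {
  // CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H does not appear to support
  // variable sequence lengths within a batch, so a packed batch (whose
  // elements may have different lengths) must fall back to the standard
  // algorithm instead of failing with CUDNN_STATUS_NOT_SUPPORTED.
  if (otherwise_eligible_for_small_h && !tensors.is_input_packed()) {
    return CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H;
  }
  return CUDNN_RNN_ALGO_STANDARD;
}
```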

@pytorch-bot bot commented Feb 22, 2022:

⚛️ CI Flow Status

Ruleset - Version: v1
Ruleset - File: https://github.com/mcarilli/pytorch/blob/b65833259d626b5e0e14ab0dbdb4e8f5b5b01f69/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

**Triggered Workflows**

| Workflow | Labels (bold = enabled) | Status |
|---|---|---|
| linux-binary-conda | ciflow/binaries, ciflow/binaries_conda, **ciflow/default** | ✅ triggered |
| linux-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| linux-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| linux-binary-manywheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/noarch, ciflow/trunk | ✅ triggered |
| linux-bionic-rocm4.5-py3.7 | ciflow/all, **ciflow/default**, ciflow/linux, ciflow/rocm, ciflow/trunk | ✅ triggered |
| linux-docs | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/docs, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, **ciflow/default**, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, **ciflow/default**, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/sanitizers, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/onnx, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| macos-arm64-binary-conda | ciflow/binaries, ciflow/binaries_conda, **ciflow/default** | ✅ triggered |
| macos-arm64-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| macos-binary-conda | ciflow/binaries, ciflow/binaries_conda, **ciflow/default** | ✅ triggered |
| macos-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| macos-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| macos-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/trunk, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, **ciflow/default**, ciflow/trunk, ciflow/win | ✅ triggered |
| windows-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| windows-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| windows-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |

**Skipped Workflows**

| Workflow | Labels (bold = enabled) | Status |
|---|---|---|
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| pytorch-xla-linux-bionic-py3.7-clang8 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla | 🚫 skipped |

@facebook-github-bot (Contributor) commented Feb 22, 2022:

💊 CI failures summary and remediations

As of commit eeb5945 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@ngimel (Collaborator) commented Feb 22, 2022:

What about the regular persistent algorithm? Should it bail out when inputs are packed? Does it already do so (and if so, where? It's not in the algo-selection function)?
Also, for the regular persistent algo I see a bunch of checks on the sizes in use_persist_common_heuristics; should small_h also perform these checks?

@xwang233 (Collaborator):
`CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` is not supported for double precision, and we shouldn't use it there. It was causing lots of RNN_float64 unit test failures.
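
A hedged illustration of the dtype guard this implies (hypothetical helper name; it assumes half and single precision remain eligible, which the thread doesn't spell out):

```cpp
#include <cudnn.h>

// Hedged sketch: per the comment above, SMALL_H must never be selected
// for double-precision inputs (it was failing RNN_float64 unit tests).
// Assumption: every other floating type stays eligible.
bool small_h_dtype_ok_sketch(cudnnDataType_t datatype) {
  return datatype != CUDNN_DATA_DOUBLE;
}
```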

@mcarilli (Collaborator, Author) commented Feb 22, 2022:

The other persistent algo already bails out if the input is packed (and it only triggers for fp16 data).
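
For reference, a hedged sketch of the guards described here for the regular persistent path (hypothetical names; the real size heuristics live in use_persist_common_heuristics and a device-specific counterpart):

```cpp
#include <cudnn.h>

// Hedged sketch, not the ATen source: the regular
// CUDNN_RNN_ALGO_PERSIST_STATIC path is only considered for fp16 data
// and bails out on packed input before any size heuristics run.
bool regular_persist_eligible_sketch(cudnnDataType_t datatype,
                                     bool input_is_packed) {
  if (datatype != CUDNN_DATA_HALF) {
    return false;  // only triggers for fp16 data
  }
  if (input_is_packed) {
    return false;  // packed (possibly variable-length) batches fall back
  }
  // ... use_persist_common_heuristics-style size checks would follow ...
  return true;
}
```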

@mcarilli (Collaborator, Author) commented Feb 22, 2022:

I reworked the checks, hopefully good now.

@ngimel (Collaborator) commented Feb 22, 2022:

Does PERSIST_STATIC_SMALL_H support arbitrary hidden and input sizes? For the regular persist algorithm there are checks that those are multiples of 128 or 64 or something.

@mcarilli (Collaborator, Author) commented:

> Does PERSIST_STATIC_SMALL_H support arbitrary hidden and input sizes? For the regular persist algorithm there are checks that those are multiples of 128 or 64 or something.

No idea. @xwang233 did you discuss this at all with cudnn people during the original PR?

@xwang233 (Collaborator):
Only small hidden sizes are supported, and those checks are already in the PR code. All input sizes should be supported.
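
A hedged sketch of what that answer implies for the eligibility check; the cap below is an illustrative placeholder, not the PR's actual per-GPU thresholds:

```cpp
// Hedged sketch: the small_h path caps the hidden size but accepts any
// input size. kMaxSmallHiddenSize is a made-up placeholder; the real
// bounds in the PR depend on the GPU and aren't reproduced here.
constexpr int kMaxSmallHiddenSize = 128;  // illustrative only

bool small_h_size_ok_sketch(int hidden_size, int /*input_size*/) {
  // Any input size is acceptable; only the hidden size is constrained.
  return hidden_size > 0 && hidden_size <= kMaxSmallHiddenSize;
}
```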

@saketh-are added the `triaged` label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Feb 27, 2022.
@facebook-github-bot (Contributor):
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request on Mar 7, 2022:
…N hidden_size (#73211)

Summary:
#62143 was reverted (#72089) because, when running native tests internally with cudnn and GPUs such that `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` was used, we hit some `CUDNN_STATUS_NOT_SUPPORTED` errors.

Based on https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#features-of-rnn-functions and experiments, I strongly suspect the errors were because `CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H` doesn't support variable sequence lengths in the batch.

This PR restores #62143 and adds a bailout condition if the input is a packed batch that might have different sequence lengths per element.

Question for review: Do we also need to add a bailout condition if the input is double precision?

Pull Request resolved: #73211

Reviewed By: ejguan

Differential Revision: D34688016

Pulled By: ngimel

fbshipit-source-id: e7335c4701dabc7d0b36ebdb6414c4353a71ee91
@github-actions bot (Contributor) commented Mar 7, 2022:

Hey @mcarilli.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

cyyever pushed two commits to cyyever/pytorch_private that referenced this pull request on Mar 9, 2022 (…N hidden_size (#73211), cherry picked from commit b9023bfd1c31eb9a38bf0552a20412e9a4e60b91), with the same summary and metadata as the merge commit above.
Labels: cla signed, open source, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


6 participants