
Conversation

@xwang233 (Collaborator) commented Jul 2, 2020

Related: #35661

Preview: [screenshot of the rendered documentation]

@xwang233 xwang233 requested a review from apaszke as a code owner July 2, 2020 02:00
@xwang233 (Collaborator, Author) commented Jul 2, 2020

cc @ptrblck

@xwang233 xwang233 requested a review from ngimel July 2, 2020 02:00
@dr-ci bot commented Jul 2, 2020

💊 CI failures summary and remediations

As of commit 7f71fa0 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/1)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/docker/build.sh 
Auto-merging .circleci/docker/build.sh 
CONFLICT (add/add): Merge conflict in .circleci/config.yml 
Auto-merging .circleci/config.yml 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/util/docker_constants.py 
Auto-merging .circleci/cimodel/data/simple/util/docker_constants.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py 
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py 
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py 
Auto-merging .circleci/cimodel/data/pytorch_build_data.py 

❄️ 4 failures tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_code_analysis (1/4)

Step: "Build" (full log | diagnosis details | 🔁 rerun) ❄️

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:fff7795428560442086f7b2bb6004b65245dc11a not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:fff7795428560442086f7b2bb6004b65245dc11a 

See CircleCI build pytorch_linux_xenial_py3_clang5_android_ndk_r19c_vulkan_x86_32_build (2/4)

Step: "Build" (full log | diagnosis details | 🔁 rerun) ❄️

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:fff7795428560442086f7b2bb6004b65245dc11a not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:fff7795428560442086f7b2bb6004b65245dc11a 

See CircleCI build pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build (3/4)

Step: "Build" (full log | diagnosis details | 🔁 rerun) ❄️

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:fff7795428560442086f7b2bb6004b65245dc11a not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:fff7795428560442086f7b2bb6004b65245dc11a 

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_custom_build_dynamic (4/4)

Step: "Build" (full log | diagnosis details | 🔁 rerun) ❄️

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:fff7795428560442086f7b2bb6004b65245dc11a not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:fff7795428560442086f7b2bb6004b65245dc11a 

🚧 7 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 13 times.

where :math:`k = \frac{1}{\text{hidden\_size}}`

.. warning::
    There are known deterministic issues for LSTM using cuDNN 7.6.5, 8.0 on CUDA 10.1 or later.
Collaborator:

known non-determinism issues. Is it for LSTM only, or are RNN/GRU also affected?

Collaborator:

Is the non-deterministic behavior really only on those two versions of cuDNN and only if the version of CUDA is 10.1 or later?

xwang233 (Collaborator, Author):

It could be related to other RNNs as well. I'll check that and add docs in other places if necessary.

Collaborator:

You may also want to cover yourself and say "On some versions of cuDNN and CUDA..." It's not great, since then people will never know if they may hit this issue or not, but it's better than telling them it may only happen in cases X and Y and then seeing it happen in case Z, too.

Too bad there's no way to query for whether the function will be deterministic or not in the current environment, or request that it be run deterministically.

xwang233 (Collaborator, Author):

I tested on CUDA 10.2 with cuDNN 7.6.5. RNN and LSTM are affected; GRU is deterministic.
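
(For reference, a minimal sketch of the kind of check described here, assuming a CUDA build with cuDNN enabled; the layer sizes, shapes, and comparison method are illustrative and this is not the author's actual test script:)

import torch
import torch.nn as nn

def run_backward(module_cls):
    # Re-seed so both runs start from identical weights and inputs.
    torch.manual_seed(0)
    rnn = module_cls(input_size=64, hidden_size=64, num_layers=2).cuda()
    x = torch.randn(32, 8, 64, device="cuda")  # (seq_len, batch, input_size)
    out, _ = rnn(x)
    out.sum().backward()  # the backward pass is where cuDNN non-determinism shows up
    return [p.grad.clone() for p in rnn.parameters()]

for cls in (nn.RNN, nn.LSTM, nn.GRU):
    g1, g2 = run_backward(cls), run_backward(cls)
    identical = all(torch.equal(a, b) for a, b in zip(g1, g2))
    print(f"{cls.__name__}: gradients bitwise-identical across runs: {identical}")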

@mruberry mruberry removed the request for review from apaszke July 7, 2020 23:27
@mruberry mruberry added the module: docs (Related to our documentation, both in docs/ and docblocks) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Jul 7, 2020
@ngimel (Collaborator) commented Jul 7, 2020

Can you also cross-reference this from https://pytorch.org/docs/stable/notes/randomness.html? We try to keep a list of all non-deterministic ops in that note.

This may affect performance.
On CUDA 10.2 or later, set environment variable
(note the leading colon symbol)
Collaborator:

Which one is it?

xwang233 (Collaborator, Author):

Do you mean the CUBLAS_WORKSPACE_CONFIG values? Either one would be fine.
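
(For context, a hedged sketch of applying this setting; it assumes the two values under discussion are ":16:8" and ":4096:2", as in the published nn.LSTM docs, and that the variable must be set before the first CUDA/cuBLAS call, e.g. at the top of the script or on the command line:)

# Sketch: equivalent to running  CUBLAS_WORKSPACE_CONFIG=:4096:2 python train.py
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"  # or ":16:8"; note the leading colon

import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=64, hidden_size=64, num_layers=2).cuda()
x = torch.randn(16, 4, 64, device="cuda")  # (seq_len, batch, input_size)
out, _ = lstm(x)
out.sum().backward()  # with the workspace config set, repeated runs should match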

@facebook-github-bot (Contributor) left a comment:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@ngimel merged this pull request in 60e2baf.


Labels

Merged; module: docs (Related to our documentation, both in docs/ and docblocks); open source; triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet


5 participants