
Conversation

@mingzhe09088
Contributor

Summary: There is a case where the PG cleanup thread checks CUDA event status after the CUDA runtime library has been unloaded. When that happens, it leads to a "driver shutting down" error. This issue usually happens when a CUDA API is called in a global or static object destructor.

Test Plan: wait for user

Differential Revision: D34904896

fbshipit-source-id: a2846050f0f7b37742a9e0d79e13f3b7b05d1fad
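
For context, the pattern described in the summary amounts to roughly the sketch below (illustrative only, not the PR's actual ProcessGroupNCCL code; eventDone and cleanupLoop are made-up names). The cleanup thread polls a CUDA event through a wrapper that throws on errors; once the process is tearing down and the runtime has been unloaded, that call fails with a "driver shutting down" message, which the thread swallows instead of crashing.

    #include <cuda_runtime.h>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    // Query an event; throw on real errors, the way wrapped CUDA checks typically do.
    bool eventDone(cudaEvent_t ev) {
      cudaError_t err = cudaEventQuery(ev);
      if (err != cudaSuccess && err != cudaErrorNotReady) {
        throw std::runtime_error(cudaGetErrorString(err));
      }
      return err == cudaSuccess;
    }

    // Cleanup-thread style loop: if the CUDA runtime has already been unloaded
    // (e.g. we are inside a static destructor at process exit), the query fails
    // with "driver shutting down"; swallow that one error and exit quietly.
    void cleanupLoop(cudaEvent_t ev, int rank) {
      try {
        while (!eventDone(ev)) {
          // ... sleep briefly and keep polling ...
        }
      } catch (const std::exception& e) {
        if (std::string(e.what()).find("driver shutting down") == std::string::npos) {
          throw;  // a genuine error: propagate as before
        }
        std::cerr << "[Rank " << rank << "] CUDA runtime is shutting down; exiting cleanup\n";
      }
    }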
@pytorch-bot

pytorch-bot bot commented Mar 15, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/mingzhe09088/pytorch-1/blob/2f4ab77c7a215e29225f238d6fa9a2386d87bf02/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
linux-binary-manywheel ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build ciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
windows-binary-libtorch-debug ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
windows-binary-libtorch-release ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
windows-binary-wheel ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-rocm4.5-py3.7-distributed ciflow/all, ciflow/linux, ciflow/rocm, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 15, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 2f4ab77 (more details on the Dr. CI page):


  • 2/8 failures introduced in this PR
  • 6/8 broken upstream at merge base 770da30 on Mar 15 from 1:31pm to 5:00pm

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge) (1/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-03-15T22:30:48.6944206Z NameError: name 'get_all_int_dtypes' is not defined
2022-03-15T22:30:48.6326824Z NameError: name 'get_all_int_dtypes' is not defined
2022-03-15T22:30:48.6327100Z 
2022-03-15T22:30:48.6327528Z 🚨 ERROR: TestBinaryUfuncsCPU.test_sub_cpu_uint8
2022-03-15T22:30:48.6940011Z Traceback (most recent call last):
2022-03-15T22:30:48.6940985Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 376, in instantiated_test
2022-03-15T22:30:48.6941683Z     result = test(self, **param_kwargs)
2022-03-15T22:30:48.6942362Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 918, in only_fn
2022-03-15T22:30:48.6942974Z     return fn(slf, *args, **kwargs)
2022-03-15T22:30:48.6943401Z   File "test_binary_ufuncs.py", line 2052, in test_sub
2022-03-15T22:30:48.6943806Z     if dtype in get_all_int_dtypes():
2022-03-15T22:30:48.6944206Z NameError: name 'get_all_int_dtypes' is not defined
2022-03-15T22:30:48.6944486Z 
2022-03-15T22:30:48.6950216Z ✅ 15835 Passed
2022-03-15T22:30:48.6950511Z 💨 8150 Skipped
2022-03-15T22:30:48.6950780Z 🚨 12 Failed
2022-03-15T22:30:48.7461460Z ##[group]Run .github\scripts\wait_for_ssh_to_drain.ps1
2022-03-15T22:30:48.7462038Z .github\scripts\wait_for_ssh_to_drain.ps1
2022-03-15T22:30:48.7479255Z shell: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.EXE -command ". '{0}'"
2022-03-15T22:30:48.7479743Z env:
2022-03-15T22:30:48.7480097Z   BUILD_ENVIRONMENT: win-vs2019-cpu-py3
2022-03-15T22:30:48.7480459Z   BUILD_WHEEL: 1

See GitHub Actions build linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge) (2/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-03-15T22:26:11.6645307Z NameError: name 'get_all_int_dtypes' is not defined
2022-03-15T22:26:11.6523769Z 		
2022-03-15T22:26:11.6524015Z 🚨 ERROR: TestBinaryUfuncsCPU.test_sub_cpu_uint8
2022-03-15T22:26:11.6642622Z 
2022-03-15T22:26:11.6642876Z Traceback (most recent call last):
2022-03-15T22:26:11.6643486Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 376, in instantiated_test
2022-03-15T22:26:11.6643951Z     result = test(self, **param_kwargs)
2022-03-15T22:26:11.6644328Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 918, in only_fn
2022-03-15T22:26:11.6644606Z     return fn(slf, *args, **kwargs)
2022-03-15T22:26:11.6644813Z   File "test_binary_ufuncs.py", line 2052, in test_sub
2022-03-15T22:26:11.6645033Z     if dtype in get_all_int_dtypes():
2022-03-15T22:26:11.6645307Z NameError: name 'get_all_int_dtypes' is not defined
2022-03-15T22:26:11.6645486Z 		
2022-03-15T22:26:11.6645687Z ✅ 13480 Passed
2022-03-15T22:26:11.6646348Z 💨 7064 Skipped
2022-03-15T22:26:11.6649569Z 🚨 12 Failed
2022-03-15T22:26:11.7029057Z ##[group]Run # Remove any previous test jsons if they exist
2022-03-15T22:26:11.7029434Z # Remove any previous test jsons if they exist
2022-03-15T22:26:11.7029642Z rm -f test-jsons-*.zip
2022-03-15T22:26:11.7029874Z zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json'
2022-03-15T22:26:11.7041166Z shell: /usr/bin/bash -e {0}
2022-03-15T22:26:11.7041349Z env:

🚧 6 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@facebook-github-bot facebook-github-bot added oncall: distributed Add this issue/PR to distributed oncall triage queue fb-exported labels Mar 15, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D34904896

facebook-github-bot pushed a commit that referenced this pull request Mar 18, 2022
Summary:
Pull Request resolved: #74258

There is a case where the PG cleanup thread checks CUDA event status after the CUDA runtime library has been unloaded. When that happens, it leads to a "driver shutting down" error. This issue usually happens when a CUDA API is called in a global or static object destructor.

Test Plan: wait for user

Reviewed By: jiayisuse, osalpekar

Differential Revision: D34904896

fbshipit-source-id: 705c0812132dae97ea55fcb22730557880ca35e1
@github-actions
Contributor

Hey @mingzhe09088.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

}
}
} catch (const std::exception& e) {
if (std::string(e.what()).find("driver shutting down") == std::string::npos) {
Contributor

This seems really fragile; is there a better way to detect this? Maybe using cudaGetLastError?

Contributor Author

Since the CUDA library has been unloaded at this point, any CUDA runtime API call would fail with a "driver shutting down" error. If there were a way to prevent the CUDA library from unloading, that would solve the issue here.
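
As an aside, the error-code route hinted at above could look roughly like the sketch below (hypothetical, not part of this PR; the function name is made up). To my understanding, cudaErrorCudartUnloading is the CUDA runtime error code whose message is "driver shutting down", so a probe can compare error codes instead of matching the error string.

    #include <cuda_runtime.h>
    #include <stdexcept>
    #include <string>

    // Returns true if the event has completed OR the CUDA runtime is being torn
    // down at process exit; throws on any other CUDA error.
    bool eventDoneOrRuntimeUnloading(cudaEvent_t ev) {
      cudaError_t err = cudaEventQuery(ev);
      if (err == cudaErrorCudartUnloading) {
        // Runtime is unloading; cudaGetErrorString(err) would be "driver shutting down".
        (void)cudaGetLastError();  // consume the pending error (cf. the cudaGetLastError suggestion)
        return true;
      }
      if (err != cudaSuccess && err != cudaErrorNotReady) {
        throw std::runtime_error(std::string("CUDA error: ") + cudaGetErrorString(err));
      }
      return err == cudaSuccess;
    }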

if (std::string(e.what()).find("driver shutting down") == std::string::npos) {
throw;
}
LOG(INFO) << "[Rank " << rank_
Contributor

Shouldn't this be LOG(WARNING) or LOG(ERROR)?

Contributor Author

Sure, which one do you prefer?

if (std::string(e.what()).find("driver shutting down") == std::string::npos) {
throw;
}
LOG(INFO) << "[Rank " << rank_
Contributor

Why do we swallow the exception and just continue here? Isn't this misleading to the user, who may think the operation has completed when it actually has not?

Contributor Author

When this happens, the operation has already completed, so user code is not affected. We just want the PG to exit peacefully without aborting.

shahofblah pushed a commit that referenced this pull request Mar 25, 2022
Summary:
Pull Request resolved: #74258

There is a case where the PG cleanup thread checks CUDA event status after the CUDA runtime library has been unloaded. When that happens, it leads to a "driver shutting down" error. This issue usually happens when a CUDA API is called in a global or static object destructor.

Test Plan: wait for user

Reviewed By: jiayisuse, osalpekar

Differential Revision: D34904896

fbshipit-source-id: 705c0812132dae97ea55fcb22730557880ca35e1
(cherry picked from commit ecb5f14)

Labels

cla signed fb-exported oncall: distributed Add this issue/PR to distributed oncall triage queue
