[PG NCCL] catch cuda lib runtime error - driver shutting down #74258
Conversation
Summary: There is a case where the PG cleanup thread checks CUDA event status after the CUDA runtime library has been unloaded. When that happens, it leads to a "driver shutting down" error. This issue usually happens when a CUDA API is called in a global or static object destructor.
Test Plan: wait for user
Differential Revision: D34904896
fbshipit-source-id: a2846050f0f7b37742a9e0d79e13f3b7b05d1fad
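For context, the failure mode can be illustrated with a minimal, hypothetical sketch (not code from this PR): a static object's destructor calls a CUDA runtime API after the runtime has already started tearing down at process exit, so the call fails with the "driver shutting down" error.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical illustration (not from this PR): a static object whose
// destructor calls into the CUDA runtime. Static destructors run during
// process teardown, possibly after the CUDA runtime has already been
// unloaded, in which case the call fails with cudaErrorCudartUnloading,
// whose message is "driver shutting down".
struct CleanupAtExit {
  ~CleanupAtExit() {
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
      std::fprintf(stderr, "CUDA call in static destructor failed: %s\n",
                   cudaGetErrorString(err));
    }
  }
};

static CleanupAtExit cleanup_at_exit;  // destroyed after main() returns

int main() {
  cudaFree(nullptr);  // lazily initialize the CUDA runtime
  return 0;
}
```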
💊 CI failures summary and remediations: as of commit 2f4ab77 (more details on the Dr. CI page), 2 new failures were recognized by patterns. These CI failures do not appear to be due to upstream breakages.
This pull request was exported from Phabricator. Differential Revision: D34904896
Summary: Pull Request resolved: #74258
There is a case where the PG cleanup thread checks CUDA event status after the CUDA runtime library has been unloaded. When that happens, it leads to a "driver shutting down" error. This issue usually happens when a CUDA API is called in a global or static object destructor.
Test Plan: wait for user
Reviewed By: jiayisuse, osalpekar
Differential Revision: D34904896
fbshipit-source-id: 705c0812132dae97ea55fcb22730557880ca35e1
Hey @mingzhe09088.
      }
    }
  } catch (const std::exception& e) {
    if (std::string(e.what()).find("driver shutting down") == std::string::npos) {
This seems really fragile; is there a better way to detect this? Maybe using cudaGetLastError?
Since the CUDA library has been unloaded at this point, any CUDA runtime API call would fail with the "driver shutting down" error. If there were a way to prevent the CUDA library from unloading, that would solve the issue here.
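For reference, "driver shutting down" is the message for the CUDA error code cudaErrorCudartUnloading, so one alternative to matching on the message string would be to check the error code itself. A rough sketch under that assumption (the helper names are hypothetical, and this is not what the PR implements):

```cpp
#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical helper: true if the error means the CUDA runtime is being
// unloaded at process exit ("driver shutting down").
bool isCudartUnloading(cudaError_t err) {
  return err == cudaErrorCudartUnloading;
}

// Sketch of a cleanup-side query that tolerates only the unloading error.
bool eventDoneOrRuntimeGone(cudaEvent_t event) {
  cudaError_t err = cudaEventQuery(event);
  if (err == cudaSuccess) {
    return true;                 // event has completed
  }
  if (err == cudaErrorNotReady) {
    return false;                // work still in flight
  }
  if (isCudartUnloading(err)) {
    (void)cudaGetLastError();    // clear the sticky error state
    return true;                 // process is exiting; nothing left to do
  }
  // Any other error is unexpected; surface it to the caller.
  throw std::runtime_error(cudaGetErrorString(err));
}
```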
    if (std::string(e.what()).find("driver shutting down") == std::string::npos) {
      throw;
    }
    LOG(INFO) << "[Rank " << rank_
Shouldn't this be LOG(WARNING) or LOG(ERROR)?
Sure, which one do you prefer?
    if (std::string(e.what()).find("driver shutting down") == std::string::npos) {
      throw;
    }
    LOG(INFO) << "[Rank " << rank_
Why do we swallow the exception and just continue here? Isn't this misleading to the user, who would think the operation has completed when it actually has not?
When this happens, the operation has already completed, so user code is not affected. We just want PG to exit peacefully without aborting.
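To make the behavior concrete, here is an approximate sketch of the shape of the change being discussed (not the exact diff; rank_, the log message text, and the surrounding cleanup loop are assumed from context): anything other than the "driver shutting down" error is rethrown, and the unloading case is logged so the cleanup thread can exit without aborting.

```cpp
try {
  // ... cleanup thread polls CUDA events for completed NCCL work ...
} catch (const std::exception& e) {
  // Rethrow anything that is not the CUDA runtime being unloaded at exit.
  if (std::string(e.what()).find("driver shutting down") == std::string::npos) {
    throw;
  }
  // The runtime is shutting down because the process is exiting; the
  // outstanding work has already completed, so log and exit peacefully.
  LOG(INFO) << "[Rank " << rank_
            << "] caught 'driver shutting down' in the cleanup thread; "
            << "exiting without aborting.";
}
```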
Summary: Pull Request resolved: #74258
There is a case where the PG cleanup thread checks CUDA event status after the CUDA runtime library has been unloaded. When that happens, it leads to a "driver shutting down" error. This issue usually happens when a CUDA API is called in a global or static object destructor.
Test Plan: wait for user
Reviewed By: jiayisuse, osalpekar
Differential Revision: D34904896
fbshipit-source-id: 705c0812132dae97ea55fcb22730557880ca35e1
(cherry picked from commit ecb5f14)