[NCCL] Add Environment Variable to guard Async Error Handling feature #44163
This PR introduces a new environment variable, NCCL_ASYNC_ERROR_HANDLING, which guards the asynchronous error handling feature. We intend to eventually turn this feature on by default for all users. Until then, the variable is a temporary measure so that the change in behavior from hanging to crashing does not suddenly become the default for everyone. Differential Revision: [D23517895](https://our.internmc.facebook.com/intern/diff/D23517895/)
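To make the opt-in behavior concrete, here is a minimal sketch of how such a guard could be read. Only the variable name comes from this PR; the free function `asyncErrorHandlingEnabled` and the rule that any value other than 0 or 1 is rejected are assumptions for illustration, not the actual `ProcessGroupNCCL` code.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Variable name as given in the PR description.
constexpr const char* kNcclAsyncErrorHandling = "NCCL_ASYNC_ERROR_HANDLING";

// Hypothetical free function, not the real ProcessGroupNCCL member.
// Unset or 0 keeps the existing (hanging) behavior; 1 opts in to the
// new (crashing) behavior. Rejecting other values is an assumption.
bool asyncErrorHandlingEnabled() {
  const char* raw = std::getenv(kNcclAsyncErrorHandling);
  if (raw == nullptr) {
    return false;  // the feature stays off unless explicitly requested
  }
  int value = 0;
  try {
    value = std::stoi(raw);  // throws if the text does not start with a number
  } catch (const std::exception&) {
    throw std::runtime_error(
        std::string("Invalid value for ") + kNcclAsyncErrorHandling + ": " + raw);
  }
  if (value == 0) {
    return false;
  }
  if (value == 1) {
    return true;
  }
  throw std::runtime_error(
      std::string("Invalid value for ") + kNcclAsyncErrorHandling + ": " + raw);
}
```

With a guard like this, a user would opt in by exporting `NCCL_ASYNC_ERROR_HANDLING=1` before launching the job, and everyone else keeps the current behavior.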
CI failures summary (Dr. CI), as of commit 7109ebc: ci.pytorch.org: 1 failed.
  }
}

void ProcessGroupNCCL::parseNcclAsyncErrorHandling() {
This function looks good. I wonder if we could reuse parseNcclBlockingWait() here somehow.
I think it should be doable. I've opened an issue to follow up on this: #44205 (mainly because we might be able to make this helper function more broadly available in the PyTorch codebase).
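As a rough illustration of the reuse discussed here, the sketch below shows one shape a shared helper could take. Everything in it is hypothetical: `parseBinaryEnvVar`, `NcclEnvSettings`, and `readNcclEnvSettings` are made-up names, and the assumption that both settings accept only 0 or 1 is mine. It is not the helper tracked in #44205.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical shared helper: reads a 0/1 environment variable and
// falls back to defaultValue when the variable is not set.
bool parseBinaryEnvVar(const std::string& name, bool defaultValue = false) {
  const char* raw = std::getenv(name.c_str());
  if (raw == nullptr) {
    return defaultValue;
  }
  const std::string value(raw);
  if (value == "0") {
    return false;
  }
  if (value == "1") {
    return true;
  }
  throw std::runtime_error(
      "Invalid value for environment variable " + name + ": " + value);
}

// Hypothetical container for the two flags, for illustration only.
struct NcclEnvSettings {
  bool blockingWait;
  bool asyncErrorHandling;
};

// Both parse functions collapse into calls to the same helper.
NcclEnvSettings readNcclEnvSettings() {
  return NcclEnvSettings{
      parseBinaryEnvVar("NCCL_BLOCKING_WAIT"),
      parseBinaryEnvVar("NCCL_ASYNC_ERROR_HANDLING")};
}
```

Taking the variable name as a parameter keeps the validation and the error message in one place, which is also what would make the helper usable elsewhere in the codebase.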
Codecov Report
@@             Coverage Diff              @@
##   gh/osalpekar/81/base   #44163   +/-   ##
===========================================
- Coverage     69.24%     69.24%   -0.01%
===========================================
  Files           381        381
  Lines         47573      47573
===========================================
- Hits          32943      32942       -1
- Misses        14630      14631       +1
This pull request has been merged in 48c47db.