Skip to content

Conversation

@osalpekar
Copy link
Member

@osalpekar osalpekar commented Sep 4, 2020

Stack from ghstack:

In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.

Differential Revision: D23517895

In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.

Differential Revision: [D23517895](https://our.internmc.facebook.com/intern/diff/D23517895/)

[ghstack-poisoned]
osalpekar added a commit that referenced this pull request Sep 4, 2020
In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.

Differential Revision: [D23517895](https://our.internmc.facebook.com/intern/diff/D23517895/)

ghstack-source-id: 111402543
Pull Request resolved: #44163
@dr-ci
Copy link

dr-ci bot commented Sep 4, 2020

💊 CI failures summary and remediations

As of commit 7109ebc (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 8 times.

}
}

void ProcessGroupNCCL::parseNcclAsyncErrorHandling() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function looks good. Wondering if we can reuse parseNcclBlockingWait() somehow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be doable, I've made an issue to follow up on this: #44205 (mainly because we might be able to make this helper function more broadly available in the PyTorch codebase)

@codecov
Copy link

codecov bot commented Sep 4, 2020

Codecov Report

Merging #44163 into gh/osalpekar/81/base will decrease coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@                   Coverage Diff                    @@
##           gh/osalpekar/81/base   #44163      +/-   ##
========================================================
- Coverage                 69.24%   69.24%   -0.01%     
========================================================
  Files                       381      381              
  Lines                     47573    47573              
========================================================
- Hits                      32943    32942       -1     
- Misses                    14630    14631       +1     
Impacted Files Coverage Δ
torch/testing/_internal/expecttest.py 77.55% <0.00%> (-1.03%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f19707c...7109ebc. Read the comment docs.

…ing feature"

In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.

Differential Revision: [D23517895](https://our.internmc.facebook.com/intern/diff/D23517895/)

[ghstack-poisoned]
osalpekar added a commit that referenced this pull request Sep 8, 2020
Pull Request resolved: #44163

In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.
ghstack-source-id: 111614317

Differential Revision: [D23517895](https://our.internmc.facebook.com/intern/diff/D23517895/)
…ing feature"

In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.

Differential Revision: [D23517895](https://our.internmc.facebook.com/intern/diff/D23517895/)

[ghstack-poisoned]
osalpekar added a commit that referenced this pull request Sep 9, 2020
Pull Request resolved: #44163

In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.
ghstack-source-id: 111637788

Differential Revision: [D23517895](https://our.internmc.facebook.com/intern/diff/D23517895/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 48c47db.

@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/81/head branch September 13, 2020 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants