[PyTorch Distributed] Add debug hint for NCCL async system error #73897
This pull request was exported from Phabricator. Differential Revision: D34702348 |
Probably better to extend the error message directly here: https://github.com/pytorch/pytorch/blob/master/torch%2Fcsrc%2Fdistributed%2Fc10d%2FNCCLUtils.hpp#L28
Appreciate your comment, @pritamdamania87!
I put the additional message here because I need to rely on the fact that this is an asynchronous NCCL error (i.e., one raised dynamically during an NCCL operation), rather than applying the hint to all ncclSystemErrors, which may also include those returned immediately during NCCL initialization. Errors like "connection closed by remote peer" mostly happen asynchronously during NCCL execution.
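To illustrate the distinction drawn in this comment, here is a minimal Python sketch of appending the hint only for asynchronous system errors. It is not the PR's actual C++ change: `describe_system_error`, `BASE_MESSAGE`, and the `is_async` flag are hypothetical names, and the base message is placeholder wording. The hint text is taken from the suggestion later in this thread.

```python
# Hypothetical illustration only; the real change lives in PyTorch's C++ NCCL code.
BASE_MESSAGE = "ncclSystemError: a system call or external library call failed."  # placeholder wording
REMOTE_EXIT_HINT = "It can be also caused by unexpected exit of a remote peer."


def describe_system_error(is_async: bool) -> str:
    """Return the error text, adding the remote-exit hint only for async errors."""
    if is_async:
        # Raised while an NCCL operation was in flight: a remote peer exiting
        # is a plausible cause, so point the user at it.
        return f"{BASE_MESSAGE} {REMOTE_EXIT_HINT}"
    # Returned immediately (e.g. during initialization): keep the plain message.
    return BASE_MESSAGE
```

Under these assumptions, `describe_system_error(True)` carries the hint while `describe_system_error(False)` returns the base message unchanged.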
Updated the code change to reflect @pritamdamania87's comment.
rohan-varma left a comment
LGTM given @kwen2501's reasoning in the response to Pritam's comment, although let's wait for @pritamdamania87 to confirm.
1ea64bc to fb3b272
Should we mention this instead:
It can be also caused by unexpected exit of a remote peer, please check NCCL logs (after enabling NCCL_DEBUG=WARN or INFO) for the exact reason of the failure.
Not for this PR, but in a separate PR maybe we should add the line "Please check NCCL logs (after enabling NCCL_DEBUG=WARN or INFO) for the exact reason of the failure." to all other messages too.
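For readers who want to follow the advice in the proposed message, a minimal sketch of turning on NCCL logging from Python is shown below. `NCCL_DEBUG` and the levels `WARN` and `INFO` come from the comments above; the helper name `enable_nccl_logging` is hypothetical, and the sketch assumes the variable is set before NCCL is initialized in the process (for example, before the process group is created).

```python
import os


def enable_nccl_logging(level: str = "WARN") -> str:
    """Set NCCL_DEBUG for this process; call before NCCL is initialized."""
    allowed = {"WARN", "INFO"}  # the two levels mentioned in this thread
    if level not in allowed:
        raise ValueError(f"unsupported NCCL_DEBUG level: {level!r}")
    os.environ["NCCL_DEBUG"] = level
    return os.environ["NCCL_DEBUG"]
```

The same effect can be had by exporting `NCCL_DEBUG=WARN` (or `INFO`) in the shell that launches the job; `INFO` is the more verbose of the two.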
…orch#73897)
Summary: Pull Request resolved: pytorch#73897
add a debug hint that async system error can be caused by unexpected exit of a remote process if not an actual network issue. For example, the exit of the remote process can cause a closed network connection error at a local process. The hint helps to direct the debug focus to the remote process.
Test Plan: unit tests
Reviewed By: pritamdamania87, rohan-varma
Differential Revision: D34702348
fbshipit-source-id: 85ccebc25e4c3a685dfb7d2bc2d981778cd08cd7
fb3b272 to cce2fdb
)
Summary: Pull Request resolved: #73897
add a debug hint that async system error can be caused by unexpected exit of a remote process if not an actual network issue. For example, the exit of the remote process can cause a closed network connection error at a local process. The hint helps to direct the debug focus to the remote process.
Test Plan: unit tests
Reviewed By: pritamdamania87, rohan-varma
Differential Revision: D34702348
fbshipit-source-id: d19f9116e9efe5f6d76c0158a7a447616437ca69
Summary:
Add a debug hint that an async system error can be caused by the unexpected exit of a remote process rather than by an actual network issue. For example, the exit of a remote process can cause a closed network connection error at a local process. The hint helps direct debugging toward the remote process.
Test Plan: unit tests
Differential Revision: D34702348
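The scenario in the summary can be reproduced in miniature with plain sockets. The sketch below does not use NCCL; it only shows, under that simplification, that when one end of a connection goes away the surviving end observes a closed connection locally, even though the cause is on the remote side. The function name `observe_peer_exit` is hypothetical.

```python
import socket


def observe_peer_exit() -> bytes:
    """Close one end of a connected socket pair and read from the other."""
    local, remote = socket.socketpair()
    try:
        # Simulate the remote process exiting: its end of the connection closes.
        remote.close()
        # The local side sees end-of-stream (an empty read), which on its own
        # looks like a network problem rather than a peer failure.
        return local.recv(1024)
    finally:
        local.close()
```

The empty read at the local end says nothing about why the peer went away, which is why the hint points the user toward the remote process and its logs.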