(torchx/elastic) honor NCCL_ASYNC_ERROR_HANDLING set from the env var #73982
This pull request was exported from Phabricator. Differential Revision: D34765786
Force-pushed from e3564f3 to ef01899.
Force-pushed from ef01899 to 6ef997e.
…pytorch#73982)

Summary:
Pull Request resolved: pytorch#73982

Currently there is no way for users using torchelastic to override NCCL_ASYNC_ERROR_HANDLING=0. This PR enables this.

Test Plan:
Added unittests

Manual testing

```
$ torchx run fb.dist.ddp -- --img torchx_examples -m print_env_vars.py --env NCCL_ASYNC_ERROR_HANDLING=0
```

Validated the NCCL_ASYNC_ERROR_HANDLING in the process running `print_env_vars.py` is indeed `0`.

Reviewed By: mannatsingh, aivanou

Differential Revision: D34765786

fbshipit-source-id: f4cb5623aa49d9d40d509c9c01a293276c7b8ee6
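For readers skimming the thread, the behavior described in the summary can be sketched roughly as follows. This is an illustration only, not the diff in this PR: the helper name `build_worker_env` is hypothetical, and the fallback value of `"1"` is an assumption about the launcher's default rather than something stated in the commit message.

```python
import os
from typing import Dict, Mapping, Optional


def build_worker_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: build the env vars handed to a worker process.

    NCCL_ASYNC_ERROR_HANDLING is taken from the caller's environment when
    present; otherwise it falls back to "1" (an assumed default).
    """
    parent = os.environ if base_env is None else base_env
    return {
        # Honor the user's value instead of unconditionally overwriting it.
        "NCCL_ASYNC_ERROR_HANDLING": parent.get("NCCL_ASYNC_ERROR_HANDLING", "1"),
    }
```

With this shape, a user who launches with `NCCL_ASYNC_ERROR_HANDLING=0` in the environment keeps that value in the worker, while users who set nothing get the assumed default.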
Force-pushed from 6ef997e to 5b14ea7.
…#73982)

Summary:
Pull Request resolved: #73982

Currently there is no way for users using torchelastic to override NCCL_ASYNC_ERROR_HANDLING=0. This PR enables this.

Test Plan:
Added unittests

Manual testing

```
$ torchx run fb.dist.ddp -- --img torchx_examples -m print_env_vars.py --env NCCL_ASYNC_ERROR_HANDLING=0
```

Validated the NCCL_ASYNC_ERROR_HANDLING in the process running `print_env_vars.py` is indeed `0`.

Reviewed By: mannatsingh, aivanou

Differential Revision: D34765786

fbshipit-source-id: 3f9f6d3b61e7d265adf689d387e020ab534c9259
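The contents of `print_env_vars.py` are not shown in this thread. A minimal stand-in for the manual check might look like the sketch below; the function name `format_env_var` and the `<unset>` placeholder are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def format_env_var(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return "NAME=value" for the given variable, or "NAME=<unset>" if absent."""
    env = os.environ if env is None else env
    return f"{name}={env.get(name, '<unset>')}"


if __name__ == "__main__":
    # Print the value the worker process actually sees.
    print(format_env_var("NCCL_ASYNC_ERROR_HANDLING"))
```

Run as the worker entry point, a script like this would let you confirm that the value passed via `--env` reaches the training process unchanged.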
Hey @kiukchung.
Differential Revision: D34765786