-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[14/N] Refactor _new_process_group_helper() to remove repeated code #88351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88351
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit a73a1c5: This comment was automatically generated by Dr. CI and updates every 15 minutes. |
…ated code" Changes: - refactor parts of `_new_process_group_helper()` to remove repeated code [ghstack-poisoned]
kwen2501
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for refactoring.
Just added minor comments.
| if backend in [Backend.GLOO, Backend.NCCL, Backend.UCC]: | ||
| # In debug mode and if GLOO is available, wrap in a wrapper PG that | ||
| # enables enhanced collective checking for debugability. | ||
| if get_debug_level() == DebugLevel.DETAIL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if condition should be put out-most.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I can move it outmost, looks a bit cleaner that way
| GLOO is not available. Build with Gloo to | ||
| create a wrapper process group in debug mode | ||
| to aid collective desynchronization debugging.""" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the current out-most condition as an elif here:
elif backend not in [Backend.GLOO, Backend.NCCL, Backend.UCC]:
logger.info(
"""TORCH_DISTRIBUTED_DEBUG was set to DETAIL, but the current backend does not support this feature, ignoring."""
)
else:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we want to add a log for this? We currently don't have a message. I am also not sure if there are things other than ProcessGroupWrapper that is available when dist.DebugLevel.DETAIL is set, maybe the logs are also more verbose so maybe we should just say the current backend does not support ProcessGroupWrapper?
…ated code" Changes: - refactor parts of `_new_process_group_helper()` to remove repeated code [ghstack-poisoned]
awgu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushing commits as well I see...
| # Process group wrapper initialization for supported PGs when TORCH_DISTRIBUTED_DEBUG is set | ||
| if backend in [Backend.GLOO, Backend.NCCL, Backend.UCC]: | ||
| # In debug mode and if GLOO is available, wrap in a wrapper PG that | ||
| # enables enhanced collective checking for debugability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| # enables enhanced collective checking for debugability. | |
| # enables enhanced collective checking for debuggability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you sir
…ated code" Changes: - refactor parts of `_new_process_group_helper()` to remove repeated code [ghstack-poisoned]
|
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
…ytorch#88351) Changes: - refactor parts of `_new_process_group_helper()` to remove repeated code Differential Revision: [D41188274](https://our.internmc.facebook.com/intern/diff/D41188274) Pull Request resolved: pytorch#88351 Approved by: https://github.com/kwen2501
Stack from ghstack:
Changes
_new_process_group_helper()to remove repeated codeContext
#86225
Differential Revision: D41188274