Skip to content

Conversation

@H-Huang
Copy link
Member

@H-Huang H-Huang commented Nov 2, 2022

Stack from ghstack:

Changes

  • refactor parts of _new_process_group_helper() to remove repeated code

Context

#86225

Differential Revision: D41188274

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 2, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88351

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a73a1c5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…ated code"


Changes:
- refactor parts of `_new_process_group_helper()` to remove repeated code

[ghstack-poisoned]
Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for refactoring.
Just added minor comments.

if backend in [Backend.GLOO, Backend.NCCL, Backend.UCC]:
# In debug mode and if GLOO is available, wrap in a wrapper PG that
# enables enhanced collective checking for debugability.
if get_debug_level() == DebugLevel.DETAIL:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if condition should be put out-most.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I can move it outmost, looks a bit cleaner that way

GLOO is not available. Build with Gloo to
create a wrapper process group in debug mode
to aid collective desynchronization debugging."""
)
Copy link
Collaborator

@kwen2501 kwen2501 Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the current out-most condition as an elif here:

elif backend not in [Backend.GLOO, Backend.NCCL, Backend.UCC]:
  logger.info(
    """TORCH_DISTRIBUTED_DEBUG was set to DETAIL, but the current backend does not support this feature, ignoring."""
  )
else:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we want to add a log for this? We currently don't have a message. I am also not sure if there are things other than ProcessGroupWrapper that is available when dist.DebugLevel.DETAIL is set, maybe the logs are also more verbose so maybe we should just say the current backend does not support ProcessGroupWrapper?

…ated code"


Changes:
- refactor parts of `_new_process_group_helper()` to remove repeated code

[ghstack-poisoned]
Copy link
Collaborator

@awgu awgu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushing commits as well I see...

# Process group wrapper initialization for supported PGs when TORCH_DISTRIBUTED_DEBUG is set
if backend in [Backend.GLOO, Backend.NCCL, Backend.UCC]:
# In debug mode and if GLOO is available, wrap in a wrapper PG that
# enables enhanced collective checking for debugability.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
# enables enhanced collective checking for debugability.
# enables enhanced collective checking for debuggability.

https://wikidiff.com/debugability/debuggability

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you sir

…ated code"


Changes:
- refactor parts of `_new_process_group_helper()` to remove repeated code

[ghstack-poisoned]
@H-Huang H-Huang added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 10, 2022
@H-Huang
Copy link
Member Author

H-Huang commented Nov 10, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@H-Huang
Copy link
Member Author

H-Huang commented Nov 10, 2022

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@H-Huang
Copy link
Member Author

H-Huang commented Nov 11, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
…ytorch#88351)

Changes:
- refactor parts of `_new_process_group_helper()` to remove repeated code

Differential Revision: [D41188274](https://our.internmc.facebook.com/intern/diff/D41188274)
Pull Request resolved: pytorch#88351
Approved by: https://github.com/kwen2501
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants