[DeviceMesh][ez] Extract the pg creation as a util function #163930

Closed
fduwjj wants to merge 2 commits into gh/fduwjj/210/base from gh/fduwjj/210/head

@fduwjj fduwjj commented Sep 26, 2025

pytorch-bot bot commented Sep 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163930

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bccf9e9 with merge base 5fcde74:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) on Sep 26, 2025
fduwjj added a commit that referenced this pull request Sep 26, 2025
fduwjj requested review from ezyang, fegin and lw on September 26, 2025 04:02
fduwjj added the "ciflow/trunk" (Trigger trunk jobs on your pull request) and "release notes: DeviceMesh" labels on Sep 26, 2025
This PR only extracts the common process group (pg) creation logic into a util function, because that logic will be reused many times in the following stack of DeviceMesh refactoring PRs.

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 26, 2025
fduwjj commented Sep 26, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Sep 27, 2025
)

While refactoring the DeviceMesh bookkeeping to leverage CuTe layouts, we found that we need two more util functions. One checks whether a layout overlaps itself, that is, whether two different coordinates map to the same index. For example, (2,2):(2,1) has no overlap while (2,2):(2,2) has overlap.

Pull Request resolved: #163367
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928, #163930
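
The overlap check mentioned in the commit message above can be illustrated with a brute-force sketch. It assumes the CuTe convention in which a layout is written shape:stride and a coordinate maps to the sum of coordinate times stride per dimension. The name `layout_has_overlap` is hypothetical and is not the function added in #163367, whose implementation is not shown in this thread.

```python
from itertools import product


def layout_has_overlap(shape, stride):
    """Return True if two distinct coordinates map to the same offset.

    Hypothetical helper for illustration only. Brute force: enumerate every
    coordinate in `shape`, compute its offset as sum(coord[i] * stride[i]),
    and report an overlap as soon as an offset repeats.
    """
    if len(shape) != len(stride):
        raise ValueError("shape and stride must have the same length")
    seen = set()
    for coord in product(*(range(s) for s in shape)):
        offset = sum(c * d for c, d in zip(coord, stride))
        if offset in seen:
            return True
        seen.add(offset)
    return False


print(layout_has_overlap((2, 2), (2, 1)))  # False: offsets are 0, 1, 2, 3
print(layout_has_overlap((2, 2), (2, 2)))  # True: offset 2 is produced twice
```

Enumerating all coordinates is exponential in the number of dimensions only through the product of the sizes, which is fine for small mesh layouts; a production check would more likely reason about sorted strides instead.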
jainapurva pushed a commit that referenced this pull request Sep 29, 2025
This is just to extract common logic into a util function because we will use it many times for the following stack of Device Mesh refactoring.

Pull Request resolved: #163930
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928
jainapurva pushed a commit that referenced this pull request Sep 29, 2025
)

While refactoring the DeviceMesh bookkeeping to leverage CuTe layouts, we found that we need two more util functions. One checks whether a layout overlaps itself, that is, whether two different coordinates map to the same index. For example, (2,2):(2,1) has no overlap while (2,2):(2,2) has overlap.

Pull Request resolved: #163367
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928, #163930
Comment on lines +896 to +904
mesh = DeviceMesh(
    device_type,
    mesh_nd,
    mesh_dim_names=mesh_dim_names,
    backend_override=backend_override,
    _init_backend=_init_backend,
)
if cur_rank in mesh_nd:
    res_mesh = mesh

Thanks for splitting this out as I had asked! I didn't get to review this before it was landed, but there's still something I don't understand. I get that we need to call the "PG creation API" multiple times on each rank, even for the ranks that don't participate, but I don't get why we need to call the DeviceMesh constructor multiple times!

Could we instead call the PG creation API directly and just invoke the DeviceMesh constructor once with the right PG?
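
The pattern the reviewer suggests can be sketched roughly as follows. This is a torch-free illustration under stated assumptions, not the code in this PR: `create_subgroups` and `fake_new_group` are hypothetical names, and the `new_group` argument stands in for a collective group-creation call such as `torch.distributed.new_group(ranks=...)`, which every process must enter for every subgroup, in the same order, even when it is not a member.

```python
from typing import Callable, Optional, Sequence


def create_subgroups(
    cur_rank: int,
    rank_groups: Sequence[Sequence[int]],
    new_group: Callable[[Sequence[int]], object],
) -> Optional[object]:
    """Create every subgroup on this rank and return the one containing it.

    Hypothetical helper: `new_group` is invoked for all subgroups so that
    every rank takes part in every creation call, but only the group that
    contains `cur_rank` is kept.
    """
    my_group = None
    for ranks in rank_groups:
        group = new_group(ranks)  # called unconditionally on every rank
        if cur_rank in ranks:
            my_group = group
    return my_group


# Stand-in factory so the sketch runs without initializing torch.distributed.
calls = []


def fake_new_group(ranks):
    calls.append(tuple(ranks))
    return ("pg", tuple(ranks))


mine = create_subgroups(2, [[0, 1], [2, 3]], fake_new_group)
print(mine)        # the group containing rank 2
print(len(calls))  # number of creation calls made on this rank
```

With the matching group in hand, the mesh object would only need to be constructed once per rank. How that group would be passed into DeviceMesh is not covered in this thread and is left out of the sketch.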

maggiemoss pushed a commit to maggiemoss/pytorch that referenced this pull request Sep 29, 2025
…163930)

This is just to extract common logic into a util function because we will use it many times for the following stack of Device Mesh refactoring.

Pull Request resolved: pytorch#163930
Approved by: https://github.com/fegin
ghstack dependencies: pytorch#163212, pytorch#163288, pytorch#163928
maggiemoss pushed a commit to maggiemoss/pytorch that referenced this pull request Sep 29, 2025
…rch#163367)

While refactoring the DeviceMesh bookkeeping to leverage CuTe layouts, we found that we need two more util functions. One checks whether a layout overlaps itself, that is, whether two different coordinates map to the same index. For example, (2,2):(2,1) has no overlap while (2,2):(2,2) has overlap.

Pull Request resolved: pytorch#163367
Approved by: https://github.com/fegin
ghstack dependencies: pytorch#163212, pytorch#163288, pytorch#163928, pytorch#163930
@github-actions github-actions bot deleted the gh/fduwjj/210/head branch October 30, 2025 02:16

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: DeviceMesh

4 participants