[For discussion][DeviceMesh] Use a shared_state to cache pg per layout, root_mesh and rank_map #166010
base: gh/fduwjj/229/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166010
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 New Failures as of commit 54abae7 with merge base a5f3035.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…g per layout, root_mesh and rank_map" We want to create a shared_state to store root_mesh, rank_map and pg caches. We can add more into it down the road, so that it becomes a singleton for bookkeeping and also align with our original proposal to move toward the idea of mesh universe. cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci [ghstack-poisoned]
lw left a comment:
Thanks!
I think this PR is a great starting point for the discussion around PG caching, and for defining a clear mental model for how different DeviceMesh instances should share state.
torch/distributed/device_mesh.py (Outdated)
  def _get_root_mesh(self) -> "DeviceMesh":
-     return self._root_mesh if self._root_mesh else self
+     return not_none(self._shared_state.get_root_mesh())
This is the only instance where we're accessing the root mesh of the shared state. In principle, once we introduce the concept of shared state, we can get rid of the concept of root meshes.
What is preventing us from simply removing this _get_root_mesh method?
If there's any internal usage of _get_root_mesh, could we find a way to codemod it away so that we can fully remove this method?
We might need more PRs to remove _get_root_mesh; we won't do it in this PR.
  non_ep_mesh = global_mesh._unflatten(0, (2, 2, 2), ("dp", "cp", "tp"))
  ep_mesh = global_mesh._unflatten(0, (2, 2, 2), ("dp", "ep", "ep_tp"))
What's the difference between these two lines vs a user giving multiple names to a dimension? Or a complex name such as "cp_or_ep"?
I think this is to generate a use case for testing the PG cache. The cache is keyed per layout, so when multiple names are assigned to one dimension, the PG will be shared.
Stack from ghstack (oldest at bottom):
To avoid creating extra PGs when they are not needed (for example, when calling unflatten many times and some dims share the same layout), we want to create a PG cache mechanism. We cache a PG by the pair of its layout (_MeshLayout) and its pg_option, so if users flatten or unflatten into the same layout + pg_option (by default pg_option is None if not set, in which case the cache key falls back to the layout itself), we will not create a process group if one has already been created.
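The caching idea above can be sketched in pure Python. This is not the actual PyTorch implementation: MeshLayout and PGCache are hypothetical stand-ins for _MeshLayout and the real cache, and the string placeholder stands in for a real ProcessGroup.

```python
class MeshLayout:
    """Hypothetical stand-in for _MeshLayout: hashable so it can be a cache key."""

    def __init__(self, sizes: tuple, strides: tuple):
        self.sizes = sizes
        self.strides = strides

    def __eq__(self, other):
        return (self.sizes, self.strides) == (other.sizes, other.strides)

    def __hash__(self):
        return hash((self.sizes, self.strides))


class PGCache:
    """Cache process groups keyed by (layout, pg_option); pg_option defaults to None."""

    def __init__(self):
        self._cache = {}
        self.created = 0  # counts real PG creations, for illustration only

    def get_or_create(self, layout: MeshLayout, pg_option=None):
        key = (layout, pg_option)
        if key not in self._cache:
            # Only create a new (placeholder) PG on a cache miss.
            self.created += 1
            self._cache[key] = f"pg_{self.created}"
        return self._cache[key]
```

With this, two unflatten calls that produce the same layout and the same (default) pg_option hit the same cache entry and reuse one PG; changing pg_option changes the key and forces a new group.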
Also, to further consolidate the bookkeeping of DeviceMesh, we created a shared_state to store the device_type, root_mesh, rank_map and PG caches. This way, all shared info for a given device mesh universe becomes a singleton.
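A minimal sketch of the shared-state idea, under the assumption that submeshes simply inherit their parent's bookkeeping object. SharedState and Mesh here are illustrative toy classes, not the actual DeviceMesh code.

```python
class SharedState:
    """Hypothetical per-universe bookkeeping singleton: holds the root mesh,
    the rank map, and the PG cache for every mesh derived from one root."""

    def __init__(self, root_mesh, rank_map):
        self._root_mesh = root_mesh  # the root mesh of this universe
        self._rank_map = rank_map    # rank map shared by all submeshes
        self._pg_cache = {}          # (layout, pg_option) -> process group

    def get_root_mesh(self):
        return self._root_mesh


class Mesh:
    """Toy mesh: a fresh root mesh creates the shared state; submeshes inherit it."""

    def __init__(self, name, shared_state=None):
        self.name = name
        if shared_state is None:
            # This mesh is a root: create the universe's shared state,
            # registering itself as the root mesh (rank map is illustrative).
            shared_state = SharedState(self, list(range(8)))
        self._shared_state = shared_state

    def slice(self, name):
        # Slicing/unflattening hands the same shared_state to the child,
        # so PGs, rank_map and root_mesh are bookkept in exactly one place.
        return Mesh(name, self._shared_state)
```

The design point is that there is no per-mesh copy of root_mesh or the PG cache: every derived mesh reaches the same SharedState instance, which is what makes the universe-wide caching in this PR possible.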
cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci