[Device Mesh] Add an option to decouple PGs when it comes to device mesh save #167590
base: gh/fduwjj/236/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167590
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 61f5996 with merge base a5f3035.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
fegin left a comment
Let's add a unit test to demonstrate torch.save/torch.load when decouple_backend_at_save is True.
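A minimal sketch of such a test, assuming the usual multi-process DeviceMesh test harness (initialized process group, world size 4, self.device_type available) and that the flag is read off the mesh instance as decouple_backend_at_save, matching the diff quoted below; the exact spelling and placement may differ in the final PR:

    import io

    import torch
    from torch.distributed.device_mesh import init_device_mesh

    def test_torch_save_load_decoupled(self):
        mesh = init_device_mesh(self.device_type, (2, 2), mesh_dim_names=("dp", "tp"))
        mesh.decouple_backend_at_save = True  # illustrative flag placement

        buf = io.BytesIO()
        torch.save(mesh, buf)  # should pickle the mesh without PG names
        buf.seek(0)
        loaded = torch.load(buf, weights_only=False)

        # The mesh layout survives the round trip ...
        self.assertEqual(loaded.mesh_dim_names, ("dp", "tp"))
        # ... but no PG names come back, so users must recreate/attach PGs
        # before running collectives on the loaded mesh.
        self.assertFalse(hasattr(loaded, "_dim_group_names"))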
    if state.get("dim_group_names"):
        self._dim_group_names = state["dim_group_names"]
This is the key step. We should add a comment here. How do users attach the PGs after torch.load?
I'm also unclear on this. If we landed this PR as-is, how is someone supposed to use it?
Maybe we should separate this into 2 PRs:
- just warn in getstate to discourage people from using it, explain the risk, and point to the DCP docs
- the rest of this PR, but with a better usage example and an explanation of how to 'bind' the loaded DTensor to a new mesh.
I am not sure how high priority (2) is, but if we do it, we should do it right and have good docs + UX for it.
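For reference, a hedged sketch of what that 'bind' flow might look like under the PR as written; _dim_group_names comes from the quoted diff above, while the file name, backend, and mesh layout are illustrative, and none of this is a supported public API:

    import torch
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh

    dist.init_process_group("nccl")

    # Recreate the PGs in exactly the same order as the saving run, so the
    # auto-assigned group names match the names pickled with the mesh.
    fresh_mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

    loaded_mesh = torch.load("mesh.pt", weights_only=False)

    # Hypothetical manual bind: reuse the freshly created groups' names,
    # mirroring what __setstate__ restores from the checkpoint today.
    loaded_mesh._dim_group_names = fresh_mesh._dim_group_names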
    if not self.decouple_backend_at_save and hasattr(self, "_dim_group_names"):
        logger.warning(
            "Save device mesh via torch.save with pg names and will be deprecated in PT 2.11. "
            "Users are welcome to use Distributed checkpoint (DCP) or re-create pgs in the same order"
        )
This warning message is probably not detailed enough to be helpful.
- "welcome to use DCP": to be clear, this suggestion asks the user to rewrite their flow, so we should at least point to a doc or tutorial
- "re-create pgs in the same order": this suggestion is not detailed enough to be actionable, IMO.
How about this?
"Starting in PyTorch 2.11, torch.save will save a DeviceMesh without including ProcessGroup information, and loading a saved DeviceMesh will require manually recreating the same configuration of ProcessGroups and binding them to the loaded DeviceMesh. See <link to example> for more information on how to do this. Alternatively, use DCP <link to tutorial> to save and load DTensors in a format that supports resharding and can be loaded on a different mesh configuration."
Stack from ghstack (oldest at bottom):
The rationale behind this PR is that we want a module-level flag which decouples PG info (names) from torch.save and torch.load for DeviceMesh (and DTensor). We want users to explicitly create PGs (or a DeviceMesh) when doing torch.load instead of reusing the saved PG names, because if users don't create the PGs, or create them in the wrong order, the loaded device mesh will not work.
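As a hedged illustration of that failure mode (the layout, backend, and file name here are made up):

    import torch
    from torch.distributed.device_mesh import init_device_mesh

    # Run A: creating the mesh registers its PGs with auto-assigned names in
    # creation order, and torch.save currently pickles those names with the mesh.
    mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
    torch.save(mesh, "mesh.pt")

    # Run B (a separate job): if the PGs are not recreated, or are created in a
    # different order, the saved names resolve to the wrong groups (or to none),
    # and collectives on the loaded mesh misbehave.
    loaded = torch.load("mesh.pt", weights_only=False)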
Also, we know that directly changing this behavior is BC-breaking, so we add a flag and warning messages for it; we will clean it up later on.
cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci