@fduwjj (Contributor) commented Nov 13, 2025

Stack from ghstack (oldest at bottom):

We always want to enable the ID-based rank_map to differentiate device meshes created within one universe from those created in different universes. With the flag that decouples PG info from device mesh save, we are now ready to enable this. (We tried to do this in #165680 and https://github.com/pytorch/pytorch/pull/166689/files, but both seem to be BC-breaking, so we gate it behind the new flag added in #167590.)

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Nov 13, 2025
fduwjj added a commit that referenced this pull request Nov 13, 2025
…backend_at_save

ghstack-source-id: bf1746b
Pull Request resolved: #167753
@pytorch-bot pytorch-bot bot removed the oncall: distributed Add this issue/PR to distributed oncall triage queue label Nov 13, 2025

pytorch-bot bot commented Nov 13, 2025

The label oncall: distributed is only applicable to issues and has been removed. Please only use this label on issues.

pytorch-bot bot commented Nov 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167753

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bae2cbc with merge base a5f3035:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@fduwjj fduwjj added release notes: DeviceMesh ciflow/trunk Trigger trunk jobs on your pull request labels Nov 14, 2025
@fduwjj fduwjj requested review from fegin and lw November 14, 2025 17:39

@wconstab (Contributor) commented

> We always want to enable the ID-based rank_map to differentiate device meshes created within one universe from those created in different universes.

I could use a more detailed motivation/explanation in the PR description. Maybe what you are saying is:

  1. You want to change the definition of DeviceMesh.__eq__ to compare id(rank_map) instead of the values of _flat_dim_map. (Why does this help?)
  2. It is hard to do (1) because there is some connection between _flat_dim_map being in __eq__ and the BC-preserving torch.save/torch.load behavior. (What is the connection, though? [Device Mesh] Add an option to decouple PGs when it comes device mesh save #167590 seems to focus mostly on the 'dim_group_names' attr.)
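
To make the distinction in (1) concrete, here is a minimal sketch in plain Python of identity-based versus value-based equality. It is an illustration only: `ToyMesh`, `layout`, and `compare_by_id` are hypothetical names and not part of PyTorch, and the real DeviceMesh may compare different fields than the ones shown here.

```python
class ToyMesh:
    """Hypothetical stand-in for a device mesh; not the real PyTorch class."""

    def __init__(self, rank_map, layout, compare_by_id):
        self.rank_map = rank_map            # list of ranks shared by one "universe"
        self.layout = layout                # tuple describing the mesh shape
        self.compare_by_id = compare_by_id  # assumed toggle, for illustration only

    def __eq__(self, other):
        if not isinstance(other, ToyMesh):
            return NotImplemented
        if self.layout != other.layout:
            return False
        if self.compare_by_id:
            # Identity check: both meshes must hold the very same rank_map object.
            return self.rank_map is other.rank_map
        # Value check: any rank_map with equal contents is treated as the same.
        return self.rank_map == other.rank_map


universe_a = [0, 1, 2, 3]
universe_b = [0, 1, 2, 3]  # equal values, but a distinct object

by_value = ToyMesh(universe_a, (2, 2), False) == ToyMesh(universe_b, (2, 2), False)
by_id_same = ToyMesh(universe_a, (2, 2), True) == ToyMesh(universe_a, (2, 2), True)
by_id_other = ToyMesh(universe_a, (2, 2), True) == ToyMesh(universe_b, (2, 2), True)

print(by_value, by_id_same, by_id_other)
```

In this sketch, value-based comparison cannot tell the two universes apart, while identity-based comparison treats meshes as equal only when they share one rank_map object. An object's identity does not survive a serialize/deserialize round trip, which may be one reason such a change interacts with save/load behavior, though the PR itself does not state this.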
