[For discussion][DeviceMesh] Use a shared_state to cache pg per layout, root_mesh and rank_map #166010
base: gh/fduwjj/229/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166010
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 New Failures as of commit 54abae7 with merge base a5f3035.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…g per layout, root_mesh and rank_map" We want to create a shared_state to store root_mesh, rank_map and pg caches. We can add more into it down the road, so that it becomes a singleton for bookkeeping and also align with our original proposal to move toward the idea of mesh universe. cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci [ghstack-poisoned]
lw left a comment:
Thanks!
I think this PR is a great starting point for the discussion around PG caching, and for defining a clear mental model for how different DeviceMesh instances should share state.
torch/distributed/device_mesh.py (Outdated)
  def _get_root_mesh(self) -> "DeviceMesh":
-     return self._root_mesh if self._root_mesh else self
+     return not_none(self._shared_state.get_root_mesh())
This is the only instance where we're accessing the root mesh of the shared state. In principle, once we introduce the concept of shared state, we can get rid of the concept of root meshes.
What is preventing us from simply removing this _get_root_mesh method?
If there's any internal usage of _get_root_mesh, could we find a way to codemod it away so that we can fully remove this method?
We might need more PRs to remove _get_root_mesh; we won't do it in this PR.
  non_ep_mesh = global_mesh._unflatten(0, (2, 2, 2), ("dp", "cp", "tp"))
  ep_mesh = global_mesh._unflatten(0, (2, 2, 2), ("dp", "ep", "ep_tp"))
What's the difference between these two lines vs a user giving multiple names to a dimension? Or a complex name such as "cp_or_ep"?
I think this is to generate a use case for testing the PG cache. The cache is keyed per layout, so when multiple names are assigned to one dimension, the PG will be shared.
Stack from ghstack (oldest at bottom):
To avoid creating extra PGs when they are not needed (for example, when calling unflatten many times and some dims share the same layout), we want to create a PG cache mechanism. We cache a PG by the pair of its layout (_MeshLayout) and its pg_option, so if users flatten or unflatten into the same layout + pg_option (by default pg_option is None if not set, in which case the cache key falls back to the layout itself), we will not create a process group if one has already been created.
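The caching idea above can be sketched in pure Python. This is not the actual PyTorch implementation: MeshLayout and PGCache are hypothetical stand-ins for _MeshLayout and the real cache, and the string placeholder stands in for a real ProcessGroup.

```python
class MeshLayout:
    """Hypothetical stand-in for _MeshLayout: hashable so it can be a cache key."""

    def __init__(self, sizes: tuple, strides: tuple):
        self.sizes = sizes
        self.strides = strides

    def __eq__(self, other):
        return (self.sizes, self.strides) == (other.sizes, other.strides)

    def __hash__(self):
        return hash((self.sizes, self.strides))


class PGCache:
    """Cache process groups keyed by (layout, pg_option); pg_option defaults to None."""

    def __init__(self):
        self._cache = {}
        self.created = 0  # counts real PG creations, for illustration only

    def get_or_create(self, layout: MeshLayout, pg_option=None):
        key = (layout, pg_option)
        if key not in self._cache:
            # Only create a new (placeholder) PG on a cache miss.
            self.created += 1
            self._cache[key] = f"pg_{self.created}"
        return self._cache[key]
```

With this, two unflatten calls that produce the same layout and the same (default) pg_option hit the same cache entry and reuse one PG; changing pg_option changes the key and forces a new group.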
Also, to further consolidate the bookkeeping of DeviceMesh, we created a shared_state to store the device_type, root_mesh, rank_map and PG caches. This way, all shared info for a given device mesh universe becomes a singleton.
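A minimal sketch of the shared-state idea, under the assumption that submeshes simply inherit their parent's bookkeeping object. SharedState and Mesh here are illustrative toy classes, not the actual DeviceMesh code.

```python
class SharedState:
    """Hypothetical per-universe bookkeeping singleton: holds the root mesh,
    the rank map, and the PG cache for every mesh derived from one root."""

    def __init__(self, root_mesh, rank_map):
        self._root_mesh = root_mesh  # the root mesh of this universe
        self._rank_map = rank_map    # rank map shared by all submeshes
        self._pg_cache = {}          # (layout, pg_option) -> process group

    def get_root_mesh(self):
        return self._root_mesh


class Mesh:
    """Toy mesh: a fresh root mesh creates the shared state; submeshes inherit it."""

    def __init__(self, name, shared_state=None):
        self.name = name
        if shared_state is None:
            # This mesh is a root: create the universe's shared state,
            # registering itself as the root mesh (rank map is illustrative).
            shared_state = SharedState(self, list(range(8)))
        self._shared_state = shared_state

    def slice(self, name):
        # Slicing/unflattening hands the same shared_state to the child,
        # so PGs, rank_map and root_mesh are bookkept in exactly one place.
        return Mesh(name, self._shared_state)
```

The design point is that there is no per-mesh copy of root_mesh or the PG cache: every derived mesh reaches the same SharedState instance, which is what makes the universe-wide caching in this PR possible.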
cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci