[DeviceMesh] Move global state into class method #164510
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164510.
Note: Links to docs will display an error until the doc builds have completed.
✅ No failures as of commit bd15c9d with merge base ffc9559.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR moves the bookkeeping state maps from MeshEnv into DeviceMesh class members. The motivation is that these global variables are, in general, thread-local and can cause potential issues. We also need to run the DTensor CPU-overhead benchmark for this change. cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci
lw left a comment:
This is great! I like this a lot! Thanks!
torch/distributed/device_mesh.py (Outdated)
```python
@property
def root_mesh(self) -> "DeviceMesh":
    # If a mesh does not have a root mesh stored, it is a root mesh itself.
    # A root mesh is not created through slicing.
    return self._root_mesh if self._root_mesh else self

@root_mesh.setter
def root_mesh(self, mesh: Optional["DeviceMesh"]) -> None:
    # you can add validation logic here if needed
    if mesh is not None and not isinstance(mesh, DeviceMesh):
        raise TypeError(f"Expected DeviceMesh or None, got {type(mesh)}")
    self._root_mesh = mesh
```
Could we make these attributes private? I'm still not convinced that the distinction between root and child mesh is really useful. At the end of the day, I believe, all we care about is which meshes belong to the same "universe" (i.e., which ones have the same shared state?).
While we discuss this and decide, I'd prefer if we avoid adding public API attributes which will limit us in the future.
Agreed. I also have concerns about other public methods; please see the flatten one.
Yes, making it private; there will be another PR for the universe idea.
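For reference, here is a minimal sketch of the private variant being discussed. It uses a hypothetical `DeviceMeshSketch` class rather than the actual `torch.distributed.device_mesh.DeviceMesh` code: the root link lives in a private attribute and is reached through an internal helper instead of a public property.

```python
from typing import Optional


class DeviceMeshSketch:
    """Toy stand-in for DeviceMesh; only the root-mesh bookkeeping is shown."""

    def __init__(self, root_mesh: Optional["DeviceMeshSketch"] = None) -> None:
        # None means this mesh was not created by slicing, i.e. it is its own root.
        self._root_mesh = root_mesh

    def _get_root_mesh(self) -> "DeviceMeshSketch":
        # A mesh with no stored root is a root mesh itself.
        return self._root_mesh if self._root_mesh is not None else self
```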
```python
self.device_type,
self.mesh_dim_names,
self._thread_id,
self._root_mesh,
```
Here and in the equality operator below, could we just use the id(...) of the root mesh?
`id(...)` is the fast path. If the CPU overhead is not too big, I think we can still keep the old way; once we have the universe concept, we can switch to id-based comparison.
A 3-5% CPU-overhead regression was observed, which should be fine because it enables a cleaner DeviceMesh implementation.
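For illustration, a hedged sketch (on a toy class, not the real DeviceMesh) of the two hashing strategies discussed in this thread: hashing the descriptive fields versus taking the `id(...)` of the root mesh as a fast path.

```python
class MeshKeySketch:
    """Toy class contrasting field-based and id-based hashing of a mesh."""

    def __init__(self, device_type: str, dim_names: tuple, root_mesh=None) -> None:
        self.device_type = device_type
        self.dim_names = dim_names
        self._root_mesh = root_mesh  # None means "I am my own root"

    def _root(self) -> "MeshKeySketch":
        return self._root_mesh if self._root_mesh is not None else self

    def field_based_hash(self) -> int:
        # "Old way": hash the actual field contents of the mesh.
        return hash((self.device_type, self.dim_names))

    def id_based_hash(self) -> int:
        # Fast path: hash the identity of the root mesh object, so two meshes
        # compare equal only when they share the same root object.
        return hash(id(self._root()))
```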
fegin left a comment:
These methods should be private.
…165787) We moved the method to get the root mesh into the class in #164510; this further cleans up the code. Differential Revision: [D85090191](https://our.internmc.facebook.com/intern/diff/D85090191) Pull Request resolved: #165787 Approved by: https://github.com/fegin
This PR moves the bookkeeping state maps from MeshEnv into DeviceMesh class members. The motivation is that these global variables are, in general, thread-local and can cause potential issues. We also need to run the DTensor CPU-overhead benchmark for this change. A 3-5% CPU overhead in DTensor has been observed. Before: <img width="1147" height="535" alt="image" src="https://github.com/user-attachments/assets/9e4ac018-ec0a-46a4-8f2c-64b4dbec465c" /> After: <img width="1114" height="576" alt="image" src="https://github.com/user-attachments/assets/eaf83660-652b-4c6b-8591-f6049ccdd14c" /> This was measured by running the benchmark mentioned here: pytorch#159169 Pull Request resolved: pytorch#164510 Approved by: https://github.com/lw, https://github.com/fegin
pytorch#161224) Pull Request resolved: pytorch#161224 Approved by: https://github.com/lw, https://github.com/fegin ghstack dependencies: pytorch#164510
…hat we don't need to compare root mesh (#166003) Since we already share a flattened tensor `_rank_map` across all meshes from the same root mesh, we can use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (with the same `_rank_map` and layout, the mesh tensor is guaranteed to be the same). This way we also win back the CPU overhead added in #164510 and further simplify the code. We have a more ambitious universe-based change in #165680, but it needs more discussion and would be BC-breaking; we might eventually merge that PR, but probably not now. This change is not BC-breaking and will help concatenate and the 2D integration with concatenate. Pull Request resolved: #166003 Approved by: https://github.com/Skylion007, https://github.com/fegin
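For illustration, a hedged sketch (hypothetical names, not the real implementation) of the idea above: meshes sliced from the same root share one flattened rank map, so hashing the rank map plus the per-mesh layout can stand in for comparing root meshes.

```python
import torch


class RankMapMeshSketch:
    """Toy mesh keyed by a shared rank map plus a per-mesh layout."""

    def __init__(self, rank_map: torch.Tensor, layout: tuple) -> None:
        self._rank_map = rank_map  # shared by every mesh from the same root
        self._layout = layout      # describes this mesh's view into the rank map

    def __hash__(self) -> int:
        # Same rank map + same layout implies the same mesh tensor.
        return hash((tuple(self._rank_map.tolist()), self._layout))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, RankMapMeshSketch):
            return NotImplemented
        return self._layout == other._layout and torch.equal(
            self._rank_map, other._rank_map
        )
```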
…hat we don't need to compare root mesh (#166003) (#166264) Summary: Since we already share a flattened tensor `_rank_map` across all meshes from the same root mesh, we can use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (with the same `_rank_map` and layout, the mesh tensor is guaranteed to be the same). This way we also win back the CPU overhead added in #164510 and further simplify the code. We have a more ambitious universe-based change in #165680, but it needs more discussion and would be BC-breaking; we might eventually merge that PR, but probably not now. This change is not BC-breaking and will help concatenate and the 2D integration with concatenate. cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci Test Plan: Imported from OSS Differential Revision: D85526705 Pulled By: fduwjj Pull Request resolved: #166264 Approved by: https://github.com/XilunWu
Stack from ghstack (oldest at bottom):
This PR moves the bookkeeping state maps from MeshEnv into DeviceMesh class members. The motivation is that these global variables are, in general, thread-local and can cause potential issues.
We also need to run the DTensor CPU-overhead benchmark for this change; see the sketch and results below.
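As a rough illustration of the refactor's shape (hypothetical names, not the actual MeshEnv/DeviceMesh code), bookkeeping that previously lived in a module-level registry moves onto the mesh objects themselves, so its lifetime follows the meshes rather than a global registry:

```python
from typing import Optional

# Before: module-level bookkeeping, i.e. shared global state.
_child_to_root_mesh: dict["MeshSketch", "MeshSketch"] = {}


class MeshSketch:
    """Toy mesh: the bookkeeping is carried as instance state instead."""

    def __init__(self, root_mesh: Optional["MeshSketch"] = None) -> None:
        # After: each mesh stores its own root link and flatten mapping.
        self._root_mesh = root_mesh
        self._flatten_mapping: dict[str, "MeshSketch"] = {}
```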
3-5% CPU overhead in DTensor has been observed:
Before:

After:

This was measured by running the benchmark mentioned here: #159169
cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci