
Conversation

@fduwjj
Contributor

@fduwjj fduwjj commented Oct 2, 2025

Stack from ghstack (oldest at bottom):

This PR moves the bookkeeping state maps from MeshEnv to DeviceMesh class members. The reason is that, in general, global variables are thread-local and can cause potential issues.

We will also need to run a DTensor CPU overhead benchmark for this change.

A 3-5% CPU overhead regression in DTensor has been observed:

before:
image

After:
image

These numbers come from running the benchmark mentioned here: #159169
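
For illustration, here is a minimal, hypothetical sketch of the kind of change described above: moving per-mesh bookkeeping out of a module-level MeshEnv map and into DeviceMesh instance state. Names and fields are illustrative only, not the actual torch.distributed implementation.

```python
# Hypothetical sketch only, not the actual torch.distributed code.

# Before: bookkeeping lives in a module-level environment object, shared
# by every DeviceMesh in the process.
class MeshEnv:
    def __init__(self):
        # maps a sliced (child) mesh to the root mesh it was sliced from
        self.child_to_root_mapping: dict = {}


_mesh_env = MeshEnv()  # global state


# After: the same bookkeeping becomes instance state on DeviceMesh itself.
class DeviceMesh:
    def __init__(self, device_type: str, mesh, _root_mesh=None):
        self.device_type = device_type
        self.mesh = mesh
        # None means this mesh was created directly, i.e. it is its own root.
        self._root_mesh = _root_mesh
```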

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

@pytorch-bot

pytorch-bot bot commented Oct 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164510

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bd15c9d with merge base ffc9559 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fduwjj added a commit that referenced this pull request Oct 2, 2025
ghstack-source-id: 888edf7
Pull Request resolved: #164510
@pytorch-bot pytorch-bot bot added the oncall: distributed label Oct 2, 2025
fduwjj added a commit that referenced this pull request Oct 2, 2025
ghstack-source-id: e2a86f1
Pull Request resolved: #164510
fduwjj added a commit that referenced this pull request Oct 2, 2025
ghstack-source-id: ef28fea
Pull Request resolved: #164510
@fduwjj fduwjj requested review from fegin and lw October 3, 2025 00:48
fduwjj added a commit that referenced this pull request Oct 3, 2025
ghstack-source-id: d19bb96
Pull Request resolved: #164510
Contributor

@lw lw left a comment


This is great! I like this a lot! Thanks!

Comment on lines 423 to 434
@property
def root_mesh(self) -> "DeviceMesh":
# If a mesh does not have a root mesh stored, it is a root mesh itself.
# A root mesh is not created through slicing.
return self._root_mesh if self._root_mesh else self

@root_mesh.setter
def root_mesh(self, mesh: Optional["DeviceMesh"]) -> None:
# you can add validation logic here if needed
if mesh is not None and not isinstance(mesh, DeviceMesh):
raise TypeError(f"Expected DeviceMesh or None, got {type(mesh)}")
self._root_mesh = mesh
Contributor

Could we make these attributes private? I'm still not convinced that the distinction between root and child mesh is really useful. At the end of the day, I believe, all we care about is which meshes belong to the same "universe" (i.e., which ones have the same shared state?).

While we discuss this and decide, I'd prefer if we avoid adding public API attributes which will limit us in the future.

Contributor

Agreed. I also have concerns about other public methods. Please see the flatten one.

Contributor Author

Yes, making it private, and I will have another PR for the universe idea.
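
For context, a rough sketch of the privatized shape being discussed, assuming the public property is simply replaced by a private accessor. The method name here is hypothetical (it follows the later wording about moving the root-mesh getter into the class), not the exact PR code.

```python
# Hypothetical sketch of a private root-mesh accessor; not the exact PR code.
class DeviceMesh:
    def __init__(self, root_mesh=None):
        # Set only when this mesh was created by slicing another mesh.
        self._root_mesh = root_mesh

    def _get_root_mesh(self) -> "DeviceMesh":
        # A mesh with no stored root is a root mesh itself;
        # a root mesh is never created through slicing.
        return self._root_mesh if self._root_mesh is not None else self
```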

self.device_type,
self.mesh_dim_names,
self._thread_id,
self._root_mesh,
Contributor

Here and in the equality operator below, could we just use the id(...) of the root mesh?

Contributor Author

id(...) is the fast path. If the CPU overhead is not too big, I think we can still keep the old way. Once we have the universe thing, we can do id-based comparison.

Contributor Author

A 3-5% CPU overhead regression was observed, which should be fine because it enables a cleaner DeviceMesh implementation.
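
To make the trade-off concrete, here is a hedged sketch of the id()-based fast path suggested in this thread. The field names are illustrative and the class is heavily simplified; this is not the code in this PR.

```python
# Simplified, hypothetical DeviceMesh: just enough fields to show the comparison.
class DeviceMesh:
    def __init__(self, device_type, mesh_dim_names, root_mesh=None):
        self.device_type = device_type
        self.mesh_dim_names = mesh_dim_names
        self._root_mesh = root_mesh

    def _get_root_mesh(self):
        return self._root_mesh if self._root_mesh is not None else self

    def __hash__(self):
        # Hash on the identity of the root mesh rather than on its contents.
        return hash((self.device_type, self.mesh_dim_names, id(self._get_root_mesh())))

    def __eq__(self, other):
        if self is other:  # same object, trivially equal
            return True
        if not isinstance(other, DeviceMesh):
            return NotImplemented
        return (
            self.device_type == other.device_type
            and self.mesh_dim_names == other.mesh_dim_names
            # id()-based fast path on the root mesh, as suggested above;
            # the alternative ("old way") is a full comparison of the root
            # mesh contents, which is slower but works without shared state.
            and self._get_root_mesh() is other._get_root_mesh()
        )
```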

Contributor

@fegin fegin left a comment

These methods should be private.

fduwjj added a commit that referenced this pull request Oct 8, 2025
ghstack-source-id: 58060cd
Pull Request resolved: #164510
fduwjj added a commit that referenced this pull request Oct 8, 2025
ghstack-source-id: ba9196b
Pull Request resolved: #164510
@fduwjj fduwjj requested a review from fegin October 8, 2025 21:12
fduwjj added a commit that referenced this pull request Oct 20, 2025
…oot mesh"


We moved the method to get the root mesh into the class in #164510. This is to further clean up the code.


pytorchmergebot pushed a commit that referenced this pull request Oct 21, 2025
…165787)

We moved the method to get the root mesh into the class in #164510. This is to further clean up the code.

Differential Revision: [D85090191](https://our.internmc.facebook.com/intern/diff/D85090191)
Pull Request resolved: #165787
Approved by: https://github.com/fegin
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Pull Request resolved: pytorch#164510
Approved by: https://github.com/lw, https://github.com/fegin
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 22, 2025
pytorchmergebot pushed a commit that referenced this pull request Oct 23, 2025
…hat we don't need to compare root mesh (#166003)

Since we already share a flattened tensor `_rank_map` across all meshes from the same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with the same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also give back the CPU overhead added in #164510 and further simplify the code.

We do have a more ambitious universe-based change here: #165680, but it needs more discussion and would be BC-breaking. We might eventually merge that PR, but probably not now; this change is not BC-breaking and will help concatenate and the 2D integration with concatenate.

Pull Request resolved: #166003
Approved by: https://github.com/Skylion007, https://github.com/fegin
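
As a hedged illustration of that follow-up idea: comparing a flattened view of the shared `_rank_map` plus the layout, instead of comparing root meshes. Attribute names follow the commit message; the class is simplified and not the real implementation.

```python
import torch

# Simplified sketch: two meshes compare equal when they carry the same
# flattened rank map and the same layout, regardless of which root mesh
# they were sliced from.
class DeviceMesh:
    def __init__(self, rank_map: torch.Tensor, layout, device_type: str):
        self._rank_map = rank_map      # flattened ranks, shared across the root mesh
        self._layout = layout          # e.g. a hashable (sizes, strides) into _rank_map
        self.device_type = device_type
        # Precompute a hashable flattened list once, for cheap __hash__/__eq__.
        self._flatten_rank_map = tuple(rank_map.tolist())

    def __eq__(self, other):
        if self is other:
            return True
        if not isinstance(other, DeviceMesh):
            return NotImplemented
        return (
            self.device_type == other.device_type
            and self._layout == other._layout
            and self._flatten_rank_map == other._flatten_rank_map
        )

    def __hash__(self):
        return hash((self.device_type, self._layout, self._flatten_rank_map))
```
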
fduwjj added a commit that referenced this pull request Oct 25, 2025
…hat we don't need to compare root mesh (#166003)

fduwjj added a commit that referenced this pull request Oct 25, 2025
…ace _flatten_mesh_list so that we don't need to compare root mesh"


Differential Revision: [D85394822](https://our.internmc.facebook.com/intern/diff/D85394822)
fduwjj added a commit that referenced this pull request Oct 25, 2025
…h_list so that we don't need to compare root mesh"


fduwjj added a commit that referenced this pull request Oct 25, 2025
…ace _flatten_mesh_list so that we don't need to compare root mesh"


fduwjj added a commit that referenced this pull request Oct 25, 2025
…h_list so that we don't need to compare root mesh"


fduwjj added a commit that referenced this pull request Oct 25, 2025
…hat we don't need to compare root mesh (#166003)

fduwjj added a commit that referenced this pull request Oct 25, 2025
…ace _flatten_mesh_list so that we don't need to compare root mesh"


fduwjj added a commit that referenced this pull request Oct 25, 2025
…hat we don't need to compare root mesh (#166003)

fduwjj added a commit that referenced this pull request Oct 25, 2025
…h_list so that we don't need to compare root mesh"


fduwjj added a commit that referenced this pull request Oct 26, 2025
…ace _flatten_mesh_list so that we don't need to compare root mesh"


fduwjj added a commit that referenced this pull request Oct 26, 2025
…h_list so that we don't need to compare root mesh"


fduwjj added a commit that referenced this pull request Oct 26, 2025
…hat we don't need to compare root mesh (#166003)

pytorch-bot bot pushed a commit that referenced this pull request Oct 26, 2025
…hat we don't need to compare root mesh (#166003)

pytorchmergebot pushed a commit that referenced this pull request Oct 27, 2025
…hat we don't need to compare root mesh (#166003) (#166264)

Summary:

Since we already share a flattened tensor `_rank_map` across all meshes from the same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with the same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also give back the CPU overhead added in #164510 and further simplify the code.

We do have a more ambitious universe-based change here: #165680, but it needs more discussion and would be BC-breaking. We might eventually merge that PR, but probably not now; this change is not BC-breaking and will help concatenate and the 2D integration with concatenate.

imported-using-ghimport

Test Plan: Imported from OSS

Differential Revision: D85526705

Pulled By: fduwjj

Pull Request resolved: #166264
Approved by: https://github.com/XilunWu
yiming0416 pushed a commit that referenced this pull request Oct 27, 2025
…hat we don't need to compare root mesh (#166003) (#166264)

yiming0416 pushed a commit that referenced this pull request Oct 27, 2025
…hat we don't need to compare root mesh (#166003) (#166264)

yiming0416 pushed a commit that referenced this pull request Oct 27, 2025
…hat we don't need to compare root mesh (#166003) (#166264)

yiming0416 pushed a commit that referenced this pull request Oct 27, 2025
…hat we don't need to compare root mesh (#166003) (#166264)

tianrengao pushed a commit that referenced this pull request Oct 30, 2025
…hat we don't need to compare root mesh (#166003) (#166264)

@github-actions github-actions bot deleted the gh/fduwjj/217/head branch November 10, 2025 02:21

Labels

ciflow/trunk · Merged · oncall: distributed · release notes: DeviceMesh · release notes: distributed (checkpoint)

Projects

None yet


6 participants