
Parity of rng offset compute and ranks subset support for Local Tensor#169088

Closed
dzmitry-huba wants to merge 2 commits into gh/dzmitry-huba/16/base from gh/dzmitry-huba/16/head

Conversation

@dzmitry-huba (Contributor) commented Nov 25, 2025

Stack from ghstack (oldest at bottom):

Debugging numeric differences for AutoParallel PP between the Local Tensor and multi-process
setups revealed differences in how rng offsets are computed. This change refactors the DTensor
implementation so that it can be shared with Local Tensor. The existing Local Tensor
implementation was incorrectly computing the shard linear index from the number of elements
in the tensor instead of from the shard coordinates.
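
As an illustration of the fix (a minimal sketch, not the PR's actual code; the function and
argument names are assumptions), the shard linear index should be derived from the shard's
coordinates on the device mesh rather than from the tensor's numel:

from typing import Sequence

def shard_linear_index(mesh_coords: Sequence[int], mesh_shape: Sequence[int]) -> int:
    # Row-major linear index of a shard given its coordinates on the mesh.
    # For example, on a (2, 4) mesh the shard at coordinates (1, 2) gets
    # index 1 * 4 + 2 = 6, independent of how many elements the shard holds.
    idx = 0
    for coord, dim_size in zip(mesh_coords, mesh_shape):
        idx = idx * dim_size + coord
    return idx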

AutoParallel PP slices the world mesh into "pp" submeshes for MPMD execution and "dp_mod_ep, ep"
submeshes for SPMD execution. Local Tensor uses the default process group (corresponding to the
world mesh) to compute collective groups and assumes that the input local tensors carry ranks
from the world mesh. However, Local Tensor mode can be created with a subset of ranks, and this
feature is used in the AutoParallel PP integration. Therefore this change modifies Local Tensor
collectives to execute only if all ranks from the deduced rank groups are present in the local
tensor inputs.
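
A minimal sketch of that rank-presence check (illustrative names only, not the PR's actual
helpers): a collective runs only when every rank of the deduced group is represented by the
local tensor inputs.

def should_run_collective(group_ranks: set[int], input_ranks: set[int]) -> bool:
    # Rank groups are deduced from the default (world) process group, so when
    # the local tensor mode holds only a subset of world ranks, some groups may
    # reference ranks that are not present in the inputs; skip those groups.
    return group_ranks <= input_ranks

# Example with an 8-rank world where this mode simulates only ranks {0, 1, 2, 3}:
input_ranks = {0, 1, 2, 3}
assert should_run_collective({0, 1, 2, 3}, input_ranks)      # execute
assert not should_run_collective({4, 5, 6, 7}, input_ranks)  # skip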


[ghstack-poisoned]
@pytorch-bot bot commented Nov 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169088

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3e4ec54 with merge base 481e5ab:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

dzmitry-huba added a commit that referenced this pull request Nov 25, 2025

ghstack-source-id: 756a876
Pull Request resolved: #169088
@dzmitry-huba dzmitry-huba requested review from dolpm and ezyang November 25, 2025 23:27
@dzmitry-huba dzmitry-huba marked this pull request as ready for review November 25, 2025 23:27
@dolpm (Contributor) left a comment


thanks for fixing this!

state._per_rank_states[rank][8:].view(dtype=torch.int64).item()
)

local_shape = _calc_first_shard_size(spec)

ahh, makes sense. I don't think this should affect the post_op since that will just bump all of the offsets by the full numel
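
For context, a minimal sketch of the offset bookkeeping being discussed (not the PR's actual
code; names are assumptions): the pre-op offset is derived from the shard's linear index, while
the post-op step bumps the shared offset by the full tensor numel.

def pre_op_offset(base_offset: int, shard_linear_idx: int, shard_numel: int) -> int:
    # Each shard starts drawing from a disjoint slice of the random stream.
    return base_offset + shard_linear_idx * shard_numel

def post_op_offset(base_offset: int, full_numel: int) -> int:
    # After the op, every rank advances by the full tensor numel so that all
    # ranks agree on the next base offset regardless of which shard they held.
    return base_offset + full_numel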

Comment on lines 1130 to 1139
def get_local_tensor_mode_list() -> list["LocalTensorMode"]:
global _PROCESS_MODE
if _PROCESS_MODE:
global _PROCESS_LOCAL_TENSOR_MODE
return _PROCESS_LOCAL_TENSOR_MODE
global _THREAD_LOCAL_TENSOR_MODE
if not hasattr(_THREAD_LOCAL_TENSOR_MODE, "value"):
_THREAD_LOCAL_TENSOR_MODE.value = []
if len(_THREAD_LOCAL_TENSOR_MODE.value) > 0:
return _THREAD_LOCAL_TENSOR_MODE.value
return _GLOBAL_LOCAL_TENSOR_MODE
return _THREAD_LOCAL_TENSOR_MODE.value


:(

…Local Tensor"


[ghstack-poisoned]
dzmitry-huba added a commit that referenced this pull request Dec 1, 2025

ghstack-source-id: 8fbecc7
Pull Request resolved: #169088
@ezyang (Contributor) commented Dec 1, 2025

Can you explain a little more why the rank test (as not all ranks are in the mode) is needed? Naively I would have hoped that we never actually call into LocalTensor collectives from an MPMD thread that actually isn't going to do that collective.

@dzmitry-huba (Contributor, Author)
This is mainly for the pipeline parallel case. The typical setup is to first create a single device mesh with multiple dimensions (pp, ep_modulo_dp, dp), then slice the "world" mesh into SPMD (ep_modulo_dp, dp) and MPMD (pp) sub-meshes. Within the SPMD sub-mesh we execute collectives excluding send/recv; within the MPMD sub-mesh we execute send/recv.

For each SPMD execution thread there are two choices for how to create the local tensor mode (that is, which ranks to include in it). Option 1 is to create the local tensor mode with the ranks from the "world" mesh, even though some of those ranks are not part of the SPMD sub-mesh. Option 2 is to create the local tensor mode only with the ranks from the SPMD sub-mesh. Option 2 is better because we are not executing ops on "ghost" ranks, and it is clearer for debugging.

If we go with option 2 (see the example here: https://github.com/meta-pytorch/autoparallel/pull/252/files#diff-f8de38a298340c99adc397cfabcedea5f8c368ae110fe4d2d22a64cc84d74b59R577), then during collective execution we need to take care of the rank groups that were automatically deduced from the default process group (the "world" process group) and execute the collective only for those groups whose ranks are present in the local tensor (inherited from the local tensor mode created from the sub-mesh). Hence the rank test.
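
To make the setup concrete, here is a small pure-Python sketch (illustrative only; the real AutoParallel PP integration uses DeviceMesh slicing) of how an 8-rank world mesh with dimensions (pp=2, dp_mod_ep=2, ep=2) decomposes into per-stage SPMD rank sets, which is what option 2 passes to the local tensor mode:

import itertools

pp, dp_mod_ep, ep = 2, 2, 2
# Ranks laid out row-major over the (pp, dp_mod_ep, ep) mesh dimensions.
world = [[[p * dp_mod_ep * ep + d * ep + e for e in range(ep)]
          for d in range(dp_mod_ep)] for p in range(pp)]

for p in range(pp):
    spmd_ranks = sorted(itertools.chain.from_iterable(world[p]))
    # Option 1 would create the local tensor mode with all world ranks (0..7),
    # including "ghost" ranks outside this stage; option 2 uses only spmd_ranks.
    print(f"pp stage {p}: SPMD ranks {spmd_ranks}")
# pp stage 0: SPMD ranks [0, 1, 2, 3]
# pp stage 1: SPMD ranks [4, 5, 6, 7]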

@dzmitry-huba (Contributor, Author)
@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 1, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
#169088)

Pull Request resolved: #169088
Approved by: https://github.com/dolpm
@github-actions github-actions bot deleted the gh/dzmitry-huba/16/head branch January 2, 2026 02:20

Labels

ciflow/inductor, ciflow/trunk, Merged, release notes: distributed (c10d)
