
Parity of rng offset compute and ranks subset support for Local Tensor#169088

Closed
dzmitry-huba wants to merge 2 commits into gh/dzmitry-huba/16/base from gh/dzmitry-huba/16/head

Conversation

@dzmitry-huba (Contributor) commented Nov 25, 2025

Stack from ghstack (oldest at bottom):

Debugging numeric differences for AutoParallel PP between the Local Tensor and multi-process
setups revealed differences in how rng offsets are computed. This change refactors the DTensor
implementation so that it can be shared with Local Tensor. The existing Local Tensor
implementation was incorrectly computing the shard linear index from the number of elements
in the tensor instead of from the shard coordinates.
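
As an illustration of the fix (a minimal sketch, not the PR's actual code; the function and
argument names are assumptions), the shard linear index should be derived from the shard's
coordinates on the device mesh rather than from the tensor's numel:

from typing import Sequence

def shard_linear_index(mesh_coords: Sequence[int], mesh_shape: Sequence[int]) -> int:
    # Row-major linear index of a shard given its coordinates on the mesh.
    # For example, on a (2, 4) mesh the shard at coordinates (1, 2) gets
    # index 1 * 4 + 2 = 6, independent of how many elements the shard holds.
    idx = 0
    for coord, dim_size in zip(mesh_coords, mesh_shape):
        idx = idx * dim_size + coord
    return idx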

AutoParallel PP slices the world mesh into "pp" submeshes for MPMD execution and "dp_mod_ep, ep"
submeshes for SPMD execution. Local Tensor uses the default process group (corresponding to the
world mesh) to compute collective groups and assumes that the input local tensors carry ranks
from the world mesh. However, Local Tensor mode can be created with a subset of ranks, and this
feature is used in the AutoParallel PP integration. Therefore this change modifies Local Tensor
collectives to execute only if all ranks from the deduced rank groups are present in the local
tensor inputs.
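
A minimal sketch of that rank-presence check (illustrative names only, not the PR's actual
helpers): a collective runs only when every rank of the deduced group is represented by the
local tensor inputs.

def should_run_collective(group_ranks: set[int], input_ranks: set[int]) -> bool:
    # Rank groups are deduced from the default (world) process group, so when
    # the local tensor mode holds only a subset of world ranks, some groups may
    # reference ranks that are not present in the inputs; skip those groups.
    return group_ranks <= input_ranks

# Example with an 8-rank world where this mode simulates only ranks {0, 1, 2, 3}:
input_ranks = {0, 1, 2, 3}
assert should_run_collective({0, 1, 2, 3}, input_ranks)      # execute
assert not should_run_collective({4, 5, 6, 7}, input_ranks)  # skip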


[ghstack-poisoned]
@pytorch-bot bot commented Nov 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169088

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3e4ec54 with merge base 481e5ab:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

dzmitry-huba added a commit that referenced this pull request Nov 25, 2025

ghstack-source-id: 756a876
Pull Request resolved: #169088
@dzmitry-huba dzmitry-huba requested review from dolpm and ezyang November 25, 2025 23:27
@dzmitry-huba dzmitry-huba marked this pull request as ready for review November 25, 2025 23:27
@dolpm (Contributor) left a comment


thanks for fixing this!

state._per_rank_states[rank][8:].view(dtype=torch.int64).item()
)

local_shape = _calc_first_shard_size(spec)

ahh, makes sense. I don't think this should affect the post_op since that will just bump all of the offsets by the full numel
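
For context, a minimal sketch of the offset bookkeeping being discussed (not the PR's actual
code; names are assumptions): the pre-op offset is derived from the shard's linear index, while
the post-op step bumps the shared offset by the full tensor numel.

def pre_op_offset(base_offset: int, shard_linear_idx: int, shard_numel: int) -> int:
    # Each shard starts drawing from a disjoint slice of the random stream.
    return base_offset + shard_linear_idx * shard_numel

def post_op_offset(base_offset: int, full_numel: int) -> int:
    # After the op, every rank advances by the full tensor numel so that all
    # ranks agree on the next base offset regardless of which shard they held.
    return base_offset + full_numel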

Comment on lines 1130 to 1139
def get_local_tensor_mode_list() -> list["LocalTensorMode"]:
global _PROCESS_MODE
if _PROCESS_MODE:
global _PROCESS_LOCAL_TENSOR_MODE
return _PROCESS_LOCAL_TENSOR_MODE
global _THREAD_LOCAL_TENSOR_MODE
if not hasattr(_THREAD_LOCAL_TENSOR_MODE, "value"):
_THREAD_LOCAL_TENSOR_MODE.value = []
if len(_THREAD_LOCAL_TENSOR_MODE.value) > 0:
return _THREAD_LOCAL_TENSOR_MODE.value
return _GLOBAL_LOCAL_TENSOR_MODE
return _THREAD_LOCAL_TENSOR_MODE.value


:(

…Local Tensor"


[ghstack-poisoned]
dzmitry-huba added a commit that referenced this pull request Dec 1, 2025

ghstack-source-id: 8fbecc7
Pull Request resolved: #169088
@ezyang (Contributor) commented Dec 1, 2025

Can you explain a little more why the rank test (as not all ranks are in the mode) is needed? Naively I would have hoped that we never actually call into LocalTensor collectives from an MPMD thread that actually isn't going to do that collective.

@dzmitry-huba (Contributor, Author)
This is mainly for the pipeline parallel case. The typical setup is to first create a single device mesh with multiple dimensions (pp, ep_modulo_dp, dp), then slice the "world" mesh into SPMD (ep_modulo_dp, dp) and MPMD (pp) sub-meshes. Within the SPMD sub-mesh we execute collectives excluding send/recv; within the MPMD sub-mesh we execute send/recv.

For each SPMD execution thread there are two choices for how to create the local tensor mode (that is, which ranks to include in it). Option 1 is to create the local tensor mode with the ranks from the "world" mesh, even though some of those ranks are not part of the SPMD sub-mesh. Option 2 is to create the local tensor mode only with the ranks from the SPMD sub-mesh. Option 2 is better because we are not executing ops on "ghost" ranks, and it is clearer for debugging.

If we go with option 2 (see the example here: https://github.com/meta-pytorch/autoparallel/pull/252/files#diff-f8de38a298340c99adc397cfabcedea5f8c368ae110fe4d2d22a64cc84d74b59R577), then during collective execution we need to take care of the rank groups that were automatically deduced from the default process group (the "world" process group) and execute the collective only for those groups whose ranks are present in the local tensor (inherited from the local tensor mode created from the sub-mesh). Hence the rank test.
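
To make the setup concrete, here is a small pure-Python sketch (illustrative only; the real AutoParallel PP integration uses DeviceMesh slicing) of how an 8-rank world mesh with dimensions (pp=2, dp_mod_ep=2, ep=2) decomposes into per-stage SPMD rank sets, which is what option 2 passes to the local tensor mode:

import itertools

pp, dp_mod_ep, ep = 2, 2, 2
# Ranks laid out row-major over the (pp, dp_mod_ep, ep) mesh dimensions.
world = [[[p * dp_mod_ep * ep + d * ep + e for e in range(ep)]
          for d in range(dp_mod_ep)] for p in range(pp)]

for p in range(pp):
    spmd_ranks = sorted(itertools.chain.from_iterable(world[p]))
    # Option 1 would create the local tensor mode with all world ranks (0..7),
    # including "ghost" ranks outside this stage; option 2 uses only spmd_ranks.
    print(f"pp stage {p}: SPMD ranks {spmd_ranks}")
# pp stage 0: SPMD ranks [0, 1, 2, 3]
# pp stage 1: SPMD ranks [4, 5, 6, 7]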

@dzmitry-huba (Contributor, Author)
@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 1, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
#169088)

Pull Request resolved: #169088
Approved by: https://github.com/dolpm
@github-actions github-actions bot deleted the gh/dzmitry-huba/16/head branch January 2, 2026 02:20

Labels

ciflow/inductor, ciflow/trunk, Merged, release notes: distributed (c10d)
