Parity of rng offset compute and ranks subset support for Local Tensor #169088
dzmitry-huba wants to merge 2 commits into gh/dzmitry-huba/16/base from
Conversation
Debugging numeric differences for AutoParallel PP between Local Tensor and the multi-process setup revealed differences in how rng offsets are computed. This change refactors the DTensor implementation so that it can be shared with Local Tensor. The existing Local Tensor implementation was incorrectly computing the shard linear index from the number of elements in the tensor instead of from the shard coordinates.

AutoParallel PP slices the world mesh into "pp" submeshes for MPMD execution and "dp_mod_ep, ep" submeshes for SPMD execution. Local Tensor uses the default process group (corresponding to the world mesh) to compute collective groups and assumes the input local tensors carry ranks from the world mesh. However, Local Tensor mode can be created with a subset of ranks, and this feature is used in the AutoParallel PP integration. Therefore this change modifies Local Tensor collectives to execute only if all ranks from the deduced rank groups are present on the local tensor inputs.

[ghstack-poisoned]
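To make the first point concrete, here is a minimal sketch of what "shard linear index from shard coordinates" means. This is not the PR's code; `mesh_coords`, `shards_per_dim`, and `local_numel` are illustrative assumptions.

```python
# Minimal sketch of the idea behind the rng-offset fix (NOT the actual DTensor/Local Tensor code).
# The per-shard philox offset should be derived from the shard's position (coordinates)
# on the sharded mesh dimensions, not from the local tensor's numel.

def shard_linear_index(mesh_coords: list[int], shards_per_dim: list[int]) -> int:
    """Row-major linearization of this rank's shard coordinates on the sharded mesh dims."""
    idx = 0
    for coord, num_shards in zip(mesh_coords, shards_per_dim):
        idx = idx * num_shards + coord
    return idx

# Each shard then reserves its own slice of the philox offset space, roughly:
#   offset = base_offset + shard_linear_index(coords, shards) * local_numel
# Deriving the index from numel alone (the old Local Tensor behavior) can give different
# shards the same index, which is why the offsets diverged from the multi-process run.
```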
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169088
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 3e4ec54 with merge base 481e5ab.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 756a876
Pull Request resolved: #169088
    state._per_rank_states[rank][8:].view(dtype=torch.int64).item()
)

local_shape = _calc_first_shard_size(spec)

Ahh, makes sense. I don't think this should affect the post_op since that will just bump all of the offsets by the full numel.
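For readers following the diff: the first snippet above pulls the philox offset back out of a per-rank RNG state buffer. A small sketch of that decoding, assuming the usual layout where the first 8 bytes hold the seed and the next 8 bytes hold the offset (an assumption based on the `[8:]` slice, not something stated in the PR):

```python
import torch

# Decode a per-rank RNG state buffer (uint8 tensor) into (seed, offset).
# Layout assumption: bytes 0-7 = seed, bytes 8-15 = offset, both stored as int64.
def read_seed_and_offset(state_bytes: torch.Tensor) -> tuple[int, int]:
    seed = state_bytes[:8].view(dtype=torch.int64).item()
    offset = state_bytes[8:16].view(dtype=torch.int64).item()
    return seed, offset
```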
def get_local_tensor_mode_list() -> list["LocalTensorMode"]:
    global _PROCESS_MODE
    if _PROCESS_MODE:
        global _PROCESS_LOCAL_TENSOR_MODE
        return _PROCESS_LOCAL_TENSOR_MODE
    global _THREAD_LOCAL_TENSOR_MODE
    if not hasattr(_THREAD_LOCAL_TENSOR_MODE, "value"):
        _THREAD_LOCAL_TENSOR_MODE.value = []
    if len(_THREAD_LOCAL_TENSOR_MODE.value) > 0:
        return _THREAD_LOCAL_TENSOR_MODE.value
    return _GLOBAL_LOCAL_TENSOR_MODE
    return _THREAD_LOCAL_TENSOR_MODE.value
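As a side note, a tiny usage sketch of the helper above; `current_local_tensor_mode` is hypothetical and not part of the PR, it just shows how the returned list could be consumed:

```python
# Hypothetical wrapper, not in the PR: grab the innermost active LocalTensorMode,
# or None when no mode is active in the current thread/process.
def current_local_tensor_mode():
    modes = get_local_tensor_mode_list()
    return modes[-1] if modes else None
```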
|
|
ghstack-source-id: 8fbecc7
Pull Request resolved: #169088
|
Can you explain a little more why the rank test (as not all ranks are in the mode) is needed? Naively I would have hoped that we never actually call into LocalTensor collectives from an MPMD thread that actually isn't going to do that collective.
|
This is mainly for the pipeline parallel case. The typical setup is to first create a single device mesh with multiple dimensions (pp, ep_modulo_dp, dp), then slice the "world" mesh into SPMD (ep_modulo_dp, dp) and MPMD (pp) sub-meshes. Within the SPMD submesh we execute collectives excluding send/recv; within the MPMD submesh we execute send/recv.

For each SPMD execution thread we have two choices for how to create the local tensor mode (specifically, which ranks to include in it):

1. Create the local tensor mode with the ranks from the "world" mesh, even though some of those ranks are not part of the SPMD sub-mesh.
2. Create the local tensor mode only with the ranks from the "spmd" sub-mesh.

Option 2 is better because we are not executing ops on "ghost" ranks and it is clearer for debugging. If we go with option 2 (see example here https://github.com/meta-pytorch/autoparallel/pull/252/files#diff-f8de38a298340c99adc397cfabcedea5f8c368ae110fe4d2d22a64cc84d74b59R577), then during collective execution we need to take care of the rank groups that were automatically deduced from the default process group (the "world" process group) and execute the collective only for the groups whose ranks are all present in the local tensor (inherited from the local tensor mode created from the sub-mesh). Hence the rank test.
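Roughly, the rank test amounts to a subset check between the deduced collective group and the ranks the LocalTensorMode actually simulates. A minimal sketch under that assumption (names are illustrative, not the PR's exact code):

```python
# Sketch of the rank test described above (illustrative names, not the PR's code).
# A collective whose rank group was deduced from the default ("world") process group
# only runs if every rank in that group is carried by the local tensor inputs /
# simulated by the current LocalTensorMode.
def should_run_collective(group_ranks: set[int], local_tensor_ranks: set[int]) -> bool:
    return set(group_ranks).issubset(local_tensor_ranks)

# Example: the SPMD sub-mesh mode simulates ranks {0, 1, 2, 3}. A group {0, 1}
# deduced from the world mesh runs; a group {4, 5} belonging to another pp stage is skipped.
```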
|
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #169088
Approved by: https://github.com/dolpm