
[DTensor] Fix slow sharding prop for stack #169519

Closed
wconstab wants to merge 1 commit into gh/wconstab/467/base from gh/wconstab/467/head

Conversation

@wconstab
Contributor

@wconstab wconstab commented Dec 3, 2025

Stack from ghstack (oldest at bottom):

As identified in the original issue, there is quadratic complexity in
the number of input tensors, due to an improperly written sharding prop
rule.

The previous code generated N output strategies for the stack op, one
based on each of the original N input strategies. However, each of the
N output strategies was identical, since the heuristic in the stack rule
is to pick one of the N inputs and follow it.

We now just generate one output strategy.

Fixes #169445
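
To make the complexity change concrete, here is a minimal, self-contained sketch of the before/after behavior described above. It is not the real DTensor code: placements are modeled as plain strings, redistribute costs as 0/1 integers, and old_stack_rule / new_stack_rule are illustrative names, not functions from the PyTorch source.

def old_stack_rule(input_placements: list[str]) -> list[dict]:
    # One candidate per input strategy; every candidate re-walks all N inputs to
    # compute redistribute costs, so the work is O(N^2), and all N candidates are
    # identical because the rule always follows the same chosen input.
    follow = input_placements[0]
    candidates: list[dict] = []
    for _ in input_placements:  # N candidates ...
        costs = [0 if p == follow else 1 for p in input_placements]  # ... times N inputs
        candidates.append({"placement": follow, "redistribute_costs": costs})
    return candidates


def new_stack_rule(input_placements: list[str]) -> list[dict]:
    # A single candidate: choose the placement to follow once and compute the
    # per-input redistribute costs once, so the work is O(N).
    follow = input_placements[0]
    costs = [0 if p == follow else 1 for p in input_placements]
    return [{"placement": follow, "redistribute_costs": costs}]


if __name__ == "__main__":
    inputs = ["Shard(0)", "Shard(0)", "Replicate()"]
    # The old rule produces N identical candidates; the new rule keeps just one.
    assert old_stack_rule(inputs) == new_stack_rule(inputs) * len(inputs)
    print(new_stack_rule(inputs))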

@pytorch-bot

pytorch-bot bot commented Dec 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169519

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit fca2760 with merge base e3f24fd:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Dec 3, 2025
ghstack-source-id: 5b7b303
Pull Request resolved: #169519
@wconstab wconstab requested a review from zpcore December 3, 2025 23:39
first_input_strategy = input_tuple_strategy.children[0]
if not isinstance(first_input_strategy, OpStrategy):
    raise AssertionError(f"Expected OpStrategy, got {first_input_strategy}")
input_strategies: list[OpStrategy] = []
Contributor Author

@wconstab wconstab Dec 3, 2025


This part was just to make mypy happy below: children are typed as 'StrategyType', which can be a TupleStrategy or an OpStrategy, so we need to ensure they are all OpStrategy...
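
To make the mypy point concrete, here is a standalone sketch of the narrowing pattern, using stand-in classes rather than the real DTensor types (collect_op_strategies is an illustrative helper, not code from this PR):

class StrategyType:
    pass


class OpStrategy(StrategyType):
    pass


class TupleStrategy(StrategyType):
    def __init__(self, children: list[StrategyType]) -> None:
        self.children = children


def collect_op_strategies(tuple_strategy: TupleStrategy) -> list[OpStrategy]:
    # children is typed as list[StrategyType]; the isinstance check both guards at
    # runtime against a non-OpStrategy child and narrows the element type for mypy,
    # so appending to a list[OpStrategy] type-checks cleanly.
    input_strategies: list[OpStrategy] = []
    for child in tuple_strategy.children:
        if not isinstance(child, OpStrategy):
            raise AssertionError(f"Expected OpStrategy, got {child}")
        input_strategies.append(child)
    return input_strategies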

@zpcore
Member

zpcore commented Dec 3, 2025

Hmm, this seems to be a good case where we need detect_exists_identical_opspec to verify op strategy to prevent generating the same opspec.

@wconstab wconstab added the release notes: distributed (dtensor) release notes category label Dec 3, 2025
@wconstab
Contributor Author

wconstab commented Dec 3, 2025

Hmm, this seems to be a good case where we need detect_exists_identical_opspec to verify op strategy to prevent generating the same opspec.

why? I don't follow

Member

@zpcore zpcore left a comment


LGTM!

output_spec = DTensorSpec(mesh, tuple(follow_placements))
redistribute_cost = []
for input_spec in input_specs:
    cost = generate_redistribute_costs(strategy, input_spec)
Contributor Author


@zpcore one thing I would like to confirm: this old code looks incorrect to me, in addition to being slower.

We should never be generating the redistribute cost from input 2's placement to input 1's dst spec, right? So using 'strategy' here was a bug?

Member


Good catch! Should be:

for idx, input_spec in enumerate(input_specs):
    cost = generate_redistribute_costs(input_tuple_strategy.children[idx], input_spec)
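
To make the indexing point concrete, here is a toy, self-contained model of that pairing. Plain strings stand in for strategies and specs, and generate_redistribute_costs below is a stand-in with the same name, not the real DTensor helper:

def generate_redistribute_costs(src_spec: str, dst_spec: str) -> float:
    # Stand-in cost model: free when the source placement already matches the target.
    return 0.0 if src_spec == dst_spec else 1.0


def per_input_costs(input_src_specs: list[str], input_dst_specs: list[str]) -> list[float]:
    # Pair input i's current spec with input i's desired spec. Pairing one input's
    # spec against another input's destination is exactly the mix-up discussed above.
    return [
        generate_redistribute_costs(src, dst)
        for src, dst in zip(input_src_specs, input_dst_specs)
    ]


if __name__ == "__main__":
    # Inputs 0 and 1 already match their targets; input 2 needs a redistribute.
    print(per_input_costs(["Shard(0)", "Shard(0)", "Replicate()"],
                          ["Shard(0)", "Shard(0)", "Shard(0)"]))  # [0.0, 0.0, 1.0]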

@zpcore
Member

zpcore commented Dec 4, 2025

Hmm, this seems to be a good case where we need detect_exists_identical_opspec to verify op strategy to prevent generating the same opspec.

why? I don't follow

To clarify, detect_exists_identical_opspec is for unit tests. We can do:

self.assertTrue(
    detect_exists_identical_opspec(
        *sample_input_args,
        op=aten.stack.default,
        mesh=mesh,
        strategy_function=stack_strategy,
    )
)

This is a necessary but not sufficient test to say the strategy is not generating duplicated OpSpecs.

@wconstab
Contributor Author

wconstab commented Dec 4, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 4, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx942.1)

Details for Dev Infra team. Raised by workflow job.

Collaborator

@albanD albanD left a comment


Sounds good!
Any reason this strategy is not shared with cat()?

@wconstab
Contributor Author

wconstab commented Dec 4, 2025

Historically, not sure. If they can be shared, I'll do it as part of a bigger rewrite I'm working on.

@wconstab
Contributor Author

wconstab commented Dec 4, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx942.1)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@wconstab
Contributor Author

wconstab commented Dec 5, 2025

@pytorchbot merge -f

@pytorch-bot

pytorch-bot bot commented Dec 5, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot merge: error: argument -f/--force: expected one argument

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Try @pytorchbot --help for more info.

@wconstab
Contributor Author

wconstab commented Dec 5, 2025

@pytorchbot merge -f "merge -i got stuck?"

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

umechand-amd pushed a commit to ROCm/pytorch that referenced this pull request Dec 8, 2025
Pull Request resolved: pytorch#169519
Approved by: https://github.com/zpcore, https://github.com/malfet, https://github.com/albanD
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
Pull Request resolved: #169519
Approved by: https://github.com/zpcore, https://github.com/malfet, https://github.com/albanD
@github-actions github-actions bot deleted the gh/wconstab/467/head branch January 4, 2026 02:21

Labels

ciflow/inductor, ciflow/trunk, Merged, release notes: distributed (dtensor)
