
[DTensor] Fix slow sharding prop for stack #169519

Closed
wconstab wants to merge 1 commit into gh/wconstab/467/base from gh/wconstab/467/head

Conversation

@wconstab
Contributor

@wconstab wconstab commented Dec 3, 2025

Stack from ghstack (oldest at bottom):

As identified in the original issue, there is quadratic complexity in
the number of input tensors, due to an improperly written sharding prop
rule.

The previous code generated N output strategies for the stack op, one
based on each of the original N input strategies. However, each of the
N output strategies was identical, since the heuristic in the stack rule
is to pick one of the N inputs and follow it.

We now just generate one output strategy.

Fixes #169445
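
To make the complexity change concrete, here is a minimal, self-contained sketch of the before/after behavior described above. It is not the real DTensor code: placements are modeled as plain strings, redistribute costs as 0/1 integers, and old_stack_rule / new_stack_rule are illustrative names, not functions from the PyTorch source.

def old_stack_rule(input_placements: list[str]) -> list[dict]:
    # One candidate per input strategy; every candidate re-walks all N inputs to
    # compute redistribute costs, so the work is O(N^2), and all N candidates are
    # identical because the rule always follows the same chosen input.
    follow = input_placements[0]
    candidates: list[dict] = []
    for _ in input_placements:  # N candidates ...
        costs = [0 if p == follow else 1 for p in input_placements]  # ... times N inputs
        candidates.append({"placement": follow, "redistribute_costs": costs})
    return candidates


def new_stack_rule(input_placements: list[str]) -> list[dict]:
    # A single candidate: choose the placement to follow once and compute the
    # per-input redistribute costs once, so the work is O(N).
    follow = input_placements[0]
    costs = [0 if p == follow else 1 for p in input_placements]
    return [{"placement": follow, "redistribute_costs": costs}]


if __name__ == "__main__":
    inputs = ["Shard(0)", "Shard(0)", "Replicate()"]
    # The old rule produces N identical candidates; the new rule keeps just one.
    assert old_stack_rule(inputs) == new_stack_rule(inputs) * len(inputs)
    print(new_stack_rule(inputs))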

@pytorch-bot

pytorch-bot bot commented Dec 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169519

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit fca2760 with merge base e3f24fd:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Dec 3, 2025
ghstack-source-id: 5b7b303
Pull Request resolved: #169519
@wconstab wconstab requested a review from zpcore December 3, 2025 23:39
first_input_strategy = input_tuple_strategy.children[0]
if not isinstance(first_input_strategy, OpStrategy):
    raise AssertionError(f"Expected OpStrategy, got {first_input_strategy}")
input_strategies: list[OpStrategy] = []
Contributor Author

@wconstab wconstab Dec 3, 2025


This part was just to make mypy happy below: children are typed as 'StrategyType', which can be a TupleStrategy or an OpStrategy, so we need to ensure they are all OpStrategy...
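
To make the mypy point concrete, here is a standalone sketch of the narrowing pattern, using stand-in classes rather than the real DTensor types (collect_op_strategies is an illustrative helper, not code from this PR):

class StrategyType:
    pass


class OpStrategy(StrategyType):
    pass


class TupleStrategy(StrategyType):
    def __init__(self, children: list[StrategyType]) -> None:
        self.children = children


def collect_op_strategies(tuple_strategy: TupleStrategy) -> list[OpStrategy]:
    # children is typed as list[StrategyType]; the isinstance check both guards at
    # runtime against a non-OpStrategy child and narrows the element type for mypy,
    # so appending to a list[OpStrategy] type-checks cleanly.
    input_strategies: list[OpStrategy] = []
    for child in tuple_strategy.children:
        if not isinstance(child, OpStrategy):
            raise AssertionError(f"Expected OpStrategy, got {child}")
        input_strategies.append(child)
    return input_strategies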

@zpcore
Member

zpcore commented Dec 3, 2025

Hmm, this seems to be a good case where we need detect_exists_identical_opspec to verify op strategy to prevent generating the same opspec.

@wconstab wconstab added the release notes: distributed (dtensor) release notes category label Dec 3, 2025
@wconstab
Contributor Author

wconstab commented Dec 3, 2025

Hmm, this seems to be a good case where we need detect_exists_identical_opspec to verify op strategy to prevent generating the same opspec.

why? I don't follow

Member

@zpcore zpcore left a comment


LGTM!

output_spec = DTensorSpec(mesh, tuple(follow_placements))
redistribute_cost = []
for input_spec in input_specs:
    cost = generate_redistribute_costs(strategy, input_spec)
Contributor Author


@zpcore one thing I would like to confirm: this old code looks incorrect to me, in addition to being slower.

We should never be generating the redistribute cost from input 2's placement to input 1's dst spec, right? So using 'strategy' here was a bug?

Member


Good catch! Should be:

for idx, input_spec in enumerate(input_specs):
    cost = generate_redistribute_costs(input_tuple_strategy.children[idx], input_spec)
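
To make the indexing point concrete, here is a toy, self-contained model of that pairing. Plain strings stand in for strategies and specs, and generate_redistribute_costs below is a stand-in with the same name, not the real DTensor helper:

def generate_redistribute_costs(src_spec: str, dst_spec: str) -> float:
    # Stand-in cost model: free when the source placement already matches the target.
    return 0.0 if src_spec == dst_spec else 1.0


def per_input_costs(input_src_specs: list[str], input_dst_specs: list[str]) -> list[float]:
    # Pair input i's current spec with input i's desired spec. Pairing one input's
    # spec against another input's destination is exactly the mix-up discussed above.
    return [
        generate_redistribute_costs(src, dst)
        for src, dst in zip(input_src_specs, input_dst_specs)
    ]


if __name__ == "__main__":
    # Inputs 0 and 1 already match their targets; input 2 needs a redistribute.
    print(per_input_costs(["Shard(0)", "Shard(0)", "Replicate()"],
                          ["Shard(0)", "Shard(0)", "Shard(0)"]))  # [0.0, 0.0, 1.0]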

@zpcore
Member

zpcore commented Dec 4, 2025

Hmm, this seems to be a good case where we need detect_exists_identical_opspec to verify op strategy to prevent generating the same opspec.

why? I don't follow

To clarify, detect_exists_identical_opspec is for unit tests. We can do:

self.assertTrue(
    detect_exists_identical_opspec(
        *sample_input_args,
        op=aten.stack.default,
        mesh=mesh,
        strategy_function=stack_strategy,
    )
)

This is a necessary but not sufficient test to say the strategy is not generating duplicated OpSpecs.

@wconstab
Contributor Author

wconstab commented Dec 4, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 4, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx942.1)

Details for Dev Infra team. Raised by workflow job.

Collaborator

@albanD albanD left a comment


Sounds good!
Any reason this strategy is not shared with cat()?

@wconstab
Contributor Author

wconstab commented Dec 4, 2025

Historically, not sure. If they can be shared, I'll do it as part of a bigger rewrite I'm working on.

@wconstab
Contributor Author

wconstab commented Dec 4, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx942.1)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@wconstab
Contributor Author

wconstab commented Dec 5, 2025

@pytorchbot merge -f

@pytorch-bot

pytorch-bot bot commented Dec 5, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot merge: error: argument -f/--force: expected one argument

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Try @pytorchbot --help for more info.

@wconstab
Contributor Author

wconstab commented Dec 5, 2025

@pytorchbot merge -f "merge -i got stuck?"

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

umechand-amd pushed a commit to ROCm/pytorch that referenced this pull request Dec 8, 2025
Pull Request resolved: pytorch#169519
Approved by: https://github.com/zpcore, https://github.com/malfet, https://github.com/albanD
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
Pull Request resolved: #169519
Approved by: https://github.com/zpcore, https://github.com/malfet, https://github.com/albanD
@github-actions github-actions bot deleted the gh/wconstab/467/head branch January 4, 2026 02:21

Labels

ciflow/inductor, ciflow/trunk, Merged, release notes: distributed (dtensor)
