[DTensor] unbacked matmuls for no-redistribute case #168051

Closed
pianpwk wants to merge 11 commits into gh/pianpwk/34/base from gh/pianpwk/34/head

Conversation

@pianpwk
Contributor

@pianpwk pianpwk commented Nov 18, 2025

Stack from ghstack (oldest at bottom):

This allows compiling a matmul on two DTensors with fully unbacked sizes, when a zero-cost strategy is available.

Changes in this PR:

  • mark_unbacked() would previously error on tensor subclasses; now, for DTensors, it allocates unbacked symbols for both the inner and outer sizes. The main motivation here is testing, so I'm happy to tweak the semantics. The unbacked-binding search process also now matches on DTensor outer sizes. (A minimal test sketch follows this list.)
  • Selecting an op strategy in sharding propagation is based on minimal redistribution costs, and these costs are functions of tensor shapes, so they can be unbacked expressions. This PR makes the selection more unbacked-friendly, choosing negative- or zero-cost strategies when they are available. When these "trivial" strategies aren't available, selection requires comparing unbacked costs; that case is addressed in the next PR (using fallback hints).
  • For matmul strategies, sharding-prop rules filter out strategies where the matmul inputs fail the is_tensor_shardable check on the given DeviceMesh. In eager, this filters out cases where the size of the sharded dim is smaller than the number of shards. In the compiled, unbacked case we often encounter a dim size u_ that can be either larger or smaller than the number of shards. This PR assumes such cases are shardable by default; the implication is that strategies sharding on unbacked dimensions are included for consideration and, if selected, can lead to uneven sharding or zero-size shards at runtime (see the shardability sketch after this list). Alternatives would be 1) the current state of things: raise a data-dependent error (DDE) and force the user to pick a path via torch._check(size of sharded dim < or >= num shards), or 2) assume the non-shardable case and never include sharded strategies unless the user picks the shardable path. More discussion in DTensor Matmul Compile with Unbacked Symint Failure #165034 (comment).
  • Lastly, testing traced redistribution decisions required the aot_eager backend, so that the collectives/ops were hardcoded in the graph (the eager backend would go through DTensor.dispatch again). This seemed to require re-enabling proxy tracking during sharding prop, basically reverting [dtensor][compile] Disable proxy mode in sharding prop rules #163126. Otherwise, errors like RuntimeError: Max(1, u2) (<class 'torch.SymInt'>, 140294330350224) is not tracked with proxy for <torch.fx.experimental.proxy_tensor.PythonKeyTracer object at 0x7f98d1b14af0> show up for DTensor outer strides.
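
A minimal sketch of the kind of test this enables (hypothetical code, not the PR's actual test; it assumes a process group and a 1-D mesh are already set up, and the exact placements/backend in the real tests may differ):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, distribute_tensor

def check_unbacked_matmul(world_size: int) -> None:
    # Assumes torch.distributed is already initialized on each rank.
    mesh = init_device_mesh("cpu", (world_size,))
    a = distribute_tensor(torch.randn(8, 16), mesh, [Replicate()])
    b = distribute_tensor(torch.randn(16, 32), mesh, [Replicate()])

    # Mark every size unbacked; with this PR, the DTensor inputs get unbacked
    # symbols for both the outer (global) and inner (local) sizes.
    for t in (a, b):
        torch._dynamo.mark_unbacked(t, 0)
        torch._dynamo.mark_unbacked(t, 1)

    @torch.compile(backend="aot_eager", fullgraph=True)
    def f(x, y):
        return x @ y

    # Replicate @ Replicate has a zero-cost strategy, so sharding prop can pick
    # it without guarding on the unbacked sizes (no data-dependent error).
    out = f(a, b)
    assert out.shape == (8, 32)
```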
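
A rough sketch of the "shardable by default" behavior for unbacked dims (the helper name is made up, and whether the implementation uses guard_or_true or an equivalent check is an assumption here; the real check is DTensor's is_tensor_shardable in sharding prop):

```python
from torch.fx.experimental.symbolic_shapes import guard_or_true

def dim_is_shardable(dim_size, num_shards) -> bool:
    # Backed sizes behave like the eager check: shardable iff size >= num shards.
    # For an unbacked size u_ (which could be larger or smaller than num_shards),
    # guard_or_true avoids a data-dependent error and optimistically treats the
    # dim as shardable, accepting uneven or zero-size shards at runtime.
    return guard_or_true(dim_size >= num_shards)
```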

cc @ezyang @EikanWang @jgong5 @wenzhe-nrv @voznesenskym @penguinwu @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo @chenyang78 @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Nov 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168051

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 6dd338b with merge base 7b7af39:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, module: dynamo, oncall: distributed, and release notes: fx labels Nov 18, 2025
@pytorch-bot pytorch-bot bot removed the oncall: distributed label Nov 18, 2025
@pytorch-bot

pytorch-bot bot commented Nov 18, 2025

The label oncall: distributed is only applicable to issues and has been removed. Please only use this label on issues.

@pianpwk pianpwk marked this pull request as draft November 18, 2025 00:10

@pianpwk pianpwk changed the title from "basic unbacked matmuls" to "[DTensor][dynamic shapes] unbacked matmuls for no-redistribute case" Nov 18, 2025

@pianpwk pianpwk changed the title from "[DTensor][dynamic shapes] unbacked matmuls for no-redistribute case" to "[DTensor] unbacked matmuls for no-redistribute case" Nov 18, 2025
elif min_cost == 0 and no_redistribute_strategy_index != -1:
# If there's no redistribute cost, we select the one with no redistribute.
if op_schema is not None:
if guard_or_false(redistribute_cost < 0):
Contributor

Keep the old comment that explains the negative cost:

# If there's negative cost, we select the one with the minimal cost,
# even if this means we need to redistribute, e.g. via local chunking.
# E.g. this can happen for ops in self.op_to_shape_and_stride_idx
# when the inputs / outputs are sharded.

if (
negative_cost_index == -1
or redistribute_cost < op_spec_costs[negative_cost_index]
# assume negative costs are coefficients, so we don't need guard_or_false here
Contributor

Hmm, you are not assuming anything here: you know for sure that redistribute_cost is not unbacked and that op_spec_costs[negative_cost_index] is also never unbacked. Well, you are assuming that, because you were able to check guard_or_false(redistribute_cost < 0). But if there were a torch._check(u0 < 0), and cost 1 was u0, and likewise cost 2 was u1, then comparing u0 and u1 would fail.

Just add guard_or_false, or wait until we hit a DDE here; I mean, this can be tricky to repro. But there's no need for this comment, it's confusing.

elif zero_cost_index == -1:
zero_cost_index = strategy_idx

# prioritize negative/zero/no redistribute cost strategies
Contributor

@laithsakka laithsakka Dec 2, 2025

Maybe add a comment that we could end up not picking the lowest cost, but that's OK: if any strategy is known to be zero-cost, negative-cost, or no-redistribute, then we don't want to throw a DDE.

Contributor Author

hmm all other costs (unbacked expressions) will be non-negative, so they're always worse options than the zero/negative cost options.

Contributor

With respect to the actual inputs: had things not been unbacked, we could have picked a lower cost (one that would be negative), for example if we ran eagerly.
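
To make the priority being discussed concrete, here is a minimal sketch with made-up names and simplified per-strategy total costs (the real logic lives in DTensor's sharding propagation and operates on per-input redistribute costs; this is illustrative only):

```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def pick_strategy_index(op_spec_costs, no_redistribute_index=-1):
    """Prefer strategies whose cost is provably negative or zero, then the
    no-redistribute strategy, so "trivial" cases never guard on unbacked costs."""
    negative_cost_index = -1
    zero_cost_index = -1
    for idx, cost in enumerate(op_spec_costs):
        if guard_or_false(cost < 0):
            # Negative cost, e.g. ops that can be handled via local chunking.
            # Per the review discussion above, the comparison itself is also
            # hedged with guard_or_false in case both candidates are unbacked.
            if negative_cost_index == -1 or guard_or_false(
                cost < op_spec_costs[negative_cost_index]
            ):
                negative_cost_index = idx
        elif guard_or_false(cost == 0) and zero_cost_index == -1:
            zero_cost_index = idx

    for candidate in (negative_cost_index, zero_cost_index, no_redistribute_index):
        if candidate != -1:
            return candidate
    # No trivial strategy: comparing (possibly unbacked) costs is deferred to the
    # follow-up PR, which uses fallback hints; a plain min() works for backed costs.
    return min(range(len(op_spec_costs)), key=lambda i: op_spec_costs[i])
```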

@laithsakka
Contributor

Looks good and reasonable overall; some comments, nits, and questions.

@pianpwk
Contributor Author

pianpwk commented Dec 2, 2025

mm so when we create a DTensor with Replicate/Shard placements we probably want to know the relation between the unbacked symbols (the locals and the globals), for example.

I agree in the Replicate/Partial case, where global sizes are preserved, but with Shard and uneven sharding it's probably safer to keep them separate by default (see the sketch below).
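
For intuition, a rough sketch of why the local and global sizes can't be tied together by a simple relation under Shard (my own helper, approximating the torch.chunk-style splitting DTensor uses; not DTensor's actual API):

```python
def local_shard_size(global_size: int, num_shards: int, rank: int) -> int:
    # torch.chunk-style splitting: earlier ranks get ceil(global / num_shards)
    # elements, later ranks get whatever remains, possibly zero.
    full = -(-global_size // num_shards)  # ceil division
    start = rank * full
    return max(min(global_size - start, full), 0)

# global_size=10, num_shards=4 -> local sizes [3, 3, 3, 1] (uneven)
# global_size=2,  num_shards=4 -> local sizes [1, 1, 0, 0] (zero-size shards)
```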

@pianpwk
Contributor Author

pianpwk commented Dec 2, 2025

For the last point ("Lastly, testing traced redistribution decisions required using aot_eager backend, ..."): you explained the issue; in the summary, can you explain what change you made to address that, if any?

This is the change in _sharding_prop.py where disable_proxy_modes_tracing() is removed.

@pianpwk
Contributor Author

pianpwk commented Dec 4, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Dec 4, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m2-15)

Details for Dev Infra team: raised by workflow job.

@pianpwk
Contributor Author

pianpwk commented Dec 5, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

umechand-amd pushed a commit to ROCm/pytorch that referenced this pull request Dec 8, 2025
Pull Request resolved: pytorch#168051
Approved by: https://github.com/laithsakka
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
Pull Request resolved: #168051
Approved by: https://github.com/laithsakka
tiendatngcs pushed a commit to tiendatngcs/pytorch-Dec25 that referenced this pull request Dec 10, 2025
ghstack-source-id: a616b57
Pull Request resolved: pytorch/pytorch#168051
tiendatngcs pushed a commit to tiendatngcs/pytorch-Dec25 that referenced this pull request Dec 10, 2025
ghstack-source-id: d57d9c0
Pull Request resolved: pytorch/pytorch#168051
@github-actions github-actions bot deleted the gh/pianpwk/34/head branch January 5, 2026 02:22

Labels

ciflow/inductor, ciflow/trunk, fx, Merged, module: dynamo, release notes: distributed (dtensor), release notes: fx
