[DTensor] unbacked matmuls for no-redistribute case #168051
pianpwk wants to merge 11 commits into gh/pianpwk/34/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168051
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 6dd338b with merge base 7b7af39:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
elif min_cost == 0 and no_redistribute_strategy_index != -1:
    # If there's no redistribute cost, we select the one with no redistribute.

if op_schema is not None:

if guard_or_false(redistribute_cost < 0):
keep the old comment that explains the negative cost.
# If there's negative cost, we select the one with the minimal cost,
# even if this means we need to redistribute, e.g. via local chunking.
# E.g. this can happen for ops in self.op_to_shape_and_stride_idx
# when the inputs / outputs are sharded.
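For context, a minimal standalone sketch of why this line needs `guard_or_false` at all; the wrapper function is hypothetical, only the `guard_or_false` helper and its import are real:

```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def is_negative_cost(redistribute_cost) -> bool:
    # In eager, redistribute_cost is a plain number and this is an ordinary
    # comparison. Under compile with unbacked sizes, it can be a SymInt
    # expression in symbols like u0, and bool(redistribute_cost < 0) would
    # raise a data-dependent error (DDE). guard_or_false instead answers
    # False when the comparison is undecidable, so unbacked costs are
    # simply never treated as negative.
    return guard_or_false(redistribute_cost < 0)
```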
if (
    negative_cost_index == -1
    or redistribute_cost < op_spec_costs[negative_cost_index]
    # assume negative costs are coefficients, so we don't need guard_or_false here
mm, you are not assuming anything here: you know for sure that redistribute_cost is not unbacked, and that op_spec_costs[negative_cost_index] is also never unbacked. Well, you are assuming that, because you were able to check guard_or_false(redistribute_cost < 0).
But if there was a torch._check(u0 < 0), and cost1 was u0, and likewise cost2 was u1, then comparing u0 and u1 would fail.
Just add guard_or_false, or wait until we hit a DDE here; I mean, this can be tricky to repro.
But either way there is no need for this comment, it's confusing.
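A hedged repro of the scenario described in this comment (a standalone sketch, not from the PR's test suite): even after a `torch._check` pins the sign of one unbacked value, comparing two unbacked symbols is still data-dependent.

```python
import torch

@torch.compile(fullgraph=True, backend="eager")
def compare_unbacked_costs(t):
    # .tolist() produces unbacked SymInts u0, u1 under compile
    u0, u1 = t.tolist()
    torch._check(u0 < 0)  # we "know" cost1 is negative...
    # ...but u0 < u1 is still undecidable: guarding on it directly raises a
    # data-dependent (GuardOnDataDependentSymNode) error instead of picking
    # a branch.
    if u0 < u1:
        return t + 1
    return t - 1

# compare_unbacked_costs(torch.tensor([-2, 3]))  # raises a DDE at trace time
```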
elif zero_cost_index == -1:
    zero_cost_index = strategy_idx

# prioritize negative/zero/no redistribute cost strategies
maybe add a comment that we could end up not picking the lowest cost, but it's ok: if any strategy is known to be 0 or negative or no_redistribute, then we don't want to throw a DDE.
hmm all other costs (unbacked expressions) will be non-negative, so they're always worse options than the zero/negative cost options.
with respect to the actual inputs, had things not been unbacked, we could have picked a lower cost (one that would be negative) [if we ran eagerly, for example]
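A simplified sketch of the selection order being discussed, paraphrased from the PR description, with `guard_or_false` also applied to the negative-vs-negative comparison per the review above; this is not the exact _sharding_prop.py code:

```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def pick_strategy_index(op_spec_costs, no_redistribute_index=-1):
    negative_index, zero_index = -1, -1
    for idx, cost in enumerate(op_spec_costs):
        if guard_or_false(cost < 0):
            # track the provably-most-negative cost; guard_or_false keeps
            # this comparison from raising a DDE if both costs are unbacked
            if negative_index == -1 or guard_or_false(
                cost < op_spec_costs[negative_index]
            ):
                negative_index = idx
        elif zero_index == -1 and guard_or_false(cost == 0):
            zero_index = idx
    # prioritize negative/zero/no-redistribute strategies: the remaining
    # (possibly unbacked) costs are non-negative, so never strictly better
    for idx in (negative_index, zero_index, no_redistribute_index):
        if idx != -1:
            return idx
    # only here do we have to compare possibly-unbacked costs, which can
    # raise a DDE (the next PR addresses this with fallback hints)
    return min(range(len(op_spec_costs)), key=op_spec_costs.__getitem__)
```

With plain int costs this behaves like an ordinary argmin with the stated priority; with unbacked SymInt costs, every undecidable comparison degrades to False instead of raising.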
looks good and reasonable overall; some comments, nits, and questions.
I agree in the Replicate/Partial case where global sizes are preserved, but with Shard and uneven sharding it's probably safer to keep them separate by default.
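To make the uneven-sharding point concrete, a small standalone example (plain `torch.chunk`, which is how Shard placements split a dim, modulo padding details):

```python
import torch

# Shard(0) over a 4-way mesh splits a dim like torch.chunk: global size 10
# becomes local sizes [3, 3, 3, 1], so the global size is not recoverable
# from any single rank's local size.
shards = torch.chunk(torch.arange(10), 4)
print([s.shape[0] for s in shards])  # [3, 3, 3, 1] -- uneven
# With an unbacked global size u0, each rank's local size is a different
# function of u0, which is why inner (local) and outer (global) sizes are
# kept as separate unbacked symbols by default.
```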
This is the change in _sharding_prop.py where …
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m2-15). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #168051
Approved by: https://github.com/laithsakka
Stack from ghstack (oldest at bottom):
This allows compiling a matmul on 2 DTensors with fully unbacked sizes, when a zero-cost strategy is available.
Changes with the PR:
- `mark_unbacked()` would previously error on tensor subclasses; now for DTensors it allocates unbacked symbols for both inner & outer sizes. The main motivation here is for testing, so happy to tweak semantics. The unbacked binding search process also now matches on DTensor outer sizes.
- Selecting an op strategy in sharding propagation is based on minimal redistribution costs, and these costs are functions of tensor shapes, so they can be unbacked expressions. This PR makes the selection process more unbacked-friendly, choosing negative or zero-cost strategies when they're available. When these "trivial" strategies aren't available, selection requires comparing unbacked costs, addressed in the next PR (with usage of fallback hints).
- For matmul strategies, sharding prop rules filter out strategies where the matmul inputs fail the `is_tensor_shardable` check on the given DeviceMesh. In eager, this filters out cases where `size of sharded dim < num shards`. In the compiled & unbacked case, we'll often encounter dim size `u_` where `u_` can be both larger and smaller than num shards. This PR assumes such cases are shardable by default; the implication is that strategies that shard on unbacked dimensions are included for consideration and, if selected, can lead to uneven sharding/zero-size shards at runtime. Alternatives would be 1) the current state of things: DDE, forcing the user to pick a path with `torch._check(size of sharded dim < or >= num shards)`, or 2) assume the non-shardable case and never include sharded strategies unless the user picks the shardable path. More discussion in DTensor Matmul Compile with Unbacked Symint Failure #165034 (comment).
- Lastly, testing traced redistribution decisions required using the aot_eager backend, so that the collectives/ops were hardcoded (the eager backend would go through DTensor.dispatch again). This seemed to require re-enabling proxy tracking during shard prop, basically reverting #163126. Otherwise, errors like `RuntimeError: Max(1, u2) (<class 'torch.SymInt'>, 140294330350224) is not tracked with proxy for <torch.fx.experimental.proxy_tensor.PythonKeyTracer object at 0x7f98d1b14af0>` show up for DTensor outer strides...

cc @ezyang @EikanWang @jgong5 @wenzhe-nrv @voznesenskym @penguinwu @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo @chenyang78 @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci
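For reference, a hedged sketch of the kind of test this enables; it assumes an already-initialized process group (e.g. under torchrun), and the shapes and structure are illustrative rather than the PR's actual test code:

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Replicate

# assumes torch.distributed is already initialized
mesh = init_device_mesh("cpu", (dist.get_world_size(),))
a = distribute_tensor(torch.randn(8, 16), mesh, [Replicate()])
b = distribute_tensor(torch.randn(16, 32), mesh, [Replicate()])

# With this PR, mark_unbacked on a DTensor allocates unbacked symbols for
# both the inner (local) and outer (global) sizes of each marked dim.
for t in (a, b):
    torch._dynamo.decorators.mark_unbacked(t, 0)
    torch._dynamo.decorators.mark_unbacked(t, 1)

# aot_eager hardcodes the traced collectives/ops in the graph, so the
# redistribution decision made during sharding prop is visible in the trace.
compiled_mm = torch.compile(torch.mm, backend="aot_eager", fullgraph=True)
out = compiled_mm(a, b)  # replicate x replicate has a zero-cost strategy
```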