
[DTensor] add explicit mode (ExplicitRedistributionContext)#166593

Closed
wconstab wants to merge 3 commits into gh/wconstab/448/base from gh/wconstab/448/head

Conversation

Contributor

@wconstab wconstab commented Oct 29, 2025

Stack from ghstack (oldest at bottom):

usage:

```
dx = distribute_tensor(x, device_mesh, [Shard(0)])
dA = distribute_tensor(A, device_mesh, [Shard(0)])
with ExplicitRedistributionContext():
    with self.assertRaisesRegex(RuntimeError, "Implicit redistribution"):
        # Shard(0) @ Shard(0) requires a redistribution
        torch.matmul(dx, dA)
```
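
For context, a minimal sketch (not part of this PR) of how a caller could satisfy the context by redistributing explicitly. It assumes the same `dx`, `dA`, and `device_mesh` as above, that explicit `.redistribute()` calls remain permitted inside the context, and uses the public `Replicate` placement:

```
# Sketch only: assumes explicit redistribute() is allowed inside the context.
from torch.distributed.tensor import Replicate

with ExplicitRedistributionContext():
    # Explicitly replicate dA so the matmul no longer needs an implicit redistribution.
    dA_repl = dA.redistribute(device_mesh, [Replicate()])
    # Shard(0) @ Replicate() has a zero-communication strategy, so this succeeds.
    out = torch.matmul(dx, dA_repl)
```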

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta @msaroufim @dcci


pytorch-bot bot commented Oct 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166593

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9eab1bf with merge base 397d9fe:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Oct 29, 2025
ghstack-source-id: 5466441
Pull Request resolved: #166593
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 29, 2025
Collaborator

kwen2501 commented Oct 30, 2025

This is at least better than using parallelize_module to parallelize an inner op. (Just by reading that statement, you can probably sense where the mismatch is.)

An op is hard to target from a user script because, unlike parameters, ops don't have FQNs.

So the user either has to take a graph-based approach, where ops become nodes they can get a handle to, but that requires writing graph passes to parallelize the program;

or they can bypass the nn.Module layer and interact with the ops eagerly, as you did here.

cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Nov 5, 2025
ghstack-source-id: 48269f6
Pull Request resolved: #166593
cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Nov 5, 2025
ghstack-source-id: fb6a16d
Pull Request resolved: #166593

@wconstab wconstab changed the title Prototype DTensor explicit mode [DTensor] add explicit mode (ExplicitRedistributionContext) Nov 6, 2025
@wconstab wconstab added the release notes: distributed (dtensor) release notes category label Nov 6, 2025
@wconstab wconstab requested review from ezyang and tianyu-l November 6, 2025 00:38
Contributor Author

wconstab commented Nov 6, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 6, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pull bot pushed a commit to j3din00b/pytorch that referenced this pull request Nov 11, 2025
…torch#167370)

Also support nesting, enable/disable, and make the class use a
thread-local for storage so independent threads do not confuse each
other.

Pull Request resolved: pytorch#167370
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#166593
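
A rough sketch of how the nesting and enable/disable behavior described in that commit might look in practice. The `enabled` keyword below is a hypothetical spelling assumed for illustration and is not confirmed by this page; see pytorch#167370 for the actual interface:

```
# Hypothetical sketch: the `enabled` keyword is an assumption; only
# "nesting, enable/disable" is stated in the commit message above.
with ExplicitRedistributionContext():                    # strict scope: implicit redistribution raises
    with ExplicitRedistributionContext(enabled=False):   # nested scope temporarily relaxes the check
        torch.matmul(dx, dA)                             # implicit redistribution allowed again here
    torch.matmul(dx, dA)                                 # back in the outer scope: raises RuntimeError
```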
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
…166593)

usage:

```
dx = distribute_tensor(x, device_mesh, [Shard(0)])
dA = distribute_tensor(A, device_mesh, [Shard(0)])
with ExplicitRedistributionContext():
    with self.assertRaisesRegex(RuntimeError, "Implicit redistribution"):
        # Shard(0) @ Shard(0) requires a redistribution
        torch.matmul(dx, dA)
```

Pull Request resolved: pytorch#166593
Approved by: https://github.com/ezyang
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
…torch#167370)

Also support nesting, enable/disable, and make the class use a
thread-local for storage so independent threads do not confuse each
other.

Pull Request resolved: pytorch#167370
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#166593
@github-actions github-actions bot deleted the gh/wconstab/448/head branch December 7, 2025 02:21

Labels

ciflow/inductor, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (dtensor)
