[CP] Replace context_parallel context manager with functional APIs #164500
fegin wants to merge 15 commits into gh/fegin/326/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164500
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV — There is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures — As of commit 4bda542 with merge base d41aa18.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
XilunWu left a comment
Some nits, stamp to unblock us and we can come back later to address them.
tianyu-l left a comment
I somehow feel we don't need _enable_context_parallel_dispatcher as a user-facing API.
cp_q, cp_k, cp_v = _context_parallel_shard(
    mesh, (cp_q, cp_k, cp_v), (seq_dim,) * 3
)
_enable_context_parallel_dispatcher(seq_dim, mesh)
It looks like, for now, this is only for SDPA but not FlexAttention.
Can we put them in sdpa_cp.sdpa_input_fn and sdpa_cp.sdpa_output_fn? It's also safer that way.
I think the problem is that we cannot put _disable_context_parallel_dispatcher in sdpa_output_fn, because we want to wait until the backward pass before disabling it. IIRC, doing something in a backward hook may cause a graph break? I'm not sure.
Discussed offline, keep it for now. We need to think about a better way to integrate DTensor with SDPA
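To make the concern above concrete, here is a minimal sketch (not the code from this PR) of deferring the dispatcher disable to the backward pass via a tensor hook. The `enable_fn`/`disable_fn` parameters stand in for the private `_enable_context_parallel_dispatcher` / `_disable_context_parallel_dispatcher` helpers; only the `(seq_dim, mesh)` signature shown in the reviewed snippet is assumed.

```python
import torch.nn.functional as F


def sdpa_with_cp_dispatcher(q, k, v, seq_dim, mesh, enable_fn, disable_fn):
    # Turn the CP dispatch override on just before the attention call.
    enable_fn(seq_dim, mesh)
    out = F.scaled_dot_product_attention(q, k, v)

    if q.requires_grad:
        # The dispatcher must stay enabled until SDPA's backward has produced
        # the input gradients, so the disable is deferred to a hook that runs
        # once grad(q) is computed. Under torch.compile such a Python hook may
        # introduce a graph break, which is the concern raised in the thread.
        def _disable_after_backward(grad):
            disable_fn()
            return grad

        q.register_hook(_disable_after_backward)
    else:
        disable_fn()
    return out
```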
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[CP] Replace context_parallel context manager with functional APIs (pytorch#164500)

`context_parallel()` being a context manager has annoyed users. Now that we plan to redesign CP's UX to explicitly ask users to:
1. Wrap the attention op into an `nn.Module`
2. Lift any buffers that are not sequence agnostic to input

We can replace `context_parallel()` with two functional APIs: `_context_parallel_shard` and `_enable_context_parallel_dispatcher`.

Pull Request resolved: pytorch#164500
Approved by: https://github.com/XilunWu
ghstack dependencies: pytorch#162542
… (pytorch#163185)

The custom op will fetch the required K and V. Currently, the forward pass is just an all-gather, and the backward pass is a reduce-scatter. While the logic is the same as all_gather_tensor_autograd, the custom op avoids the Autograd warning that wait_tensor() is registered to autograd. For the next step, we should explore how to interpolate the required communication based on the information from BlockMask.

Pull Request resolved: pytorch#163185
Approved by: https://github.com/XilunWu
ghstack dependencies: pytorch#162542, pytorch#164500
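As a rough illustration of the communication pattern described in that commit message (all-gather in the forward pass, reduce-scatter of the gradient in the backward pass), here is a hedged sketch using functional collectives and a plain `autograd.Function`. The class and argument names are invented for illustration; the landed change registers a custom op instead.

```python
import torch
import torch.distributed._functional_collectives as funcol


class AllGatherWithReduceScatterGrad(torch.autograd.Function):
    """All-gather along the sequence dim in forward; reduce-scatter the
    incoming gradient along the same dim in backward."""

    @staticmethod
    def forward(ctx, tensor, gather_dim, group):
        ctx.gather_dim = gather_dim
        ctx.group = group
        out = funcol.all_gather_tensor(tensor, gather_dim=gather_dim, group=group)
        # funcol returns an async tensor; wait_tensor makes the result concrete.
        return funcol.wait_tensor(out)

    @staticmethod
    def backward(ctx, grad_output):
        grad = funcol.reduce_scatter_tensor(
            grad_output, reduceOp="sum", scatter_dim=ctx.gather_dim, group=ctx.group
        )
        return funcol.wait_tensor(grad), None, None
```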
… (pytorch#165039)

No logic change, just polish the docstrings and comments and remove unused variables.

Pull Request resolved: pytorch#165039
Approved by: https://github.com/XilunWu
ghstack dependencies: pytorch#162542, pytorch#164500, pytorch#163185
Stack from ghstack (oldest at bottom):
`context_parallel()` being a context manager has annoyed users. Now that we plan to redesign CP's UX to explicitly ask users to:

1. Wrap the attention op into an `nn.Module`
2. Lift any buffers that are not sequence agnostic to input

we can replace `context_parallel()` with two functional APIs: `_context_parallel_shard` and `_enable_context_parallel_dispatcher`.

cc @H-Huang @awgu @wanchaol @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci
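Putting the pieces from this PR together, a hedged end-to-end sketch of the functional flow might look like the following. The import path and the placement of the disable call are assumptions; the call signatures follow the reviewed snippet above.

```python
import torch.nn.functional as F

# Assumed import location for the private helpers introduced in this PR;
# the real module path may differ.
from torch.distributed.tensor.experimental._attention import (
    _context_parallel_shard,
    _enable_context_parallel_dispatcher,
    _disable_context_parallel_dispatcher,
)


def cp_attention(mesh, q, k, v, seq_dim=2):
    # q/k/v are assumed to be (batch, heads, seq, head_dim), so seq_dim=2.
    # Shard q, k, v along the sequence dimension across the CP mesh.
    cp_q, cp_k, cp_v = _context_parallel_shard(mesh, (q, k, v), (seq_dim,) * 3)

    # Route subsequent SDPA calls through the CP-aware implementation.
    _enable_context_parallel_dispatcher(seq_dim, mesh)
    try:
        out = F.scaled_dot_product_attention(cp_q, cp_k, cp_v)
    finally:
        # Disabling right after the forward is only safe when no backward is
        # needed; the review thread above discusses deferring this to the
        # backward pass instead.
        _disable_context_parallel_dispatcher()
    return out
```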