[ContextParallel] add process-time based Round-Robin load-balance to CP #163617
XilunWu wants to merge 17 commits into gh/XilunWu/172/base
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163617
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit ebb2dcd with merge base b54e466. This comment was automatically generated by Dr. CI and updates every 15 minutes.
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
```python
    tasks_in_group, _ = torch.sort(tasks_in_group, dim=1)
    return tasks_in_group

def _generate_indices(self, restore: bool = False) -> Tensor:
```
Very pretty :)
We don't have a DtoH sync here, right?

No, all operations are tensor ops and happen on `block_mask`'s device.
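For intuition, here is a minimal, self-contained sketch of a process-time-based zig-zag Round-Robin assignment done purely with tensor ops, so everything stays on the mask's device. The helper name `zigzag_round_robin` and the toy costs are made up for illustration; this is not the PR's actual `_PTRRLoadBalancer.ptrr_scheduling`, only the same idea, ending in the same per-rank sort shown in the diff above.

```python
import torch

def zigzag_round_robin(cost: torch.Tensor, group_size: int) -> torch.Tensor:
    """Deal blocks to `group_size` ranks in zig-zag (snake) order, heaviest
    first, using only tensor ops so nothing needs to leave the device."""
    assert cost.numel() % group_size == 0
    # Block ids ordered from most to least expensive.
    order = torch.argsort(cost, descending=True)
    # One row per dealing round: (num_rounds, group_size).
    rounds = order.view(-1, group_size).clone()
    # Reverse every other round so the rank that got the heaviest block in one
    # round gets the lightest block in the next.
    rounds[1::2] = rounds[1::2].flip(-1)
    # Column r of `rounds` is rank r's block list; transpose and sort each
    # rank's list, as in the `torch.sort(tasks_in_group, dim=1)` line above.
    tasks_in_group = rounds.t().contiguous()
    tasks_in_group, _ = torch.sort(tasks_in_group, dim=1)
    return tasks_in_group

# Toy run: 8 blocks with made-up per-block costs, 4 ranks.
cost = torch.tensor([5, 1, 4, 2, 8, 3, 7, 6])
shards = zigzag_round_robin(cost, group_size=4)   # tensor([[1, 4], [3, 6], [5, 7], [0, 2]])
per_rank_cost = cost[shards].sum(dim=1)           # tensor([9, 9, 9, 9]) -- balanced
```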
…balance to CP" **Summary** The load-balancing problem can be modeled as [identical-machines scheduling](https://en.wikipedia.org/wiki/Identical-machines_scheduling) problem. We already provided an easy-to-extend interface in #161062 for implementing load-balancing and in this PR we start with adding a Round-Robin solution as an example and also a verification. This can be easily adapted to other solutions like Shortest-processing-time-first/ Longest-processing-time-first with extra padding added for collectives. - Added a new type of `_LoadBalancer` implementation `_PTRRLoadBalancer` which is designed for `flex_attention()`. This load-balance strategy analyzes the `BlockMask` sparsity info and perform Round-Robin (unlike traditional Round-Robin doing it in circular order, we do in zig-zag order). - Make `_context_parallel_buffers` and `context_parallel_unshard` handle batched load-balance index (previously it can only handle non-batched load-balance index), like in `create_cp_block_mask`. **Test** `pytest test/distributed/tensor/test_attention.py` cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" **Summary** The load-balancing problem can be modeled as [identical-machines scheduling](https://en.wikipedia.org/wiki/Identical-machines_scheduling) problem. We already provided an easy-to-extend interface in #161062 for implementing load-balancing and in this PR we start with adding a Round-Robin solution as an example and also a verification. This can be easily adapted to other solutions like Shortest-processing-time-first/ Longest-processing-time-first with extra padding added for collectives. - Added a new type of `_LoadBalancer` implementation `_PTRRLoadBalancer` which is designed for `flex_attention()`. This load-balance strategy analyzes the `BlockMask` sparsity info and perform Round-Robin (unlike traditional Round-Robin doing it in circular order, we do in zig-zag order). - Make `_context_parallel_buffers` and `context_parallel_unshard` handle batched load-balance index (previously it can only handle non-batched load-balance index), like in `create_cp_block_mask`. **Test** `pytest test/distributed/tensor/test_attention.py` cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" **Summary** The load-balancing problem can be modeled as [identical-machines scheduling](https://en.wikipedia.org/wiki/Identical-machines_scheduling) problem. We already provided an easy-to-extend interface in #161062 for implementing load-balancing and in this PR we start with adding a Round-Robin solution as an example and also a verification. This can be easily adapted to other solutions like Shortest-processing-time-first/ Longest-processing-time-first with extra padding added for collectives. - Added a new type of `_LoadBalancer` implementation `_PTRRLoadBalancer` which is designed for `flex_attention()`. This load-balance strategy analyzes the `BlockMask` sparsity info and perform Round-Robin (unlike traditional Round-Robin doing it in circular order, we do in zig-zag order). - Make `_context_parallel_buffers` and `context_parallel_unshard` handle batched load-balance index (previously it can only handle non-batched load-balance index), like in `create_cp_block_mask`. **Test** `pytest test/distributed/tensor/test_attention.py` cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" **Summary** The load-balancing problem can be modeled as [identical-machines scheduling](https://en.wikipedia.org/wiki/Identical-machines_scheduling) problem. We already provided an easy-to-extend interface in #161062 for implementing load-balancing and in this PR we start with adding a Round-Robin solution as an example and also a verification. This can be easily adapted to other solutions like Shortest-processing-time-first/ Longest-processing-time-first with extra padding added for collectives. - Added a new type of `_LoadBalancer` implementation `_PTRRLoadBalancer` which is designed for `flex_attention()`. This load-balance strategy analyzes the `BlockMask` sparsity info and perform Round-Robin (unlike traditional Round-Robin doing it in circular order, we do in zig-zag order). - Make `_context_parallel_buffers` and `context_parallel_unshard` handle batched load-balance index (previously it can only handle non-batched load-balance index), like in `create_cp_block_mask`. **Test** `pytest test/distributed/tensor/test_attention.py` cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
fegin left a comment:
LGTM, please address the comments before merging the PR.
```python
    Warning:
        For Multi-Head Attention, we require the masks over the head dimension are identical
        (i.e. `self.block_mask` must have shape (B, 1, seq_len, seq_len) or (1, 1, seq_len, seq_len)).
```
We should add the check in `__init__()`.
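A rough sketch of what that check could look like (the helper name `_check_mask_heads` is hypothetical; it reads `kv_num_blocks`, whose shape is `(B, H, num_q_blocks)`, and would be called from `__init__()`):

```python
from torch.nn.attention.flex_attention import BlockMask

def _check_mask_heads(block_mask: BlockMask) -> None:
    # kv_num_blocks has shape (B, H, num_q_blocks); the Round-Robin strategy
    # assumes the mask is identical across heads, i.e. H == 1.
    _, H, _ = block_mask.kv_num_blocks.shape
    if H != 1:
        raise ValueError(
            "_PTRRLoadBalancer requires identical masks across heads: "
            f"block_mask must have shape (B, 1, seq_len, seq_len), got H={H}"
        )
```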
```python
non_sparse_kv_num_blocks = (
    kv_num_blocks + full_kv_num_blocks
    if full_kv_num_blocks is not None
    else kv_num_blocks
)
B, H, Q = non_sparse_kv_num_blocks.shape
# requirement: the masking is identical across heads (i.e. H == 1 in BlockMask)
non_sparse_kv_num_blocks = non_sparse_kv_num_blocks.view(-1, Q)  # (B, Q_BLK)

batch_ptrr = torch.vmap(
    functools.partial(
        _PTRRLoadBalancer.ptrr_scheduling,
        group_size=self.world_size,
    )
)
ptrr_indices = batch_ptrr(
    non_sparse_kv_num_blocks
)  # (B, group_size, num_blks_in_group)
ptrr_indices = ptrr_indices.reshape(B, -1)  # (B, num_blocks)

# NOTE: only support the case where the qkv block size are equal
q_blk_size, kv_blk_size = block_mask.BLOCK_SIZE
assert q_blk_size == kv_blk_size, (
    "for now only support q_blk_size == kv_blk_size"
)

indices = torch.arange(
    q_blk_size * ptrr_indices.size(1), device=ptrr_indices.device
).view(-1, q_blk_size)  # (NUM_BLOCKS, BLOCK_SIZE)
indices = indices[ptrr_indices].view(B, -1)  # (B, qkv_size)

if restore:
    indices = torch.vmap(torch.argsort)(indices)
```
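To make the index expansion concrete, here is a tiny self-contained walkthrough with made-up numbers (one batch entry, 4 blocks of size 2, and an arbitrary block permutation standing in for `ptrr_indices`): the block-level Round-Robin permutation is turned into element-level gather indices, and `restore=True` inverts it with a per-batch `argsort`.

```python
import torch

q_blk_size = 2
ptrr_indices = torch.tensor([[2, 0, 3, 1]])                 # (B, num_blocks)
B = ptrr_indices.size(0)

indices = torch.arange(q_blk_size * ptrr_indices.size(1)).view(-1, q_blk_size)
# indices == [[0, 1], [2, 3], [4, 5], [6, 7]]               # (NUM_BLOCKS, BLOCK_SIZE)
shard_idx = indices[ptrr_indices].view(B, -1)               # tensor([[4, 5, 0, 1, 6, 7, 2, 3]])

# With restore=True, the per-batch argsort inverts the permutation.
restore_idx = torch.vmap(torch.argsort)(shard_idx)
x = torch.arange(8).view(1, -1)
assert torch.equal(x[0, shard_idx[0]][restore_idx[0]], x[0])
```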
I'm wondering whether we should put this logic into a separate function. The main reason is that I'm worried about the performance implications and am thinking about whether we should compile the code.

Yes, simply add `@torch.compile(fullgraph=True)` to the function definition. But we'll need to remove the `raise` and `assert` statements, since they would break the graph when hit.
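A minimal sketch of that suggestion, assuming the tail of `_generate_indices` is factored into a standalone helper; the name `_expand_block_indices` and its signature are hypothetical, and the `q_blk_size == kv_blk_size` assert would stay outside the compiled region:

```python
import torch

@torch.compile(fullgraph=True)
def _expand_block_indices(
    ptrr_indices: torch.Tensor, q_blk_size: int, restore: bool
) -> torch.Tensor:
    # Expand the block-level permutation to element-level gather indices.
    B = ptrr_indices.size(0)
    indices = torch.arange(
        q_blk_size * ptrr_indices.size(1), device=ptrr_indices.device
    ).view(-1, q_blk_size)
    indices = indices[ptrr_indices].view(B, -1)
    if restore:
        indices = torch.vmap(torch.argsort)(indices)
    return indices
```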
…balance to CP" **Summary** The load-balancing problem can be modeled as [identical-machines scheduling](https://en.wikipedia.org/wiki/Identical-machines_scheduling) problem. We already provided an easy-to-extend interface in #161062 for implementing load-balancing and in this PR we start with adding a Round-Robin solution as an example and also a verification. This can be easily adapted to other solutions like Shortest-processing-time-first/ Longest-processing-time-first with extra padding added for collectives. - Added a new type of `_LoadBalancer` implementation `_PTRRLoadBalancer` which is designed for `flex_attention()`. This load-balance strategy analyzes the `BlockMask` sparsity info and perform Round-Robin (unlike traditional Round-Robin doing it in circular order, we do in zig-zag order). - Make `_context_parallel_buffers` and `context_parallel_unshard` handle batched load-balance index (previously it can only handle non-batched load-balance index), like in `create_cp_block_mask`. **Test** `pytest test/distributed/tensor/test_attention.py` cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…balance to CP" **Summary** The load-balancing problem can be modeled as [identical-machines scheduling](https://en.wikipedia.org/wiki/Identical-machines_scheduling) problem. We already provided an easy-to-extend interface in #161062 for implementing load-balancing and in this PR we start with adding a Round-Robin solution as an example and also a verification. This can be easily adapted to other solutions like Shortest-processing-time-first/ Longest-processing-time-first with extra padding added for collectives. - Added a new type of `_LoadBalancer` implementation `_PTRRLoadBalancer` which is designed for `flex_attention()`. This load-balance strategy analyzes the `BlockMask` sparsity info and perform Round-Robin (unlike traditional Round-Robin doing it in circular order, we do in zig-zag order). - Make `_context_parallel_buffers` and `context_parallel_unshard` handle batched load-balance index (previously it can only handle non-batched load-balance index), like in `create_cp_block_mask`. **Test** `pytest test/distributed/tensor/test_attention.py` cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
#163617 removed the if/else statement that checked whether the input buffers have the batch dimension. This PR fixes the issue and also adds a test. In the future, we should explicitly ask users to unsqueeze the batch dimension; that would be a BC change to the existing contract, but implicitly inferring the existence of the batch dimension is not safe.

Pull Request resolved: #165792 Approved by: https://github.com/XilunWu
…CP (pytorch#163617) Pull Request resolved: pytorch#163617 Approved by: https://github.com/fegin
ghstack-source-id: 970f2cc Pull Request resolved: pytorch/pytorch#163617
Stack from ghstack (oldest at bottom):
Summary
The load-balancing problem can be modeled as the [identical-machines scheduling](https://en.wikipedia.org/wiki/Identical-machines_scheduling) problem. We already provided an easy-to-extend interface in #161062 for implementing load balancing, and in this PR we start by adding a Round-Robin solution as an example, along with verification. This can be easily adapted to other solutions such as Shortest-processing-time-first / Longest-processing-time-first, with extra padding added for collectives.
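For comparison, a minimal sketch (a hypothetical helper, not part of this PR) of the greedy Longest-processing-time-first variant mentioned above; unlike Round-Robin it can give different ranks different block counts, which is why extra padding would be needed for the collectives:

```python
import torch

def lpt_scheduling(cost: torch.Tensor, group_size: int) -> list[list[int]]:
    """Greedy Longest-processing-time-first: hand each block (heaviest first)
    to the currently least-loaded rank."""
    loads = torch.zeros(group_size)
    assignment: list[list[int]] = [[] for _ in range(group_size)]
    for blk in torch.argsort(cost, descending=True).tolist():
        rank = int(torch.argmin(loads))
        assignment[rank].append(blk)
        loads[rank] += cost[blk]
    return assignment

# e.g. lpt_scheduling(torch.tensor([9., 3., 2., 1.]), group_size=2)
# -> [[0], [1, 2, 3]]: ranks end up with different block counts,
#    which is where the extra padding for collectives comes in.
```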
- Added a new type of `_LoadBalancer` implementation, `_PTRRLoadBalancer`, which is designed for `flex_attention()`. This load-balance strategy analyzes the `BlockMask` sparsity info and performs Round-Robin scheduling (unlike traditional Round-Robin, which assigns in circular order, we assign in zig-zag order).
- Make `_context_parallel_buffers` and `context_parallel_unshard` handle a batched load-balance index (previously they could only handle a non-batched load-balance index), like in `create_cp_block_mask`.

Test

`pytest test/distributed/tensor/test_attention.py`

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci