[BugFix] Async scheduling and PP compatibility with DP #23770
Conversation
Code Review
This pull request introduces several fixes for async scheduling and pipeline parallelism with data parallelism. The main changes involve refactoring the batch queue handling in EngineCore from queue.Queue to collections.deque and adjusting the logic in step_with_batch_queue to better support pipelining. It also adds a fix in ray_utils for pipeline parallelism and extends tests for async scheduling. While the changes are mostly correct, I've found a critical issue in the updated step_with_batch_queue logic that could lead to incorrect behavior in data parallel mode.
Also fixes issue with finished requests not being processed in async scheduling and PP cases. Signed-off-by: Nick Hill <nhill@redhat.com>
```diff
 logger.info("Batch queue is enabled with size %d",
             self.batch_queue_size)
-self.batch_queue = queue.Queue(self.batch_queue_size)
+self.batch_queue = deque(maxlen=self.batch_queue_size)
```
The queue is accessed only from the core loop thread so does not need to be threadsafe/blocking.
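To illustrate the trade-off being made here, the following is a minimal sketch (not vLLM's actual code; the size of 4 is an arbitrary assumption) of using a plain `collections.deque` where single-threaded access makes `queue.Queue`'s locking and blocking semantics unnecessary:

```python
from collections import deque

# Hypothetical batch queue touched only from one thread: appends and
# pops are O(1) and need no locking, unlike queue.Queue.
batch_queue = deque(maxlen=4)  # assumed batch_queue_size of 4

# Producer side: newest in-flight batch goes on the left.
for batch_id in range(3):
    batch_queue.appendleft(batch_id)

# Consumer side: pop the oldest in-flight batch from the right (FIFO).
oldest = batch_queue.pop()
```

One caveat of this substitution: a full `deque(maxlen=...)` silently evicts from the opposite end on append rather than blocking like `queue.Queue(maxsize=...)`, so the caller must check `len(batch_queue) < batch_queue.maxlen` before appending.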
@WoosukKwon I think this is ready to review. We're also testing it in wide-ep deployment.
```python
# Queue is empty. We should not reach here since this method should
# only be called when the scheduler contains requests or the queue
# is non-empty.
return None, False
```
Do you mean it's a bug if we reach this code? Should we add an assert if that's the case?
I'm not sure; it's still a valid state, and by returning this, things stay correct/consistent. So it seems reasonable to keep it like this in terms of the method's behaviour.
The condition that means we shouldn't reach here is enforced outside of this method (in the main core engine loop).
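To make the state being discussed concrete, here is a hypothetical skeleton of a batch-queue step method (names, structure, and return values are assumptions for illustration, not vLLM's real implementation), showing why the empty-queue branch is a valid no-op inside the method even though the caller is expected to avoid it:

```python
from collections import deque
from typing import Optional, Tuple


class EngineCoreSketch:
    """Hypothetical sketch of a batch-queue step; not vLLM's real code."""

    def __init__(self, batch_queue_size: int = 2):
        self.batch_queue: deque = deque(maxlen=batch_queue_size)
        self.waiting_requests = 0  # stand-in for the scheduler's backlog

    def step_with_batch_queue(self) -> Tuple[Optional[str], bool]:
        # Prefer scheduling a new batch when there is work and pipeline
        # capacity, so stages stay filled.
        if self.waiting_requests and len(self.batch_queue) < self.batch_queue.maxlen:
            self.waiting_requests -= 1
            self.batch_queue.appendleft("in-flight-batch")
            return None, True  # scheduled a batch, no outputs yet

        if self.batch_queue:
            # Otherwise wait on the oldest in-flight batch (FIFO) and
            # return its outputs.
            _ = self.batch_queue.pop()
            return "outputs", True

        # Queue empty and nothing to schedule: a valid, consistent state.
        # The caller (the core busy loop) is expected not to invoke this
        # method in that state, which is why no assert is raised here.
        return None, False
```

In this sketch, the guard that prevents reaching the final branch lives in the caller, matching the point made above about the main core engine loop.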
WoosukKwon left a comment:
LGTM. Amazing!
…23770) Signed-off-by: Nick Hill <nhill@redhat.com>
### What this PR does / why we need it?
Based on vllm-project/vllm#23770: fixes async scheduling and PP compatibility with DP, also fixes an issue with finished requests not being processed in the async scheduling and PP cases, as well as possible worker race conditions.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@544fe76
---------
Signed-off-by: jesse <szxfml@gmail.com>
Also fixes issue with finished requests not being processed in async scheduling and PP cases, and possible worker race conditions.
Should resolve/replace:
Some possible follow-on work, out of scope for this PR: