Skip to content

Conversation

@njhill
Copy link
Member

@njhill njhill commented Sep 11, 2025

It was broken by an incorrect assert introduced in #24265. The breakage was originally obscured by the other NCCL-related 4-GPU distributed tests CI breakage.

This fixes the 4-GPU distributed CI test.

Edit: It looks like the separate pipeline parallel test was still passing, I'm not sure what the difference is but the torchrun version of it was failing due to this.

Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill njhill added the bug Something isn't working label Sep 11, 2025
@mergify mergify bot added the v1 label Sep 11, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several bug fixes related to pipeline parallelism and shutdown procedures. The changes include removing an incorrect assertion that caused failures in pipeline parallel setups, adding a defensive check to prevent errors during interpreter shutdown, and implementing a proper shutdown method for the UniProcExecutor. These fixes enhance the robustness and correctness of the distributed execution. The changes are well-implemented and address the described issues effectively.

Comment on lines +74 to +76
def shutdown(self) -> None:
if worker := self.driver_worker:
worker.shutdown()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These other changes are to fix None deref errors raised during interpreter shutdown due to things already having been garbage collected, a minor issue introduced with #22699.

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 11, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) September 11, 2025 01:58
@DarkLight1337 DarkLight1337 added this to the v0.10.2 milestone Sep 11, 2025
@vllm-bot vllm-bot merged commit e2d8c27 into vllm-project:main Sep 11, 2025
52 of 54 checks passed
@njhill njhill deleted the fix-4gpu-pp branch September 11, 2025 14:13
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants