
Conversation

@minosfuture (Contributor) commented Jul 30, 2025

Purpose

Mitigate #21638 by querying a list of open ports in the main process.

The race condition still exists, but it is much less likely: an already-queried open port could be taken by another process before it is used. In comparison, before this PR the code was betting on master_port + 1 being open.

The ideal solution would be to lock this list of open ports on behalf of torch.distributed, but that would require sophisticated changes, including changes to torch.distributed itself.
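
For reference, the usual pattern behind querying an open port (a minimal sketch; vLLM's actual `get_open_port` helper has additional logic) is to bind to port 0 and let the OS pick. The port is only guaranteed free at query time, which is exactly the residual race described above:

```python
import socket

def get_open_port() -> int:
    """Sketch: ask the OS for a currently free TCP port.

    Binding to port 0 makes the OS pick an unused port; the port is
    released (and can be grabbed by any process) once the socket closes.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```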

Test Plan

Verify DP+TP+EP serving, and run lm_eval against it, with:

VLLM_USE_V1=1 VLLM_LOGGING_LEVEL=DEBUG VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 vllm serve  meta-llama/Llama-4-Scout-17B-16E  --max_model_len 8192 --kv_cache_dtype fp8 --enable-expert-parallel --tensor-parallel-size 2 --data-parallel-size 4 --trust-remote-code --gpu-memory-utilization 0.9 --disable-log-requests --compilation-config '{"full_cuda_graph":true}' 2>&1 | tee ~/log/ep_`date +%Y%m%d_%H%M%S`.log

Test Result

local-chat-completions (model=meta-llama/Llama-4-Scout-17B-16E,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=32), gen_kwargs: (None), limit: 200.0, num_fewshot: 5, batch_size: 1

| Tasks | Version | Filter           | n-shot | Metric      | Value | Stderr   |
|-------|---------|------------------|--------|-------------|-------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.78  | ± 0.0294 |
|       |         | strict-match     | 5      | exact_match | 0.83  | ± 0.0266 |

(Optional) Documentation Update

@gemini-code-assist bot left a comment

Code Review

This pull request aims to fix port conflicts by pre-fetching a list of open ports. The overall approach is sound, but there are a few critical issues in the implementation that could lead to the same port being used multiple times, defeating the purpose of the change. I've identified issues in get_open_ports_list that could result in duplicate ports, and a logic error in get_next_dp_init_port that also leads to port reuse. Additionally, there's a minor docstring issue. Addressing these points will make the solution more robust.

Comment on lines 906 to 927 (@gemini-code-assist bot):

critical

The current implementation of get_open_ports_list may return a list with duplicate ports. This can happen if get_open_port() is called multiple times in a loop and the underlying OS returns the same ephemeral port because the socket is immediately closed and the port is released. This can lead to port conflicts, which this PR aims to solve.

To guarantee uniqueness, you can collect the ports in a set.

Suggested change:

```diff
 def get_open_ports_list(count: int = 5) -> list[int]:
     """Get a list of open ports."""
-    ports = []
-    for _ in range(count):
-        ports.append(get_open_port())
-    return ports
+    ports = set()
+    while len(ports) < count:
+        ports.add(get_open_port())
+    return list(ports)
```
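
A set removes duplicates within one call, but the OS can still hand the same port out again between iterations. A more robust variant of the same idea (a sketch using plain TCP sockets; not the code in this PR) holds every socket open until all ports have been collected:

```python
import socket

def get_open_ports_list(count: int = 5) -> list[int]:
    """Collect `count` distinct free ports, keeping each socket bound
    until all ports are gathered so the OS cannot reissue any of them."""
    socks = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("", 0))  # port 0: the OS picks a free port
            socks.append(s)
        return [s.getsockname()[1] for s in socks]
    finally:
        # Ports are released here; the race window reopens only after
        # the full, duplicate-free list has been assembled.
        for s in socks:
            s.close()
```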

A project member replied:

This seems like a reasonable suggestion?

@github-actions bot commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

minosfuture force-pushed the fix_port_confliction branch from 8c3875d to 893abb9 on July 30, 2025 17:11
@simon-mo (Collaborator) commented:

I think the alternative approach is: once we get an open port, we bind to it and then do a rebind?

@minosfuture (Author) replied:

> I think the alternative approach is: once we get an open port, we bind to it and then do a rebind?

@simon-mo I think that would only work within a single process, but here we have worker processes that rendezvous on a shared port, which is specified by the main process and passed on via ParallelConfig. So I don't think binding + rebinding would work, IIUC.

I actually tried reusing the master port in the TP workers; it raised a port-conflict error because the port was already in use by the DP processes.
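
For context on why the port must be free across processes: the ranks rendezvous on one host:port chosen by the main process, roughly like the TCPStore sketch below (hedged; the exact call sites in vLLM differ, and the host, port, and world-size values here are made up for illustration):

```python
from datetime import timedelta
from torch.distributed import TCPStore

master_port = 29500  # hypothetical: the port the main process picked

# The main process hosts the store, binding the port itself...
store = TCPStore("127.0.0.1", master_port, world_size=4, is_master=True,
                 timeout=timedelta(seconds=60), wait_for_workers=False)

# ...and every worker process connects to the very same port:
#   TCPStore("127.0.0.1", master_port, world_size=4, is_master=False)
# A socket bound in one process cannot simply be "rebound" by another,
# which is why bind-then-rebind does not carry across process boundaries.
```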

@simon-mo (Collaborator) commented Aug 4, 2025

@njhill wdyt? I feel like this might be inflating the arg lists a bit, but I'm also not sure there's a better option.

@njhill (Member) commented Aug 5, 2025

@minosfuture a simpler "quick fix" may be to just add a line here:

```python
self.data_parallel_master_port = get_open_port()
# This ensures subsequent calls to get_open_port() won't choose
# from the range [dp_master_port, dp_master_port + 10).
os.environ["VLLM_DP_MASTER_PORT"] = str(self.data_parallel_master_port)
```

because there's already logic to "reserve" a range of ports; it's just currently only used in the offline case.
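
For illustration, the reservation described here works roughly like this (a sketch assuming the env-var name from the snippet above; vLLM's real helper differs in details): get_open_port keeps drawing candidate ports and rejects any that fall inside the window handed out to DP ranks.

```python
import os
import socket

def _query_free_port() -> int:
    # Bind to port 0 so the OS picks a currently free port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def get_open_port() -> int:
    """Sketch: pick a free port, skipping the reserved DP window."""
    reserved = os.environ.get("VLLM_DP_MASTER_PORT")
    while True:
        port = _query_free_port()
        if reserved is None:
            return port
        dp_master_port = int(reserved)
        # [dp_master_port, dp_master_port + 10) is reserved for DP init.
        if not dp_master_port <= port < dp_master_port + 10:
            return port
```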

@minosfuture (Author) commented Aug 8, 2025

> @minosfuture a simpler "quick fix" may be to just add a line here:
>
> ```python
> self.data_parallel_master_port = get_open_port()
> # This ensures subsequent calls to get_open_port() won't choose
> # from the range [dp_master_port, dp_master_port + 10).
> os.environ["VLLM_DP_MASTER_PORT"] = str(self.data_parallel_master_port)
> ```
>
> because there's already logic to "reserve" a range of ports; it's just currently only used in the offline case.

@njhill the issue is that other processes (not vLLM) may have already taken the data_parallel_master_port + 1 port, so I don't think reserving the range from within the current vLLM process would help.
The assumption behind the current get_next_dp_init_port is that the few ports following data_parallel_master_port are open; this turns out to be invalid quite often.
This PR gets rid of that assumption and queries a list of open ports instead.
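
Concretely, the shape of the change is roughly the following (a simplified before/after sketch, not the exact vLLM code; function names are abbreviated for illustration):

```python
# Before: assume the ports right after the master port are free.
def get_next_dp_init_port_before(dp_master_port: int, dp_rank: int) -> int:
    return dp_master_port + 1 + dp_rank  # breaks if any neighbor is taken

# After: hand out ports that were verified open in the main process.
def get_next_dp_init_port_after(open_ports: list[int], dp_rank: int) -> int:
    return open_ports[dp_rank]
```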

@mergify bot commented Aug 12, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @minosfuture.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@simon-mo (Collaborator) left a comment:

Can you clean up the unrelated changes? The port-list part LGTM. You need to make clear what the expected length of the list is, and what happens when the list is small but the number of DP ranks is big; see the sketch below.
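
One way to handle the short-list case raised here (a hypothetical guard, not necessarily what the PR ended up doing) is to fall back to a fresh query once the pre-fetched list is exhausted:

```python
def get_next_dp_init_port(open_ports: list[int], dp_rank: int) -> int:
    """Return the pre-fetched port for this DP rank, falling back to a
    fresh query when more ranks exist than ports were collected."""
    if dp_rank < len(open_ports):
        return open_ports[dp_rank]
    # Fallback: the pre-fetched list was shorter than the DP world size.
    # get_open_port() is the existing helper sketched earlier.
    return get_open_port()
```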

A second inline review comment (Collaborator):

restore or merge main

minosfuture force-pushed the fix_port_confliction branch from b508e92 to 62e976a on August 21, 2025 03:07
mergify bot removed the needs-rebase label on Aug 21, 2025
minosfuture force-pushed the fix_port_confliction branch from 62e976a to fd3ca69 on August 21, 2025 03:40
@njhill (Member) left a comment:

Thanks @minosfuture, and thanks for your patience!

njhill added the ready label on Aug 21, 2025
@yewentao256 (Member) left a comment:

LGTM, thanks for the work!

njhill merged commit 10f535c into vllm-project:main on Aug 21, 2025 (45 checks passed).
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
Xu-Wenqing pushed a commit to Xu-Wenqing/vllm that referenced this pull request Aug 23, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
mengxingkongzhouhan pushed a commit to mengxingkongzhouhan/vllm that referenced this pull request Aug 30, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

Labels: ci/build, ready, v1

Projects: none yet

5 participants