[Bugfix] Fix port conflict by obtaining a list of open ports upfront #21894
Conversation
Code Review
This pull request aims to fix port conflicts by pre-fetching a list of open ports. The overall approach is sound, but there are a few critical issues in the implementation that could lead to the same port being used multiple times, defeating the purpose of the change. I've identified issues in get_open_ports_list that could result in duplicate ports, and a logic error in get_next_dp_init_port that also leads to port reuse. Additionally, there's a minor docstring issue. Addressing these points will make the solution more robust.
vllm/utils/__init__.py
The current implementation of get_open_ports_list may return a list with duplicate ports. This can happen if get_open_port() is called multiple times in a loop and the underlying OS returns the same ephemeral port because the socket is immediately closed and the port is released. This can lead to port conflicts, which this PR aims to solve.
To guarantee uniqueness, you can collect the ports in a set.
```python
# Current implementation: may return duplicate ports
def get_open_ports_list(count: int = 5) -> list[int]:
    """Get a list of open ports."""
    ports = []
    for _ in range(count):
        ports.append(get_open_port())
    return ports
```

```python
# Suggested change: collect into a set to guarantee uniqueness
def get_open_ports_list(count: int = 5) -> list[int]:
    """Get a list of open ports."""
    ports = set()
    while len(ports) < count:
        ports.add(get_open_port())
    return list(ports)
```
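For context, a minimal sketch of how such helpers are typically implemented (an illustration, not vLLM's exact code) shows why duplicates can occur: the probing socket is closed immediately, so the OS is free to hand back the same ephemeral port on the next call.

```python
import socket


def get_open_port() -> int:
    """Sketch: ask the OS for an ephemeral port by binding to port 0.

    The socket is closed as soon as the `with` block exits, so the
    port is released and may be returned again by a later call --
    which is why uniqueness must be enforced by the caller.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
        return s.getsockname()[1]


def get_open_ports_list(count: int = 5) -> list[int]:
    """Collect `count` distinct open ports using a set."""
    ports: set[int] = set()
    while len(ports) < count:
        ports.add(get_open_port())
    return list(ports)
```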
This seems like a reasonable suggestion?
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Force-pushed from 8c3875d to 893abb9
I think the alternative approach is to get one open port, bind to it, then do a rebind?
@simon-mo I think this would only work in a single process, but in this case we have worker processes that rendezvous on a shared port, which is specified by the main process and passed on via ParallelConfig. So I don't think binding + rebinding would work, IIUC.
@njhill wdyt? I feel like this might be inflating the arg lists a bit but also not sure if there's a better option |
@minosfuture a simpler "quick fix" may be to just add a line here:

```python
self.data_parallel_master_port = get_open_port()
# This ensures subsequent calls to get_open_port() won't choose
# from the range [dp_master_port, dp_master_port + 10).
os.environ["VLLM_DP_MASTER_PORT"] = str(self.data_parallel_master_port)
```

because there's already logic to "reserve" a range of ports; it's just currently only used in the offline case.
@njhill the issue is that other processes (not vllm) may have taken the (data_parallel_master_port+1) port. I don't think reserving it from the current vllm process would work. |
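The range-reservation idea discussed above could look roughly like this. This is a hypothetical sketch: the `VLLM_DP_MASTER_PORT` variable and the range width of 10 come from the comments above, but the helper name and body are invented for illustration.

```python
import os
import socket

DP_PORT_RANGE = 10  # width of the reserved block, per the discussion above


def get_open_port_avoiding_reserved() -> int:
    """Sketch: pick an ephemeral port, skipping the reserved
    range [dp_master_port, dp_master_port + DP_PORT_RANGE)."""
    reserved = os.environ.get("VLLM_DP_MASTER_PORT")
    lo = int(reserved) if reserved else None
    while True:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
            port = s.getsockname()[1]
        # Retry if the OS handed us a port inside the reserved block.
        if lo is None or not (lo <= port < lo + DP_PORT_RANGE):
            return port
```

As noted in the reply above, this only protects against collisions among vLLM's own processes; an unrelated process can still grab a port in the reserved range.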
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 5ca062f to 2c2031f
simon-mo
left a comment
Can you clean up the unrelated changes? The port-list part LGTM. You need to make sure the command documents the expected length of the list, and what happens when the list is small but the number of DP ranks is large.
restore or merge main
Force-pushed from b065a5d to 3de1520
Signed-off-by: Ming Yang <minos.future@gmail.com>
Force-pushed from b508e92 to 62e976a
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
Force-pushed from 62e976a to fd3ca69
njhill
left a comment
Thanks @minosfuture, and thanks for your patience!
yewentao256
left a comment
LGTM, thanks for the work!
…llm-project#21894) Signed-off-by: Ming Yang <minos.future@gmail.com> Signed-off-by: Duncan Moss <djm.moss@gmail.com>
…llm-project#21894) Signed-off-by: Ming Yang <minos.future@gmail.com> Signed-off-by: root <xwq391974@alibaba-inc.com>
…llm-project#21894) Signed-off-by: Ming Yang <minos.future@gmail.com>
…llm-project#21894) Signed-off-by: Ming Yang <minos.future@gmail.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Purpose
Mitigate #21638 by querying a list of open ports in the main process.
A race condition still persists, but is much less likely: an open port could be taken by another process after it is queried. In comparison, before this PR we were betting on `master_port + 1` being open. The ideal solution would be locking this list of open ports for torch.distributed, but that would require sophisticated changes, including changes to torch.distributed itself.
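The approach can be illustrated with a simplified sketch. The class and method names follow the PR discussion (`ParallelConfig`, `get_next_dp_init_port`, `get_open_ports_list`), but the bodies here are illustrative stand-ins, not vLLM's actual code.

```python
import socket
from dataclasses import dataclass, field


def get_open_ports_list(count: int = 5) -> list[int]:
    """Collect `count` distinct open ports from the OS, upfront,
    in the main process."""
    ports: set[int] = set()
    while len(ports) < count:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
            ports.add(s.getsockname()[1])
    return list(ports)


@dataclass
class ParallelConfig:  # simplified stand-in for vLLM's ParallelConfig
    data_parallel_master_port: int = 29500
    data_parallel_master_port_list: list[int] = field(default_factory=list)

    def get_next_dp_init_port(self) -> int:
        """Prefer a pre-collected open port; fall back to the old
        sequential (racy) master_port + 1 scheme if the list runs out."""
        if self.data_parallel_master_port_list:
            return self.data_parallel_master_port_list.pop()
        port = self.data_parallel_master_port
        self.data_parallel_master_port += 1
        return port
```

The fallback branch addresses the review question about what happens when the list is shorter than the number of DP ranks: later ranks simply revert to the previous sequential behavior.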
Test Plan
Verify dp+tp+ep serving and run lm_eval with:
Test Result
local-chat-completions (model=meta-llama/Llama-4-Scout-17B-16E,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=32), gen_kwargs: (None), limit: 200.0, num_fewshot: 5, batch_size: 1
(Optional) Documentation Update