-
-
Notifications
You must be signed in to change notification settings - Fork 12.1k
[Misc] Support bench serve long context #24373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Misc] Support bench serve long context #24373
Conversation
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses a 'chunk too big' error in benchmarks with long prompts by switching to response.content.iter_any() for streaming. The introduction of StreamedResponseHandler is a good approach to manage chunked SSE messages. My review focuses on ensuring the robustness of this new handler and maintaining clean benchmark output. I've identified a critical issue in the stream handling logic that could lead to incorrect message processing and a high-severity issue regarding a debug print statement that should be removed.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ming Yang <minos.future@gmail.com>
ywang96
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two comments - PTAL
I also wonder if we can simply fix it by modifying read_bufsize of aiohttp session? That's probably cleaner so that we don't need to maintain this customer handler ourself. I believe @MengqingCao has already attempted that
| AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60) | ||
|
|
||
|
|
||
| class StreamedResponseHandler: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also modify async_request_openai_chat_completions to use this handler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
Signed-off-by: Ming Yang <minos.future@gmail.com>
Updated! thx! |
That's a good point - let's stick with this then |
Thanks for this work! I agree that we maintain a custom handler, as actually users should modify the context length if the approximate Just a small suggestion, I think |
Signed-off-by: Ming Yang <minos.future@gmail.com>
Head branch was pushed to by a user without write access
thx! updated is async_request_openai_audio covered by CI? if not, @ywang96 @MengqingCao could you suggest how to test this (example model/bench serve cmd)? The change is the same as other functions, but I want to give it a sanity check. |
I checked the test cases on BTW, for plz refer to https://github.com/vllm-project/vllm/blob/main/tests/benchmarks/test_serve_cli.py |
Signed-off-by: Ming Yang <minos.future@gmail.com>
added a test for chat completions. |
No worries - we typically don't test benchmark scripts anyways in CI. We can get this in. |
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Ming Yang <minos.future@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Ming Yang <minos.future@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
Before this change, we'd hit the following "chunk too big" error when running benchmark with long prompt.
This PR changes the response streaming into iterations (from
response.contenttoresponse.content.iter_any()). With this, we won't hit the error any more but need to take care of the following in this PR:Test Plan
verify both long and short prompt work
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.