
Conversation

@minosfuture
Contributor

@minosfuture minosfuture commented Sep 6, 2025

Purpose

Before this change, we'd hit the "chunk too big" error shown in the traceback below when running the benchmark with a long prompt.

This PR changes the response streaming to iterate over raw chunks (from response.content to response.content.iter_any()). With this, we no longer hit the error, but we need to handle two cases in this PR (a minimal sketch of this handling follows the traceback):

  1. Multiple messages may be received together, so we need to split them on '\n\n'.
  2. An incomplete message may be received, so we need to accumulate chunks until we see a complete message.
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Traceback (most recent call last):
  File "/data/users/yming/gitrepos/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 105, in async_request_openai_completions
    async for chunk_bytes in response.content:
  File "/home/yming/uv_env/vllm/lib64/python3.12/site-packages/aiohttp/streams.py", line 50, in __anext__
    rv = await self.read_func()
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yming/uv_env/vllm/lib64/python3.12/site-packages/aiohttp/streams.py", line 321, in readline
    return await self.readuntil()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yming/uv_env/vllm/lib64/python3.12/site-packages/aiohttp/streams.py", line 349, in readuntil
    raise ValueError("Chunk too big")
ValueError: Chunk too big
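
For illustration, here is a minimal sketch of the accumulate-and-split logic described in points 1 and 2 above; the method names and details are hypothetical and may differ from the handler actually added in this PR.

class StreamedResponseHandler:
    # Accumulates raw chunks and yields complete SSE messages.

    def __init__(self):
        self.buffer = ""

    def process_chunk(self, chunk_bytes: bytes) -> list[str]:
        # A chunk from iter_any() may contain several messages, or only part
        # of one, so buffer everything and split on the SSE delimiter "\n\n".
        self.buffer += chunk_bytes.decode("utf-8")
        messages = []
        while "\n\n" in self.buffer:
            message, self.buffer = self.buffer.split("\n\n", 1)
            if message:
                messages.append(message)
        return messages

A caller inside the existing async request function would then iterate roughly like this:

    handler = StreamedResponseHandler()
    async for chunk_bytes in response.content.iter_any():
        for message in handler.process_chunk(chunk_bytes):
            ...  # parse the "data: ..." payload as before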

Test Plan

Verify that both long and short prompts work.

Test Result

============ Serving Benchmark Result ============
Successful requests:                     1
Maximum request concurrency:             1
Benchmark duration (s):                  33.71
Total input tokens:                      204800
Total generated tokens:                  1024
Request throughput (req/s):              0.03
Output token throughput (tok/s):         30.37
Total Token throughput (tok/s):          6105.28
---------------Time to First Token----------------
Mean TTFT (ms):                          1147.69
Median TTFT (ms):                        1147.69
P99 TTFT (ms):                           1147.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.83
Median TPOT (ms):                        31.83
P99 TPOT (ms):                           31.83
---------------Inter-token Latency----------------
Mean ITL (ms):                           31.83
Median ITL (ms):                         31.73
P99 ITL (ms):                            33.58
==================================================

============ Serving Benchmark Result ============
Successful requests:                     1
Maximum request concurrency:             1
Benchmark duration (s):                  0.22
Total input tokens:                      32
Total generated tokens:                  8
Request throughput (req/s):              4.59
Output token throughput (tok/s):         36.69
Total Token throughput (tok/s):          183.44
---------------Time to First Token----------------
Mean TTFT (ms):                          43.37
Median TTFT (ms):                        43.37
P99 TTFT (ms):                           43.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.88
Median TPOT (ms):                        24.88
P99 TPOT (ms):                           24.88
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.88
Median ITL (ms):                         24.71
P99 ITL (ms):                            25.66
==================================================


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
@mergify mergify bot added the performance Performance-related issues label Sep 6, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a 'chunk too big' error in benchmarks with long prompts by switching to response.content.iter_any() for streaming. The introduction of StreamedResponseHandler is a good approach to manage chunked SSE messages. My review focuses on ensuring the robustness of this new handler and maintaining clean benchmark output. I've identified a critical issue in the stream handling logic that could lead to incorrect message processing and a high-severity issue regarding a debug print statement that should be removed.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
@youkaichao youkaichao requested a review from ywang96 September 7, 2025 02:13
Member

@ywang96 ywang96 left a comment


I left two comments - PTAL

I also wonder if we can simply fix it by modifying the read_bufsize of the aiohttp session? That's probably cleaner, so that we don't need to maintain this custom handler ourselves. I believe @MengqingCao has already attempted that.
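
For reference, the read_bufsize alternative would look roughly like the sketch below. It was not adopted in this PR, and the buffer size shown is only illustrative:

import aiohttp

AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)

# ClientSession accepts a read_bufsize argument (64 KiB by default); raising
# it lets readuntil() hold larger SSE messages before raising "Chunk too big",
# but any fixed value is still only an approximation.
session = aiohttp.ClientSession(
    timeout=AIOHTTP_TIMEOUT,
    read_bufsize=16 * 1024 * 1024,  # e.g. 16 MiB
)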

AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)


class StreamedResponseHandler:
Member


Can you also modify async_request_openai_chat_completions to use this handler?

Contributor Author


updated

Signed-off-by: Ming Yang <minos.future@gmail.com>
@minosfuture
Contributor Author

I left two comments - PTAL

I also wonder if we can simply fix it by modifying the read_bufsize of the aiohttp session? That's probably cleaner, so that we don't need to maintain this custom handler ourselves. I believe @MengqingCao has already attempted that.

Updated, thanks!
I think it would be better to have this handler, which frees us from maintaining a proper buffer size once and for all. It should work with any context length.

@ywang96
Member

ywang96 commented Sep 7, 2025

I left two comments - PTAL
I also wonder if we can simply fix it by modifying the read_bufsize of the aiohttp session? That's probably cleaner, so that we don't need to maintain this custom handler ourselves. I believe @MengqingCao has already attempted that.

Updated, thanks! I think it would be better to have this handler, which frees us from maintaining a proper buffer size once and for all. It should work with any context length.

That's a good point - let's stick with this then

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 7, 2025
@ywang96 ywang96 enabled auto-merge (squash) September 7, 2025 09:01
@MengqingCao
Contributor

Updated, thanks! I think it would be better to have this handler, which frees us from maintaining a proper buffer size once and for all. It should work with any context length.

Thanks for this work! I agree that we should maintain a custom handler, since otherwise users would have to adjust their context length whenever the approximate read_bufsize in #24333 is not enough for them.

Just a small suggestion: I think async_request_openai_audio and async_request_openai_embeddings could also use this handler, and so could the functions in benchmarks/backend_request_func.py.

Signed-off-by: Ming Yang <minos.future@gmail.com>
auto-merge was automatically disabled September 7, 2025 17:29

Head branch was pushed to by a user without write access

@minosfuture
Contributor Author

minosfuture commented Sep 7, 2025

Updated, thanks! I think it would be better to have this handler, which frees us from maintaining a proper buffer size once and for all. It should work with any context length.

Thanks for this work! I agree that we should maintain a custom handler, since otherwise users would have to adjust their context length whenever the approximate read_bufsize in #24333 is not enough for them.

Just a small suggestion: I think async_request_openai_audio and async_request_openai_embeddings could also use this handler, and so could the functions in benchmarks/backend_request_func.py.

Thanks! Updated async_request_openai_audio.
async_request_openai_embeddings does not check response.content, so it doesn't need the handler.
I'll leave benchmarks/backend_request_func.py as is for now since it'll be deprecated.

Is async_request_openai_audio covered by CI? If not, @ywang96 @MengqingCao could you suggest how to test it (an example model / bench serve command)? The change is the same as in the other functions, but I want to give it a sanity check.

@MengqingCao
Contributor

MengqingCao commented Sep 8, 2025

Is async_request_openai_audio covered by CI? If not, @ywang96 @MengqingCao could you suggest how to test it (an example model / bench serve command)? The change is the same as in the other functions, but I want to give it a sanity check.

I checked the test cases for vllm bench, and there are no cases for endpoints other than openai. Maybe we could add tests for the openai-chat and openai-audio endpoint types as a sanity check.

BTW, for openai, the default endpoint type of vllm bench serve, we could also add a test for the long-context-length scenario to prevent it from breaking.

Please refer to https://github.com/vllm-project/vllm/blob/main/tests/benchmarks/test_serve_cli.py
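
A hypothetical sketch of such a sanity-check test, in the spirit of tests/benchmarks/test_serve_cli.py; the fixture name and CLI flags here follow this conversation and may not match the actual test file:

import subprocess

def test_bench_serve_openai_chat(server):  # a running-server fixture is assumed
    # Exercise the streamed chat-completions path via the benchmark CLI.
    command = [
        "vllm", "bench", "serve",
        "--model", server.model,
        "--endpoint-type", "openai-chat",
        "--dataset-name", "random",
        "--num-prompts", "5",
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    assert result.returncode == 0, result.stderr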

Signed-off-by: Ming Yang <minos.future@gmail.com>
@minosfuture
Contributor Author

minosfuture commented Sep 9, 2025

Is async_request_openai_audio covered by CI? If not, @ywang96 @MengqingCao could you suggest how to test it (an example model / bench serve command)? The change is the same as in the other functions, but I want to give it a sanity check.

I checked the test cases for vllm bench, and there are no cases for endpoints other than openai. Maybe we could add tests for the openai-chat and openai-audio endpoint types as a sanity check.

BTW, for openai, the default endpoint type of vllm bench serve, we could also add a test for the long-context-length scenario to prevent it from breaking.

Please refer to https://github.com/vllm-project/vllm/blob/main/tests/benchmarks/test_serve_cli.py

Added a test for chat completions.
It looks like we don't have random audio dataset support.
For the long-context case, we probably can't afford to run a model that supports long context, like KimiK2, in CI.
Let's get this in for now? @ywang96 @MengqingCao

@ywang96
Member

ywang96 commented Sep 9, 2025

Is async_request_openai_audio covered by CI? If not, @ywang96 @MengqingCao could you suggest how to test it (an example model / bench serve command)? The change is the same as in the other functions, but I want to give it a sanity check.

I checked the test cases for vllm bench, and there are no cases for endpoints other than openai. Maybe we could add tests for the openai-chat and openai-audio endpoint types as a sanity check.
BTW, for openai, the default endpoint type of vllm bench serve, we could also add a test for the long-context-length scenario to prevent it from breaking.
Please refer to https://github.com/vllm-project/vllm/blob/main/tests/benchmarks/test_serve_cli.py

Added a test for chat completions. It looks like we don't have random audio dataset support. For the long-context case, we probably can't afford to run a model that supports long context, like KimiK2, in CI. Let's get this in for now? @ywang96 @MengqingCao

No worries - we typically don't test benchmark scripts in CI anyway. We can get this in.

@ywang96 ywang96 merged commit 1823a00 into vllm-project:main Sep 9, 2025
38 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
sducouedic pushed a commit to sducouedic/vllm that referenced this pull request Oct 16, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
