
Conversation

@minosfuture
Contributor

@minosfuture minosfuture commented Sep 6, 2025

Purpose

Before this change, we'd hit the "chunk too big" error shown in the traceback below when running the benchmark with a long prompt.

This PR changes the response streaming to iterate over raw chunks (from response.content to response.content.iter_any()). With this, we no longer hit the error, but we need to handle two cases in this PR (a minimal sketch of this handling follows the traceback):

  1. Multiple messages may be received together, so we need to split them on '\n\n'.
  2. An incomplete message may be received, so we need to accumulate chunks until we see a complete message.
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Traceback (most recent call last):
  File "/data/users/yming/gitrepos/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 105, in async_request_openai_completions
    async for chunk_bytes in response.content:
  File "/home/yming/uv_env/vllm/lib64/python3.12/site-packages/aiohttp/streams.py", line 50, in __anext__
    rv = await self.read_func()
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yming/uv_env/vllm/lib64/python3.12/site-packages/aiohttp/streams.py", line 321, in readline
    return await self.readuntil()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yming/uv_env/vllm/lib64/python3.12/site-packages/aiohttp/streams.py", line 349, in readuntil
    raise ValueError("Chunk too big")
ValueError: Chunk too big
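
For illustration, here is a minimal sketch of the accumulate-and-split logic described in points 1 and 2 above; the method names and details are hypothetical and may differ from the handler actually added in this PR.

class StreamedResponseHandler:
    # Accumulates raw chunks and yields complete SSE messages.

    def __init__(self):
        self.buffer = ""

    def process_chunk(self, chunk_bytes: bytes) -> list[str]:
        # A chunk from iter_any() may contain several messages, or only part
        # of one, so buffer everything and split on the SSE delimiter "\n\n".
        self.buffer += chunk_bytes.decode("utf-8")
        messages = []
        while "\n\n" in self.buffer:
            message, self.buffer = self.buffer.split("\n\n", 1)
            if message:
                messages.append(message)
        return messages

A caller inside the existing async request function would then iterate roughly like this:

    handler = StreamedResponseHandler()
    async for chunk_bytes in response.content.iter_any():
        for message in handler.process_chunk(chunk_bytes):
            ...  # parse the "data: ..." payload as before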

Test Plan

Verify that both long and short prompts work.

Test Result

============ Serving Benchmark Result ============
Successful requests:                     1
Maximum request concurrency:             1
Benchmark duration (s):                  33.71
Total input tokens:                      204800
Total generated tokens:                  1024
Request throughput (req/s):              0.03
Output token throughput (tok/s):         30.37
Total Token throughput (tok/s):          6105.28
---------------Time to First Token----------------
Mean TTFT (ms):                          1147.69
Median TTFT (ms):                        1147.69
P99 TTFT (ms):                           1147.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.83
Median TPOT (ms):                        31.83
P99 TPOT (ms):                           31.83
---------------Inter-token Latency----------------
Mean ITL (ms):                           31.83
Median ITL (ms):                         31.73
P99 ITL (ms):                            33.58
==================================================

============ Serving Benchmark Result ============
Successful requests:                     1
Maximum request concurrency:             1
Benchmark duration (s):                  0.22
Total input tokens:                      32
Total generated tokens:                  8
Request throughput (req/s):              4.59
Output token throughput (tok/s):         36.69
Total Token throughput (tok/s):          183.44
---------------Time to First Token----------------
Mean TTFT (ms):                          43.37
Median TTFT (ms):                        43.37
P99 TTFT (ms):                           43.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.88
Median TPOT (ms):                        24.88
P99 TPOT (ms):                           24.88
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.88
Median ITL (ms):                         24.71
P99 ITL (ms):                            25.66
==================================================


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
@mergify mergify bot added the performance Performance-related issues label Sep 6, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a 'chunk too big' error in benchmarks with long prompts by switching to response.content.iter_any() for streaming. The introduction of StreamedResponseHandler is a good approach to manage chunked SSE messages. My review focuses on ensuring the robustness of this new handler and maintaining clean benchmark output. I've identified a critical issue in the stream handling logic that could lead to incorrect message processing and a high-severity issue regarding a debug print statement that should be removed.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ming Yang <minos.future@gmail.com>
@youkaichao youkaichao requested a review from ywang96 September 7, 2025 02:13
Member

@ywang96 ywang96 left a comment


I left two comments - PTAL

I also wonder if we can simply fix it by modifying the read_bufsize of the aiohttp session? That's probably cleaner, so that we don't need to maintain this custom handler ourselves. I believe @MengqingCao has already attempted that.
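
For reference, the read_bufsize alternative would look roughly like the sketch below. It was not adopted in this PR, and the buffer size shown is only illustrative:

import aiohttp

AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)

# ClientSession accepts a read_bufsize argument (64 KiB by default); raising
# it lets readuntil() hold larger SSE messages before raising "Chunk too big",
# but any fixed value is still only an approximation.
session = aiohttp.ClientSession(
    timeout=AIOHTTP_TIMEOUT,
    read_bufsize=16 * 1024 * 1024,  # e.g. 16 MiB
)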

AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)


class StreamedResponseHandler:
Member


Can you also modify async_request_openai_chat_completions to use this handler?

Contributor Author


updated

Signed-off-by: Ming Yang <minos.future@gmail.com>
@minosfuture
Contributor Author

I left two comments - PTAL

I also wonder if we can simply fix it by modifying the read_bufsize of the aiohttp session? That's probably cleaner, so that we don't need to maintain this custom handler ourselves. I believe @MengqingCao has already attempted that.

Updated, thanks!
I think it would be better to have this handler, which frees us from maintaining a proper buffer size once and for all. It should work with any context length.

@ywang96
Member

ywang96 commented Sep 7, 2025

I left two comments - PTAL
I also wonder if we can simply fix it by modifying the read_bufsize of the aiohttp session? That's probably cleaner, so that we don't need to maintain this custom handler ourselves. I believe @MengqingCao has already attempted that.

Updated, thanks! I think it would be better to have this handler, which frees us from maintaining a proper buffer size once and for all. It should work with any context length.

That's a good point - let's stick with this then

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 7, 2025
@ywang96 ywang96 enabled auto-merge (squash) September 7, 2025 09:01
@MengqingCao
Contributor

Updated, thanks! I think it would be better to have this handler, which frees us from maintaining a proper buffer size once and for all. It should work with any context length.

Thanks for this work! I agree that we should maintain a custom handler, since otherwise users would have to adjust their context length whenever the approximate read_bufsize in #24333 is not enough for them.

Just a small suggestion: I think async_request_openai_audio and async_request_openai_embeddings could also use this handler, and so could the functions in benchmarks/backend_request_func.py.

Signed-off-by: Ming Yang <minos.future@gmail.com>
auto-merge was automatically disabled September 7, 2025 17:29

Head branch was pushed to by a user without write access

@minosfuture
Contributor Author

minosfuture commented Sep 7, 2025

Updated, thanks! I think it would be better to have this handler, which frees us from maintaining a proper buffer size once and for all. It should work with any context length.

Thanks for this work! I agree that we should maintain a custom handler, since otherwise users would have to adjust their context length whenever the approximate read_bufsize in #24333 is not enough for them.

Just a small suggestion: I think async_request_openai_audio and async_request_openai_embeddings could also use this handler, and so could the functions in benchmarks/backend_request_func.py.

Thanks! Updated async_request_openai_audio.
async_request_openai_embeddings does not check response.content, so it doesn't need the handler.
I'll leave benchmarks/backend_request_func.py as is for now since it'll be deprecated.

Is async_request_openai_audio covered by CI? If not, @ywang96 @MengqingCao could you suggest how to test it (an example model / bench serve command)? The change is the same as in the other functions, but I want to give it a sanity check.

@MengqingCao
Contributor

MengqingCao commented Sep 8, 2025

Is async_request_openai_audio covered by CI? If not, @ywang96 @MengqingCao could you suggest how to test it (an example model / bench serve command)? The change is the same as in the other functions, but I want to give it a sanity check.

I checked the test cases for vllm bench, and there are no cases for endpoints other than openai. Maybe we could add tests for the openai-chat and openai-audio endpoint types as a sanity check.

BTW, for openai, the default endpoint type of vllm bench serve, we could also add a test for the long-context-length scenario to prevent it from breaking.

Please refer to https://github.com/vllm-project/vllm/blob/main/tests/benchmarks/test_serve_cli.py
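
A hypothetical sketch of such a sanity-check test, in the spirit of tests/benchmarks/test_serve_cli.py; the fixture name and CLI flags here follow this conversation and may not match the actual test file:

import subprocess

def test_bench_serve_openai_chat(server):  # a running-server fixture is assumed
    # Exercise the streamed chat-completions path via the benchmark CLI.
    command = [
        "vllm", "bench", "serve",
        "--model", server.model,
        "--endpoint-type", "openai-chat",
        "--dataset-name", "random",
        "--num-prompts", "5",
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    assert result.returncode == 0, result.stderr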

Signed-off-by: Ming Yang <minos.future@gmail.com>
@minosfuture
Contributor Author

minosfuture commented Sep 9, 2025

Is async_request_openai_audio covered by CI? If not, @ywang96 @MengqingCao could you suggest how to test it (an example model / bench serve command)? The change is the same as in the other functions, but I want to give it a sanity check.

I checked the test cases for vllm bench, and there are no cases for endpoints other than openai. Maybe we could add tests for the openai-chat and openai-audio endpoint types as a sanity check.

BTW, for openai, the default endpoint type of vllm bench serve, we could also add a test for the long-context-length scenario to prevent it from breaking.

Please refer to https://github.com/vllm-project/vllm/blob/main/tests/benchmarks/test_serve_cli.py

Added a test for chat completions.
It looks like we don't have random audio dataset support.
For the long-context case, we probably can't afford to run a model that supports long context, like KimiK2, in CI.
Let's get this in for now? @ywang96 @MengqingCao

@ywang96
Member

ywang96 commented Sep 9, 2025

Is async_request_openai_audio covered by CI? If not, @ywang96 @MengqingCao could you suggest how to test it (an example model / bench serve command)? The change is the same as in the other functions, but I want to give it a sanity check.

I checked the test cases for vllm bench, and there are no cases for endpoints other than openai. Maybe we could add tests for the openai-chat and openai-audio endpoint types as a sanity check.
BTW, for openai, the default endpoint type of vllm bench serve, we could also add a test for the long-context-length scenario to prevent it from breaking.
Please refer to https://github.com/vllm-project/vllm/blob/main/tests/benchmarks/test_serve_cli.py

Added a test for chat completions. It looks like we don't have random audio dataset support. For the long-context case, we probably can't afford to run a model that supports long context, like KimiK2, in CI. Let's get this in for now? @ywang96 @MengqingCao

No worries - we typically don't test benchmark scripts in CI anyway. We can get this in.

@ywang96 ywang96 merged commit 1823a00 into vllm-project:main Sep 9, 2025
38 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
sducouedic pushed a commit to sducouedic/vllm that referenced this pull request Oct 16, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
