Add return_token_ids parameter to OpenAI API endpoints #22587
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; instead, only the fastcheck CI runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request introduces the return_token_ids_alongside parameter to the OpenAI-compatible endpoints for chat and text completions. This is a well-motivated feature, particularly for agent-based reinforcement learning scenarios where having direct access to token IDs for both prompts and responses is essential. The implementation correctly adds the new parameter to the request models and populates the corresponding token ID fields in the response models. My main feedback is the absence of tests. While the changes appear correct, adding comprehensive tests is necessary to validate the new functionality and ensure long-term maintainability.
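For readers unfamiliar with the protocol models, a rough sketch of what such a change could look like is shown below. This is a hedged illustration only: the class ChatCompletionResponseChoice, the field placement, defaults, and descriptions are assumptions and may differ from the actual diff in vllm/entrypoints/openai/protocol.py.

```python
# Hypothetical sketch only; the real models in vllm/entrypoints/openai/protocol.py
# contain many more fields and may attach the token-ID fields elsewhere.
from typing import Optional

from pydantic import BaseModel, Field


class ChatCompletionRequest(BaseModel):
    # ... existing OpenAI-compatible request fields ...
    return_token_ids_alongside: bool = Field(
        default=False,
        description="If true, include prompt and response token IDs in the response.",
    )


class ChatCompletionResponseChoice(BaseModel):
    # ... existing choice fields (index, message, finish_reason, ...) ...
    token_ids: Optional[list[int]] = None  # response token IDs, set only when requested


class ChatCompletionResponse(BaseModel):
    # ... existing response fields (id, model, choices, usage, ...) ...
    prompt_token_ids: Optional[list[int]] = None  # prompt token IDs, set only when requested
```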
vllm/entrypoints/openai/protocol.py
Outdated
This pull request introduces a valuable feature for agent-based scenarios. However, it currently lacks tests. Adding unit and integration tests is crucial to ensure the new return_token_ids_alongside parameter works as expected across all affected endpoints (/v1/chat/completions and /v1/completions) and to prevent future regressions. Please add tests that cover both streaming and non-streaming responses, and verify that the token IDs for both prompts and responses are correctly returned when the flag is enabled, and not returned when disabled.
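As a rough illustration of the kind of test being requested (not the tests eventually added in this PR), here is a sketch using the openai Python client against an already-running vLLM server. The base URL, model name, and the exact locations of the token-ID fields in the response are assumptions; the parameter is spelled return_token_ids_alongside as in this revision (it was later renamed to return_token_ids).

```python
# Illustrative test sketch, not the tests added in this PR. Assumes a vLLM
# OpenAI-compatible server is already running at BASE_URL and serving MODEL;
# both values are placeholders.
import openai
import pytest

BASE_URL = "http://localhost:8000/v1"
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


@pytest.mark.parametrize("enabled", [True, False])
def test_chat_completion_token_ids(enabled: bool):
    client = openai.OpenAI(base_url=BASE_URL, api_key="EMPTY")
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Say hello."}],
        max_tokens=8,
        # Non-OpenAI parameters are passed through extra_body with the openai client.
        extra_body={"return_token_ids_alongside": enabled},
    )
    data = resp.model_dump()  # extra server-side fields are kept by the client models
    prompt_ids = data.get("prompt_token_ids")
    response_ids = data["choices"][0].get("token_ids")
    if enabled:
        assert isinstance(prompt_ids, list) and all(isinstance(t, int) for t in prompt_ids)
        assert isinstance(response_ids, list) and all(isinstance(t, int) for t in response_ids)
    else:
        assert prompt_ids is None
        assert response_ids is None
```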
- Add optional return_token_ids_alongside parameter to ChatCompletionRequest and CompletionRequest
- Include token_ids and prompt_token_ids fields in response models when requested
- Implement conditional logic in serving endpoints to return token IDs alongside generated text
- Useful for debugging and agent scenarios where token-level tracing is needed

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
Force-pushed from 2954f14 to 48dd2f4
Split long comment onto multiple lines for better readability. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
Improve the formatting of conditional token_ids and prompt_token_ids assignments to be more concise and readable. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
youkaichao
left a comment
cc @njhill
The idea makes sense to me, since we also support other non-OpenAI-API-compatible features like beam search.
The key concern here is whether it adds any overhead when people don't request the token IDs output.
In addition, please add some tests to make sure the behavior is covered.
Also cc @hmellor: do we have any centralized doc to keep track of these non-OpenAI-compatible behaviors?
Not a doc specifically for this, but in https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html each API has separate sections for normal and "extra" params. Although, looking at the src, this actually excludes the OpenAI arguments completely...
@youkaichao Thanks for the review. I don't think it adds any overhead (if you are talking about machine overhead rather than mental overhead), because the variables are already there and I'm just returning them when the sampling flag is set. I'll add tests. It will probably take me some time to set up the test env.
I think this is reasonable/useful. I don't like the parameter name though. A couple of questions:
Totally agree that we need a simple, raw, token-in-token-out endpoint!
There is a
It adds one more complexity to the API, and I can't see why that's necessary. If that comes as a feature request in future, we can make
This will simply break all existing agent code and frameworks based on the OpenAI API endpoint. We need to perform rollouts on the OpenAI endpoint while tracing the token IDs in the telemetry for training. If someone is not afraid of refactoring code, they can do it, but I guess that's not part of this PR.
Yes I guessed that was the reason but
Sounds reasonable
Right, I wasn't suggesting this would replace the OpenAI API; it would just be a simpler alternative. And I wasn't suggesting it should be tied to this PR!
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
@njhill @youkaichao @hmellor The test work is done. I've brought in support for streaming=True. It's a bit tricky. Please help review.
I can't see the full logs of the fastcheck here: https://buildkite.com/vllm/fastcheck/builds/34977/steps/canvas?jid=01989c1a-a53e-4511-b53f-2f4dfb61d9ba
Is the failure related to the changes I've made?
Can you merge from main? It should resolve the CI failure |
…de-feature Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
I think the newly added tests went well. Related logs:
I failed to find failed tests though. XFAIL doesn't matter, I guess? I got some SUBFAIL like this:
I think it's related to the API schema change? How do I properly update the schema (i.e., openapi.json)? And why do APIs like rerank fail? I didn't touch them at all.
Looks like a connection error, let me retry |
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
…ature Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
Hi @DarkLight1337 @simon-mo @njhill The tests finally passed! 😄 I think there is a bug in the grammar module when
Thanks @ultmaster, so the OpenAI schema test "fix" you made is unrelated to this PR? Just curious why it was failing on this branch but not on main (that I'm aware of)?
@njhill I think there is a bug in the grammar module in handling empty strings. I temporarily bypassed the related tests to get a CI pass. Fixing that part is not my work, I think. :)
It was not failing on the main branch; it's also not failing on my local machine; it's only failing in the CI environment. It seems that the test cases in
Nevertheless, I don't think setting the grammar to be empty is a common real use case, so it's up to you whether to fix it. I can also revert the patch if the CI failure is okay (but it appears not to be, based on the conversation).
@ultmaster ok, thank you for the investigation. Let's open a new issue to track the test problem and it can reference the filter change you made here. |
…22587) Signed-off-by: Yuge Zhang <scottyugochang@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
Getting the token IDs (not logprobs) from the vLLM inference responses is very important for agent RL-training scenarios, especially when the agent loop relies on the vLLM OpenAI endpoint as a fast server to perform rollouts and collect trajectories like [(prompt_token_ids, response_token_ids, reward), ...]. The agents need the raw prompts and responses as strings. They also need to track the under-the-hood tokens, so that the RL algorithms can use them for optimization.

When I authored agent-lightning in the first place, it was very hard to get the prompt token IDs and response token IDs from vLLM, despite the fact that they are set as local variables in the OpenAI-compatible server implementation. This led me to write a monkey-patch, which essentially modified OpenAIServingChat.chat_completion_full_generator to capture the input token IDs and output token IDs. The code is here, and it's not long: https://github.com/microsoft/agent-lightning/blob/24d590f4ea135bd88d8d3c5299526b7d5866b100/agentlightning/instrumentation/vllm.py
Recently, I found that vLLM has added support for prompt_logprobs and return_tokens_as_token_ids as additional parameters to the chat completion API. Though I don't need logprobs, I thought it would be wonderful to get the token IDs from the logprobs. But it turned out to be different from what I expected. I tested with Qwen2.5-0.5B-Instruct, and prompt_logprobs gives me different results from prompt_token_ids:

For responses, the returned "token:12345" entries look okay with return_tokens_as_token_ids on. It's a little awkward, though, to parse the integer out of a string like "token:xxxx".

So, this PR adds the token IDs alongside the prompts and responses.

Update: renamed to return_token_ids.
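For example, a rollout loop for collecting training trajectories could use the new flag roughly as in the sketch below. This is a hedged illustration, not code from this PR or from agent-lightning; the server URL, model name, reward function, and the exact placement of prompt_token_ids and token_ids in the response body are assumptions.

```python
# Sketch: collect (prompt_token_ids, response_token_ids, reward) trajectories
# from a vLLM OpenAI-compatible server using the return_token_ids flag added here.
# BASE_URL, MODEL, and compute_reward are placeholders.
import requests

BASE_URL = "http://localhost:8000"
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


def compute_reward(text: str) -> float:
    # Placeholder reward; a real agent loop would score the rollout here.
    return float(len(text) > 0)


def rollout(prompt: str):
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "return_token_ids": True,  # the parameter introduced by this PR
    }
    resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
    resp.raise_for_status()
    body = resp.json()
    choice = body["choices"][0]
    text = choice["message"]["content"]
    # Assumed field locations: prompt token IDs at the top level of the response,
    # response token IDs on each choice.
    return body["prompt_token_ids"], choice["token_ids"], compute_reward(text)


trajectories = [rollout(p) for p in ["What is 2 + 2?", "Name a prime number."]]
```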
Test Plan
Unit tests added.
Test Result
Passed locally.
(Optional) Documentation Update
In code descriptions.