[Optimization] Speed up function _convert_tokens_to_string_with_added_encoders by 13.7x
#20413
Conversation
…by 60%

Here is an optimized rewrite of your program, focusing mainly on `_convert_tokens_to_string_with_added_encoders`, which (according to your profiling) is the bottleneck.

- Cache repeated lookups (such as `tokenizer.get_added_vocab()` and `tokenizer.all_special_tokens`).
- Avoid unnecessary `list[str]` allocations inside hot loops.
- Use local variables for repeated attribute lookups.
- Precompute the set of added tokens (O(1) lookups), and convert `all_special_tokens` to a set only once.
- Convert `current_sub_text` to a string only when strictly necessary.
- Append/clear efficiently.

Other parts (like `class PreTrainedTokenizer` and `TokenizerBase`) are left unchanged, as they're not relevant for performance. All docstrings and copyright notices are preserved.

**Summary of optimizations:**

- Compute `set(tokenizer.all_special_tokens)` and `set(tokenizer.get_added_vocab())` once before the loop, **not** inside it.
- Use `current_sub_text.clear()` instead of allocating a new list.
- Store `convert_tokens_to_string` in a local variable to avoid repeated attribute lookups.
- Avoid unnecessary creation of empty sets or dicts.
- Use a tuple `()` instead of a set when `skip_special_tokens=False`, since membership tests will always fail and building a set would be wasted work.

This provides a meaningful speedup for the profiled bottleneck.
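For illustration, here is a minimal sketch of what the optimized function could look like based on the changes listed above. The exact signature (including the `spaces_between_special_tokens` parameter) is assumed from the Hugging Face-style detokenization API rather than taken from this page, so treat it as a sketch, not the merged implementation.

```python
def _convert_tokens_to_string_with_added_encoders(
    tokenizer,
    output_tokens: list[str],
    skip_special_tokens: bool,
    spaces_between_special_tokens: bool,
) -> str:
    sub_texts: list[str] = []
    current_sub_text: list[str] = []

    # Hoisted out of the hot loop: build the added-vocab set once per call
    # instead of calling tokenizer.get_added_vocab() per token.
    added_vocab_set = set(tokenizer.get_added_vocab())
    # Empty tuple when special tokens are kept: membership tests always fail,
    # so there is no need to materialize a set.
    special_tokens = (
        set(tokenizer.all_special_tokens) if skip_special_tokens else ()
    )
    # Localize the bound method to avoid repeated attribute lookups.
    convert_tokens_to_string = tokenizer.convert_tokens_to_string

    for token in output_tokens:
        if token in special_tokens:
            continue
        if token in added_vocab_set:
            if current_sub_text:
                # Flush the run of ordinary tokens in a single conversion call.
                sub_texts.append(convert_tokens_to_string(current_sub_text))
                current_sub_text.clear()
            sub_texts.append(token)
        else:
            current_sub_text.append(token)

    if current_sub_text:
        sub_texts.append(convert_tokens_to_string(current_sub_text))

    if spaces_between_special_tokens:
        return " ".join(sub_texts)
    return "".join(sub_texts)
```

The intent is behavioral parity with the original: added-vocabulary tokens are emitted as-is, and runs of ordinary tokens between them are converted in one call each.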
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run … Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @misrasaurabh1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a substantial performance optimization to the _convert_tokens_to_string_with_added_encoders function, which is a critical component in the detokenization process. The changes primarily focus on reducing redundant attribute lookups and pre-computing frequently accessed tokenizer properties, leading to a significant speedup in incremental decoding.
Highlights
- **Performance Optimization**: The core `_convert_tokens_to_string_with_added_encoders` function has been significantly optimized, achieving a reported 13.7x speedup in detokenization, crucial for incremental decoding.
- **Caching Tokenizer Properties and Methods**: To reduce overhead in the hot loop, the `tokenizer.convert_tokens_to_string` method is now localized to a variable, and the result of `tokenizer.get_added_vocab()` is pre-computed into a set (`added_vocab_set`) before iteration. Additionally, `tokenizer.all_special_tokens` is conditionally converted to a set only when `skip_special_tokens` is true; otherwise an empty tuple is used.
- **List Manipulation Efficiency**: The `current_sub_text` list is now cleared using `current_sub_text.clear()` instead of reassigning an empty list, which can offer minor efficiency gains.
Code Review
This pull request optimizes the _convert_tokens_to_string_with_added_encoders function, significantly improving performance. The changes involve caching vocabulary lookups and localizing method calls, supported by benchmarks and regression tests. A suggestion is provided to enhance code conciseness.
njhill left a comment:
Thanks @misrasaurabh1, looks good to me.
I'll note that this path should no longer be very commonly used anyhow, since it applies only to the "slow" incremental detokenizer case; the majority of tokenizers support fast incremental detokenization (see https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/detokenizer.py).
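For readers following along, a rough proxy for whether a given Hugging Face tokenizer will take the fast path is whether it is a Rust-backed "fast" tokenizer. The snippet below only checks the `is_fast` attribute from `transformers`; vLLM's actual dispatch logic in `vllm/v1/engine/detokenizer.py` is more involved and is not reproduced here.

```python
from transformers import AutoTokenizer

# `is_fast` is True for Rust-backed tokenizers; tokenizers without a fast
# implementation fall back to slower, Python-side detokenization paths.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.is_fast)
```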
@misrasaurabh1 please also sign-off your commits per https://github.com/vllm-project/vllm/pull/20413/checks?check_run_id=45267543355, thanks!
Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Force-pushed from 29705e9 to 70c5640.
Co-authored-by: Nick Hill <nhill@redhat.com> Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Force-pushed from 04d54d1 to 9a72549.
Hi @njhill, I fixed the issue above; this should now be ready to be merged.
Thanks @misrasaurabh1. Please also fix the pre-commit linter errors.
…_with_added_encoders-mcmk4l63
Signed-off-by: Aseem Saxena <aseem.bits@gmail.com>
@njhill we have fixed the pre-commit linter errors, it's ready to merge!
Thanks @misrasaurabh1 @aseembits93
…rs` by 13.7x (vllm-project#20413) Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com> Signed-off-by: Aseem Saxena <aseem.bits@gmail.com> Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Aseem Saxena <aseem.bits@gmail.com> Signed-off-by: Duncan Moss <djm.moss@gmail.com>
…rs` by 13.7x (vllm-project#20413) Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com> Signed-off-by: Aseem Saxena <aseem.bits@gmail.com> Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
…rs` by 13.7x (vllm-project#20413) Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com> Signed-off-by: Aseem Saxena <aseem.bits@gmail.com> Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Aseem Saxena <aseem.bits@gmail.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Essential Elements of an Effective PR Description Checklist
…`supported_models.md` and `examples` for a new model.

Purpose
Incremental decoding can be slow, according to an original comment in this code section. I tried optimizing it with Codeflash (the automated performance-optimization tool I'm building) and found this optimization. I verified it with the benchmark below and generated many tests to ensure correctness.
⏱️ Runtime: 0.28 seconds → 0.019 seconds (per 1000 loops)

The main optimization is to not call `tokenizer.get_added_vocab()` in a loop, as line profiling showed it took a large amount of time. `tokenizer.all_special_tokens` was also conditionally made into a set, only when required. `tokenizer.convert_tokens_to_string` was also localized to reduce function-call overhead in the hot loop.

Test Plan
Benchmarking script I wrote manually to verify the performance gains; the original implementation is the `_slow` function.

I also verified correctness by generating and running the regression tests below. The behavior across the two functions was exactly the same.
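The benchmark script itself is collapsed in the PR and not reproduced on this page. As a rough sketch of how such a comparison could look, assuming the original and optimized implementations are importable as `_convert_slow` and `_convert_fast` (hypothetical names) with the signature used in the sketch above:

```python
import timeit

from transformers import AutoTokenizer

# Hypothetical imports: stand-ins for the original (_slow) and optimized
# implementations. The PR's actual benchmark script is not shown here.
from detok_bench import _convert_slow, _convert_fast  # assumed module/names

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_tokens(["<custom_1>", "<custom_2>"])  # exercise the added-vocab branch

# Build a token sequence that mixes ordinary tokens with an added token.
tokens = (tokenizer.tokenize("Hello world, this is a test.") + ["<custom_1>"]) * 50

for name, fn in [("slow", _convert_slow), ("fast", _convert_fast)]:
    elapsed = timeit.timeit(
        lambda: fn(tokenizer, tokens, skip_special_tokens=True,
                   spaces_between_special_tokens=False),
        number=1000,
    )
    print(f"{name}: {elapsed:.3f} s per 1000 loops")
```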
🌀 Generated Regression Tests Summary
Test Result
Benchmarking Results
✅ Correctness verification report:
(Optional) Documentation Update