-
Notifications
You must be signed in to change notification settings - Fork 3.3k
Optimize EAGLE select_top_k_tokens: use logprobs. #12637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @w32zhong, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly optimizes the EAGLE speculative decoding algorithm by transitioning from direct probability calculations to log probabilities for token selection. This change not only enhances numerical stability by avoiding precision issues associated with multiplying small probabilities but also boosts performance by replacing a costly matrix multiplication operation with a more efficient broadcast addition. The modifications ensure the system adheres more closely to the original EAGLE design principles while improving overall throughput. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable optimization to the EAGLE speculative decoding algorithm by switching from probability multiplication to log-probability addition. This change is expected to improve both performance and numerical stability. The refactoring of the top-k token selection logic into a dedicated method prepare_draft_root also enhances code clarity and maintainability. Additionally, a bug in hidden state indexing for multi-batch scenarios is correctly addressed.
I have two points of feedback on the implementation in select_top_k_tokens. One is a minor comment clarification, and the other is a potential issue in the parent index calculation for the draft token tree which could affect the correctness of the speculative decoding. Please see the detailed comments below.
per Gemini suggestion. Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Motivation
Current EAGLE draft path is ranked by direct joint probability and requires a torch.mul in a hotspot loop
select_top_k_tokens(needs to run multiple times for one draft step):sglang/python/sglang/srt/speculative/spec_utils.py
Lines 458 to 460 in 30b26ee
The original EAGLE code ranks draft paths by log probability and uses a (broadcast) addition operation:
https://github.com/SafeAILab/EAGLE/blob/ee3b040e84b67c212046a8e3c37b31791fecc071/eagle/model/cnets.py#L735-L740
This issue may hurt minor accept rates due to the precision sensitivity of probs multiplication.
In a compute-bound decoding, this may also expose minor speedup potential (e.g., casting float add to using other compute units).(does not see any speed difference after some tests)Modifications
topk_p-->topk_logp)hidden_states.shape[0] > topkrather thanhidden_states.shape[0] > 0.Accuracy Tests
On a RTX 4070 TiS, running against MT-Bench 80 questions, with bf16 and
--disable_cuda_graph --speculative_algorithm EAGLE --mtbench question.jsonl --bs 32 --speculative-num-steps 6 --speculative-eagle-topk 10 --speculative-num-draft-tokens 60and using a small 4B Qwen3 model custom trained (see #11879) in order to fit my GPU with tp=1.Before:
After:
Benchmarking and Profiling
Checklist