
Conversation


@w32zhong commented Nov 4, 2025

Motivation

The current EAGLE draft path is ranked by direct joint probability, which requires a torch.mul inside the hotspot loop select_top_k_tokens (a loop that runs multiple times per draft step):

expand_scores = torch.mul(
    scores.unsqueeze(2), topk_p.reshape(-1, topk, topk)
)  # (b, topk, 1) x (b, topk, topk) -> (b, topk, topk)

The original EAGLE code ranks draft paths by log probability and uses a (broadcast) addition operation:

https://github.com/SafeAILab/EAGLE/blob/ee3b040e84b67c212046a8e3c37b31791fecc071/eagle/model/cnets.py#L735-L740

Ranking by direct probability may slightly hurt accept rates, because multiplying small probabilities is precision-sensitive. In compute-bound decoding, switching the multiplication to an addition may also expose minor speedup potential (e.g., a float add can be scheduled onto other compute units), although I did not observe any speed difference in my tests.
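
To make the precision argument concrete, here is a minimal, framework-free sketch (plain Python floats; the 60-token path at a uniform per-step probability of 1e-6 is a made-up worst case for illustration — real draft paths are much shorter, but low-precision dtypes like bf16 underflow far earlier than float64 does here):

```python
import math

# Hypothetical per-step probabilities along one draft path.
path_probs = [1e-6] * 60

# Ranking by direct joint probability multiplies the terms; the running
# product underflows to exactly 0.0 (even in float64), so competing
# paths become indistinguishable.
joint_prob = 1.0
for p in path_probs:
    joint_prob *= p

# Ranking by log probability sums the terms instead, producing a
# finite, comparable score.
joint_logp = sum(math.log(p) for p in path_probs)

print(joint_prob)  # -> 0.0
print(joint_logp)  # a finite negative score
```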

Modifications

  • mainly in EAGLE select_top_k_tokens
  • minor variable renames (topk_p --> topk_logp)
  • also fixes a minor bug in this line: the multi-batch check should be hidden_states.shape[0] > topk rather than hidden_states.shape[0] > 0
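
For reference, the log-domain scoring step can be sketched without torch as follows (shapes collapsed to a single batch element; the names scores and topk_logp mirror the snippet above, and the probability values are made up for illustration):

```python
import math

# One batch element: `scores` holds cumulative log-probs of the current
# topk draft paths; `topk_logp[i][j]` is the log-prob of the j-th
# candidate continuation of path i (shapes simplified from (b, topk, topk)).
topk = 3
scores = [math.log(p) for p in (0.5, 0.3, 0.2)]
topk_logp = [[math.log(p) for p in row]
             for row in ((0.6, 0.3, 0.1),
                         (0.7, 0.2, 0.1),
                         (0.5, 0.4, 0.1))]

# Log-domain scoring: a broadcast addition replaces torch.mul.
expand_scores = [[scores[i] + topk_logp[i][j] for j in range(topk)]
                 for i in range(topk)]

# Recovering probabilities shows the sum equals the product of probs.
expand_probs = [[math.exp(s) for s in row] for row in expand_scores]
```

Since log is strictly monotonic, ranking by the broadcast sum selects the same top-k paths as ranking by the probability product, without the underflow risk.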

Accuracy Tests

On an RTX 4070 TiS, running MT-Bench (80 questions) in bf16 with --disable_cuda_graph --speculative_algorithm EAGLE --mtbench question.jsonl --bs 32 --speculative-num-steps 6 --speculative-eagle-topk 10 --speculative-num-draft-tokens 60, and using a small custom-trained 4B Qwen3 model (see #11879) so it fits my GPU with tp=1.

Before:

             completion_tokens: 114040
                spec_verify_ct: 36323
                     time_cost: 400.55
      scheduler_avg_accept_len: 3.126
                avg_accept_len: 3.14
                   throughputs: 284.711

After:

             completion_tokens: 112886
                spec_verify_ct: 35882
                     time_cost: 393.8
      scheduler_avg_accept_len: 3.132
                avg_accept_len: 3.146
                   throughputs: 286.658

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @w32zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the EAGLE speculative decoding algorithm by switching token selection from direct probability calculations to log probabilities. The change enhances numerical stability by avoiding the precision issues of multiplying small probabilities, and replaces an elementwise multiplication in a hot loop with a broadcast addition. The modifications also bring the implementation closer to the original EAGLE design while slightly improving overall throughput.

Highlights

  • Algorithm Optimization: The core EAGLE token selection logic in select_top_k_tokens has been refactored to use log probabilities instead of direct joint probabilities. This aligns with the original EAGLE implementation and replaces an elementwise multiplication with a broadcast addition, improving efficiency.
  • Variable Renaming: Throughout the codebase, variables storing probabilities (topk_p) have been consistently renamed to topk_logp to reflect the change to log probabilities.
  • Bug Fix: A minor bug in the select_top_k_tokens function related to the batch size check for hidden_states was corrected, changing hidden_states.shape[0] > 0 to hidden_states.shape[0] > topk for proper batched processing.
  • Code Simplification: The capture_for_decode method in eagle_worker.py was removed, with its functionality integrated into a new prepare_draft_root method within the EagleDraftInput class, streamlining the draft root preparation.
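
The batch-check fix in the list above can be illustrated with a small sketch (the helper name is hypothetical; the real code tests hidden_states.shape[0] inline). During later draft steps, hidden_states carries batch_size * topk rows, so the old check shape[0] > 0 was trivially always true:

```python
def is_multi_batch(num_rows: int, topk: int) -> bool:
    # hidden_states holds batch_size * topk rows during later draft
    # steps; `num_rows > topk` detects batch_size > 1, whereas the old
    # `num_rows > 0` check held for every non-empty batch.
    return num_rows > topk

topk = 10
print(is_multi_batch(1 * topk, topk))  # single batch -> False
print(is_multi_batch(4 * topk, topk))  # batch of 4  -> True
```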

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable optimization to the EAGLE speculative decoding algorithm by switching from probability multiplication to log-probability addition. This change is expected to improve both performance and numerical stability. The refactoring of the top-k token selection logic into a dedicated method prepare_draft_root also enhances code clarity and maintainability. Additionally, a bug in hidden state indexing for multi-batch scenarios is correctly addressed.

I have two points of feedback on the implementation in select_top_k_tokens. One is a minor comment clarification, and the other is a potential issue in the parent index calculation for the draft token tree which could affect the correctness of the speculative decoding. Please see the detailed comments below.

per Gemini suggestion.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@w32zhong changed the title from "Optimize EAGLE select_top_k_tokens: use logprobs, reduce matmul." to "Optimize EAGLE select_top_k_tokens: use logprobs." on Nov 5, 2025