
Conversation


@w32zhong commented Nov 4, 2025

Motivation

The current EAGLE draft path is ranked by direct joint probability, which requires a torch.mul inside the hotspot loop select_top_k_tokens (a loop that runs multiple times per draft step):

expand_scores = torch.mul(
    scores.unsqueeze(2), topk_p.reshape(-1, topk, topk)
)  # (b, topk, 1) x (b, topk, topk) -> (b, topk, topk)

The original EAGLE code ranks draft paths by log probability and uses a (broadcast) addition operation:

https://github.com/SafeAILab/EAGLE/blob/ee3b040e84b67c212046a8e3c37b31791fecc071/eagle/model/cnets.py#L735-L740

Ranking by direct probability may slightly hurt accept rates, because multiplying small probabilities is precision-sensitive. In compute-bound decoding, switching the multiplication to an addition may also expose minor speedup potential (e.g., a float add can be scheduled onto other compute units), although I did not observe any speed difference in my tests.
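
To make the precision argument concrete, here is a minimal, framework-free sketch (plain Python floats; the 60-token path at a uniform per-step probability of 1e-6 is a made-up worst case for illustration — real draft paths are much shorter, but low-precision dtypes like bf16 underflow far earlier than float64 does here):

```python
import math

# Hypothetical per-step probabilities along one draft path.
path_probs = [1e-6] * 60

# Ranking by direct joint probability multiplies the terms; the running
# product underflows to exactly 0.0 (even in float64), so competing
# paths become indistinguishable.
joint_prob = 1.0
for p in path_probs:
    joint_prob *= p

# Ranking by log probability sums the terms instead, producing a
# finite, comparable score.
joint_logp = sum(math.log(p) for p in path_probs)

print(joint_prob)  # -> 0.0
print(joint_logp)  # a finite negative score
```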

Modifications

  • mainly in EAGLE select_top_k_tokens
  • minor variable renames (topk_p --> topk_logp)
  • also fixes a minor bug in this line: the multi-batch check should be hidden_states.shape[0] > topk rather than hidden_states.shape[0] > 0
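
For reference, the log-domain scoring step can be sketched without torch as follows (shapes collapsed to a single batch element; the names scores and topk_logp mirror the snippet above, and the probability values are made up for illustration):

```python
import math

# One batch element: `scores` holds cumulative log-probs of the current
# topk draft paths; `topk_logp[i][j]` is the log-prob of the j-th
# candidate continuation of path i (shapes simplified from (b, topk, topk)).
topk = 3
scores = [math.log(p) for p in (0.5, 0.3, 0.2)]
topk_logp = [[math.log(p) for p in row]
             for row in ((0.6, 0.3, 0.1),
                         (0.7, 0.2, 0.1),
                         (0.5, 0.4, 0.1))]

# Log-domain scoring: a broadcast addition replaces torch.mul.
expand_scores = [[scores[i] + topk_logp[i][j] for j in range(topk)]
                 for i in range(topk)]

# Recovering probabilities shows the sum equals the product of probs.
expand_probs = [[math.exp(s) for s in row] for row in expand_scores]
```

Since log is strictly monotonic, ranking by the broadcast sum selects the same top-k paths as ranking by the probability product, without the underflow risk.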

Accuracy Tests

On an RTX 4070 TiS, running MT-Bench (80 questions) in bf16 with --disable_cuda_graph --speculative_algorithm EAGLE --mtbench question.jsonl --bs 32 --speculative-num-steps 6 --speculative-eagle-topk 10 --speculative-num-draft-tokens 60, and using a small custom-trained 4B Qwen3 model (see #11879) so it fits my GPU with tp=1.

Before:

             completion_tokens: 114040
                spec_verify_ct: 36323
                     time_cost: 400.55
      scheduler_avg_accept_len: 3.126
                avg_accept_len: 3.14
                   throughputs: 284.711

After:

             completion_tokens: 112886
                spec_verify_ct: 35882
                     time_cost: 393.8
      scheduler_avg_accept_len: 3.132
                avg_accept_len: 3.146
                   throughputs: 286.658

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @w32zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the EAGLE speculative decoding algorithm by switching token selection from direct probability calculations to log probabilities. The change enhances numerical stability by avoiding the precision issues of multiplying small probabilities, and replaces an elementwise multiplication in a hot loop with a broadcast addition. The modifications also bring the implementation closer to the original EAGLE design while slightly improving overall throughput.

Highlights

  • Algorithm Optimization: The core EAGLE token selection logic in select_top_k_tokens has been refactored to use log probabilities instead of direct joint probabilities. This aligns with the original EAGLE implementation and replaces an elementwise multiplication with a broadcast addition, improving efficiency.
  • Variable Renaming: Throughout the codebase, variables storing probabilities (topk_p) have been consistently renamed to topk_logp to reflect the change to log probabilities.
  • Bug Fix: A minor bug in the select_top_k_tokens function related to the batch size check for hidden_states was corrected, changing hidden_states.shape[0] > 0 to hidden_states.shape[0] > topk for proper batched processing.
  • Code Simplification: The capture_for_decode method in eagle_worker.py was removed, with its functionality integrated into a new prepare_draft_root method within the EagleDraftInput class, streamlining the draft root preparation.
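
The batch-check fix in the list above can be illustrated with a small sketch (the helper name is hypothetical; the real code tests hidden_states.shape[0] inline). During later draft steps, hidden_states carries batch_size * topk rows, so the old check shape[0] > 0 was trivially always true:

```python
def is_multi_batch(num_rows: int, topk: int) -> bool:
    # hidden_states holds batch_size * topk rows during later draft
    # steps; `num_rows > topk` detects batch_size > 1, whereas the old
    # `num_rows > 0` check held for every non-empty batch.
    return num_rows > topk

topk = 10
print(is_multi_batch(1 * topk, topk))  # single batch -> False
print(is_multi_batch(4 * topk, topk))  # batch of 4  -> True
```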

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable optimization to the EAGLE speculative decoding algorithm by switching from probability multiplication to log-probability addition. This change is expected to improve both performance and numerical stability. The refactoring of the top-k token selection logic into a dedicated method prepare_draft_root also enhances code clarity and maintainability. Additionally, a bug in hidden state indexing for multi-batch scenarios is correctly addressed.

I have two points of feedback on the implementation in select_top_k_tokens. One is a minor comment clarification, and the other is a potential issue in the parent index calculation for the draft token tree which could affect the correctness of the speculative decoding. Please see the detailed comments below.

per Gemini suggestion.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@w32zhong changed the title from "Optimize EAGLE select_top_k_tokens: use logprobs, reduce matmul." to "Optimize EAGLE select_top_k_tokens: use logprobs." on Nov 5, 2025