Conversation

@attack204 attack204 commented Nov 4, 2025

Motivation

  • Implement standaloneV2 and enable it to run the test.send_one test (done, but standaloneV2 currently performs worse than standalone).
  • Performance optimization: investigate the causes of the performance regression and bring standaloneV2 back to parity.
  • Unit testing: add a unit test for standaloneV2.

Tests and Benchmarks

  • Env: 1×H200

  • standaloneV2

export CUDA_VISIBLE_DEVICES=1
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
export SGLANG_ENABLE_SPEC_V2=1
MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
SPEC_MODEL=meta-llama/Llama-3.2-1B-Instruct
PORT=30000
python3 -m sglang.launch_server \
    --dtype float16 \
    --model $MODEL \
    --speculative-algo STANDALONE \
    --speculative-draft-model-path $SPEC_MODEL \
    --attention-backend triton \
    --cuda-graph-bs $(seq -s ' ' 1 8) \
    --trust-remote-code \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 6 \
    --mem-fraction-static 0.5 \
    --port $PORT

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
SPEC_MODEL=meta-llama/Llama-3.2-1B-Instruct
PORT=30000
python3 -m sglang.test.send_one --profile

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    2.893    |  512   |   4.571    |     177.00      |
+-------------+--------+------------+-----------------+

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    3.308    |  512   |   4.571    |     154.77      |
+-------------+--------+------------+-----------------+

  • standalone (v1 baseline)

export CUDA_VISIBLE_DEVICES=1
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
export SGLANG_ENABLE_SPEC_V2=0
MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
SPEC_MODEL=meta-llama/Llama-3.2-1B-Instruct
PORT=30000
python3 -m sglang.launch_server \
    --model $MODEL \
    --speculative-algo STANDALONE \
    --speculative-draft-model-path $SPEC_MODEL \
    --attention-backend triton \
    --trust-remote-code \
    --cuda-graph-bs $(seq -s ' ' 1 8) \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 6 \
    --mem-fraction-static 0.5 \
    --port $PORT

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
SPEC_MODEL=meta-llama/Llama-3.2-1B-Instruct
PORT=30000
python3 -m sglang.test.send_one --profile


+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    2.742    |  512   |   4.876    |     186.71      |
+-------------+--------+------------+-----------------+

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    3.135    |  512   |   3.606    |     163.34      |
+-------------+--------+------------+-----------------+
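
Comparing like-for-like runs, standaloneV2 reaches 177.00 token/s vs 186.71 token/s for standalone on the first run, and 154.77 vs 163.34 token/s on the second, i.e. the v2 path is currently about 5% slower in both cases.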

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@attack204 changed the title from issue#11762: add standalone_worker_v2 to feat(SpecEagleV2): add standalone_worker_v2 on Nov 4, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @attack204, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces standalone_worker_v2, a new iteration of the standalone speculative decoding algorithm, designed to integrate with the system's overlap utility features. The changes involve extending existing EAGLE algorithm logic to support standaloneV2 across various core components, including overlap utilities, batch scheduling, and output processing. A dedicated draft worker is also implemented for this new version. While the new implementation is functional and passes basic tests, initial performance benchmarks show a regression compared to the previous version, indicating that further optimization efforts are required.

Highlights

  • New Speculative Algorithm Implementation: A new version of the 'standalone' speculative decoding algorithm, standalone_worker_v2, has been introduced. This implementation largely mirrors the structure of EAGLEWorkerV2 and aims to integrate with the system's overlap utility.
  • Overlap Utility Integration: The standalone_worker_v2 algorithm is now integrated into existing overlap utility functions (_lazy_init_buf, resolve_future, store_to_map), allowing it to leverage shared buffer initialization, future index resolution, and result storage mechanisms previously used by the EAGLE algorithm.
  • Scheduler and Output Processing Alignment: Updates to the scheduler and output processing logic ensure that standalone_worker_v2 is handled consistently with EAGLEV2 for tasks such as decode preparation, verification synchronization, batch execution, and token output resolution.
  • Dedicated Draft Worker: A StandaloneDraftWorker has been added: a custom EagleDraftWorker designed to operate with its own embeddings and language model head rather than sharing them with the target model. This allows more independent draft-model operation within the standalone_worker_v2 framework (a sketch follows this list).
  • Initial Performance Observations: Initial benchmarks indicate that standalone_worker_v2 currently exhibits worse performance compared to the original standalone implementation, with plans for future performance optimization and unit testing.
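
To make the dedicated-draft-worker idea concrete, a minimal sketch of the override is shown below. The class names and the init_lm_head hook come from this PR (see the review comments further down); the import path and the empty method body are assumptions for illustration, not the PR's actual code.

from sglang.srt.speculative.eagle_worker_v2 import EagleDraftWorker  # assumed import path

class StandaloneDraftWorker(EagleDraftWorker):
    """Draft worker for STANDALONE: a full, independent draft model that
    keeps its own embeddings and LM head (EAGLE shares them with the
    target model)."""

    def init_lm_head(self):
        # Unlike EagleDraftWorker.init_lm_head(), do not alias the target
        # model's embed_tokens/lm_head onto the draft model: a standalone
        # draft checkpoint ships its own weights, so nothing is shared.
        pass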



@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces standalone_worker_v2 to support the standalone speculative decoding algorithm with overlap scheduling (SpecEagleV2). The changes primarily involve extending existing logic for the 'EAGLE' algorithm to also cover the 'STANDALONE' algorithm across various manager and utility files. A new file standalone_worker_v2.py is added, which defines the v2 worker.

The implementation looks solid and follows the existing patterns. My main feedback is around improving maintainability by reducing code duplication. I've pointed out a couple of repeated conditional checks that could be refactored into helper properties or methods. Additionally, there's a FIXME in the new worker related to pipeline parallelism that should be addressed to ensure full feature compatibility.


     def resolve_future(self, model_worker_batch: ModelWorkerBatch):
-        if self.spec_algo.is_eagle():
+        if self.spec_algo.is_eagle() or self.spec_algo.is_standalone():

medium

The condition self.spec_algo.is_eagle() or self.spec_algo.is_standalone() is repeated in this file and also in scheduler_output_processor_mixin.py. To improve maintainability, consider adding a helper method to the SpeculativeAlgorithm class in spec_info.py. For example:

def is_eagle_or_standalone(self):
    return self.is_eagle() or self.is_standalone()

This would centralize the logic and make the code cleaner.
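
As a self-contained illustration of how such a helper centralizes the check (a runnable sketch; the real SpeculativeAlgorithm in spec_info.py has more members and methods than shown here):

from enum import IntEnum, auto

class SpeculativeAlgorithm(IntEnum):  # simplified stand-in for spec_info.py
    NONE = auto()
    EAGLE = auto()
    STANDALONE = auto()

    def is_eagle(self):
        return self == SpeculativeAlgorithm.EAGLE

    def is_standalone(self):
        return self == SpeculativeAlgorithm.STANDALONE

    def is_eagle_or_standalone(self):
        # Single place to extend if more algorithms join the V2 path later.
        return self.is_eagle() or self.is_standalone()

algo = SpeculativeAlgorithm.STANDALONE
assert algo.is_eagle_or_standalone()  # call sites collapse to one check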

         bs = len(self.reqs)

-        if self.is_v2_eagle:
+        if self.is_v2_eagle or self.is_v2_standalone:

medium

The condition self.is_v2_eagle or self.is_v2_standalone is repeated in several places (here, maybe_wait_verify_done, and in scheduler.py and scheduler_output_processor_mixin.py). To improve maintainability and reduce code duplication, consider adding a new property to the ScheduleBatch class that encapsulates this logic. For example:

@property
def is_v2_spec(self):
    return self.is_v2_eagle or self.is_v2_standalone

Then you can simplify this condition to if self.is_v2_spec:. This would make the code cleaner and easier to modify in the future if more v2 speculative algorithms are added.

                 server_args=server_args,
                 gpu_id=gpu_id,
                 tp_rank=tp_rank,
                 pp_rank=0,  # FIXME

medium

The hardcoded pp_rank=0 with a FIXME comment indicates that this worker may not function correctly with pipeline parallelism. This should be resolved to properly support pipeline parallel setups.
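
One possible direction, sketched as a diff in the style of the other suggestions here (self.pp_rank is an assumed attribute that would have to be threaded through the worker's constructor from the scheduler; it does not exist in the PR as written):

-                pp_rank=0,  # FIXME
+                pp_rank=self.pp_rank,  # assumed: forwarded from the caller rather than pinned to 0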

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces standalone_worker_v2 to implement a new speculative decoding algorithm. The changes are mostly about integrating this new algorithm into the existing logic paths. The implementation is functional, but there are several opportunities to improve maintainability by reducing code duplication and repeated conditional logic. My review focuses on refactoring these areas for better code clarity and easier future extensions.

Comment on lines +35 to +112

    def __init__(
        self,
        server_args: ServerArgs,
        gpu_id: int,
        tp_rank: int,
        dp_rank: int,
        moe_ep_rank: int,
        nccl_port: int,
        target_worker: TpModelWorker,
    ):
        # copy args
        self.server_args = server_args
        self.gpu_id = gpu_id
        self.tp_rank = tp_rank
        self.dp_rank = dp_rank
        self.moe_ep_rank = moe_ep_rank
        self.nccl_port = nccl_port
        self.target_worker = target_worker

        # Args for easy access
        self.device = server_args.device
        self.topk = server_args.speculative_eagle_topk
        self.speculative_num_steps = server_args.speculative_num_steps
        self.speculative_num_draft_tokens = server_args.speculative_num_draft_tokens
        self.speculative_algorithm = SpeculativeAlgorithm.from_string(
            server_args.speculative_algorithm
        )

        # Set constant
        from sglang.srt.speculative.eagle_info import EagleDraftInput
        EagleDraftInput.ALLOC_LEN_PER_DECODE = max(
            self.speculative_num_steps * self.topk, self.speculative_num_draft_tokens
        )

        # Do not capture cuda graph in `TpModelWorker` init,
        # will capture later with init_cuda_graphs()
        backup_disable_cuda_graph = server_args.disable_cuda_graph
        server_args.disable_cuda_graph = True

        # Share the allocator with a target worker.
        # Draft and target worker own their own KV cache pools.
        self.req_to_token_pool, self.token_to_kv_pool_allocator = (
            target_worker.get_memory_pool()
        )
        with empty_context():
            # Init draft worker
            self.draft_worker = TpModelWorker(
                server_args=server_args,
                gpu_id=gpu_id,
                tp_rank=tp_rank,
                pp_rank=0,  # FIXME
                dp_rank=dp_rank,
                moe_ep_rank=moe_ep_rank,
                nccl_port=nccl_port,
                is_draft_worker=True,
                req_to_token_pool=self.req_to_token_pool,
                token_to_kv_pool_allocator=self.token_to_kv_pool_allocator,
            )

        # Alias for better readability
        self.draft_runner = self.draft_worker.model_runner

        self.init_token_map()
        self.init_lm_head()

        # Init attention backend and cuda graphs
        self.draft_runner.server_args.disable_cuda_graph = backup_disable_cuda_graph
        self.draft_tp_context = (
            draft_tp_context if server_args.enable_dp_attention else empty_context
        )
        with self.draft_tp_context(self.draft_runner.tp_group):
            self.init_attention_backend()
            self.init_cuda_graphs()

        from sglang.srt.speculative.eagle_utils import TreeMaskMode
        self.tree_mask_mode = TreeMaskMode.FULL_MASK

        self.plan_stream, self.plan_stream_ctx = _get_plan_stream(self.device)

high

The __init__ method of StandaloneDraftWorker is an exact copy of EagleDraftWorker.__init__. This code duplication can be avoided by removing the __init__ method from StandaloneDraftWorker and letting it inherit from EagleDraftWorker. Since EagleDraftWorker.__init__ calls self.init_lm_head(), your override of init_lm_head will be correctly used, achieving the same goal with much cleaner code.
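
With that change, the whole class body reduces to the init_lm_head override. A sketch, on the review's stated assumption that nothing else differs:

class StandaloneDraftWorker(EagleDraftWorker):
    # Inherits EagleDraftWorker.__init__ unchanged; the parent calls
    # self.init_lm_head() during construction, so this override is
    # picked up automatically.
    def init_lm_head(self):
        ...  # standalone-specific embedding/LM-head setup from the PR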


     def _lazy_init_buf(self, draft_input: EagleDraftInput):
-        if self.buf_initialized or not self.spec_algo.is_eagle():
+        if self.buf_initialized or (not self.spec_algo.is_eagle() and not self.spec_algo.is_standalone()):

medium

With the suggested is_v2_algo method in SpeculativeAlgorithm, this condition can be simplified.

Suggested change
-        if self.buf_initialized or (not self.spec_algo.is_eagle() and not self.spec_algo.is_standalone()):
+        if self.buf_initialized or not self.spec_algo.is_v2_algo():


     def resolve_future(self, model_worker_batch: ModelWorkerBatch):
-        if self.spec_algo.is_eagle():
+        if self.spec_algo.is_eagle() or self.spec_algo.is_standalone():

medium

With the suggested is_v2_algo method in SpeculativeAlgorithm, this condition can be simplified.

Suggested change
-        if self.spec_algo.is_eagle() or self.spec_algo.is_standalone():
+        if self.spec_algo.is_v2_algo():

     ):
         intv = future_indices.interval
-        if self.spec_algo.is_eagle():
+        if self.spec_algo.is_eagle() or self.spec_algo.is_standalone():

medium

With the suggested is_v2_algo method in SpeculativeAlgorithm, this condition can be simplified.

Suggested change
-        if self.spec_algo.is_eagle() or self.spec_algo.is_standalone():
+        if self.spec_algo.is_v2_algo():

+    @property
+    def is_v2_standalone(self):
+        return self.enable_overlap and self.spec_algorithm.is_standalone()


medium

Following the suggestion to add is_v2_algo to SpeculativeAlgorithm, you can introduce a corresponding is_v2_spec property here. This will centralize the logic for checking v2 speculative algorithms and simplify conditions in this and other files.

Suggested change
+    @property
+    def is_v2_spec(self):
+        return self.enable_overlap and self.spec_algorithm.is_v2_algo()

         future_indices_or_next_token_ids = -future_indices.indices

-        if batch.is_v2_eagle:
+        if batch.is_v2_eagle or batch.is_v2_standalone:

medium

With the new is_v2_spec property on ScheduleBatch, this can be simplified.

Suggested change
-        if batch.is_v2_eagle or batch.is_v2_standalone:
+        if batch.is_v2_spec:

         if batch.return_logprob:
             next_token_logprobs = logits_output.next_token_logprobs.tolist()
-        elif batch.is_v2_eagle:
+        elif batch.is_v2_eagle or batch.is_v2_standalone:

medium

With the new is_v2_spec property on ScheduleBatch, this can be simplified.

Suggested change
-        elif batch.is_v2_eagle or batch.is_v2_standalone:
+        elif batch.is_v2_spec:

         if self.enable_overlap and (req.finished() or req.is_retracted):
             indices_to_free = None
-            if batch.spec_algorithm.is_eagle():
+            if batch.spec_algorithm.is_eagle() or batch.spec_algorithm.is_standalone():

medium

With the suggested is_v2_algo method in SpeculativeAlgorithm, this condition can be simplified.

Suggested change
-            if batch.spec_algorithm.is_eagle() or batch.spec_algorithm.is_standalone():
+            if batch.spec_algorithm.is_v2_algo():

         if batch.spec_algorithm.is_none():
             req.output_ids.append(next_token_id)
-        elif batch.is_v2_eagle:
+        elif batch.is_v2_eagle or batch.is_v2_standalone:

medium

With the new is_v2_spec property on ScheduleBatch, this can be simplified.

Suggested change
-        elif batch.is_v2_eagle or batch.is_v2_standalone:
+        elif batch.is_v2_spec:


         if req.finished():
-            if batch.is_v2_eagle and self.cur_batch.forward_mode.is_extend():
+            if (batch.is_v2_eagle or batch.is_v2_standalone) and self.cur_batch.forward_mode.is_extend():

medium

With the new is_v2_spec property on ScheduleBatch, this can be simplified.

Suggested change
-            if (batch.is_v2_eagle or batch.is_v2_standalone) and self.cur_batch.forward_mode.is_extend():
+            if batch.is_v2_spec and self.cur_batch.forward_mode.is_extend():

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces standalone_worker_v2 to enable speculative decoding with standalone draft models in an overlapped execution mode. The changes primarily involve extending existing logic for the 'EAGLE' algorithm to also support the new 'STANDALONE' algorithm. The main implementation resides in the new standalone_worker_v2.py file.

My review focuses on improving code maintainability by reducing duplication. I've identified several areas where repeated conditional logic can be refactored into helper properties or methods. Specifically, I've suggested creating a new property in schedule_batch.py to simplify checks for V2 speculative overlap. In the new standalone_worker_v2.py, I've recommended using super().__init__ to reduce significant code duplication in the __init__ methods of StandaloneDraftWorker and StandaloneWorkerV2. Additionally, I've included a minor style suggestion in server_args.py to improve readability. These changes will make the codebase cleaner and easier to maintain.

Comment on lines +1570 to +1577

     def is_v2_standalone(self):
         return self.enable_overlap and self.spec_algorithm.is_standalone()

     def prepare_for_decode(self):
         self.forward_mode = ForwardMode.DECODE
         bs = len(self.reqs)

-        if self.is_v2_eagle:
+        if self.is_v2_eagle or self.is_v2_standalone:

medium

To improve maintainability and reduce code duplication, you can introduce a new property is_v2_speculative_overlap that combines the checks for is_v2_eagle and is_v2_standalone. This new property can then be used here and in other places like maybe_wait_verify_done in this file, as well as in scheduler_output_processor_mixin.py, to simplify the conditional logic.

Suggested change
     def is_v2_standalone(self):
         return self.enable_overlap and self.spec_algorithm.is_standalone()

+    @property
+    def is_v2_speculative_overlap(self):
+        return self.is_v2_eagle or self.is_v2_standalone
+
     def prepare_for_decode(self):
         self.forward_mode = ForwardMode.DECODE
         bs = len(self.reqs)

-        if self.is_v2_eagle:
+        if self.is_v2_speculative_overlap:

Comment on lines +1511 to +1512

                 (self.speculative_algorithm == "EAGLE"
                  or self.speculative_algorithm == "STANDALONE")

medium

For better readability and conciseness, you can use the in operator with a set for this check.

                self.speculative_algorithm in {"EAGLE", "STANDALONE"}

Comment on lines +35 to +112

(Same StandaloneDraftWorker.__init__ as quoted in the previous review comment above.)

medium

The __init__ method of StandaloneDraftWorker is almost a complete copy of EagleDraftWorker.__init__. To avoid this large code duplication and improve maintainability, you can call super().__init__ and rely on the parent class's implementation. The only functional difference, which is not sharing the LM head, is already correctly handled by overriding init_lm_head.

Suggested change

    def __init__(
        self,
        server_args: ServerArgs,
        gpu_id: int,
        tp_rank: int,
        dp_rank: int,
        moe_ep_rank: int,
        nccl_port: int,
        target_worker: TpModelWorker,
    ):
        super().__init__(
            server_args,
            gpu_id,
            tp_rank,
            dp_rank,
            moe_ep_rank,
            nccl_port,
            target_worker,
        )

Comment on lines +123 to +165

    def __init__(
        self,
        server_args: ServerArgs,
        gpu_id: int,
        tp_rank: int,
        dp_rank: Optional[int],
        moe_ep_rank: int,
        nccl_port: int,
        target_worker: TpModelWorker,
    ):
        # Parse arguments
        self.server_args = server_args
        self.topk = server_args.speculative_eagle_topk
        self.speculative_num_steps = server_args.speculative_num_steps
        self.speculative_num_draft_tokens = server_args.speculative_num_draft_tokens
        self.enable_nan_detection = server_args.enable_nan_detection
        self.gpu_id = gpu_id
        self.device = server_args.device
        self._target_worker = target_worker
        self.page_size = server_args.page_size
        self.speculative_algorithm = SpeculativeAlgorithm.from_string(
            server_args.speculative_algorithm
        )

        self.req_to_token_pool, self.token_to_kv_pool_allocator = (
            target_worker.get_memory_pool()
        )

        # Override the context length of the draft model to be the same as the target model.
        server_args.context_length = target_worker.model_runner.model_config.context_len

        # Create our custom draft worker that doesn't share embeddings/lm_head
        self._draft_worker = StandaloneDraftWorker(
            server_args, gpu_id, tp_rank, dp_rank, moe_ep_rank, nccl_port, target_worker
        )

        # Some dummy tensors
        self.num_new_pages_per_topk = torch.empty(
            (), dtype=torch.int64, device=self.device
        )
        self.extend_lens = torch.empty((), dtype=torch.int64, device=self.device)

        self.plan_stream, self.plan_stream_ctx = _get_plan_stream(self.device)

medium

The __init__ method of StandaloneWorkerV2 is a near-exact copy of EAGLEWorkerV2.__init__, with the only difference being the instantiation of StandaloneDraftWorker instead of EagleDraftWorker. To reduce code duplication and make the code more maintainable, you can call super().__init__ and then just override the self._draft_worker attribute.

    def __init__(
        self,
        server_args: ServerArgs,
        gpu_id: int,
        tp_rank: int,
        dp_rank: Optional[int],
        moe_ep_rank: int,
        nccl_port: int,
        target_worker: TpModelWorker,
    ):
        super().__init__(
            server_args,
            gpu_id,
            tp_rank,
            dp_rank,
            moe_ep_rank,
            nccl_port,
            target_worker,
        )
        # Create our custom draft worker that doesn't share embeddings/lm_head
        self._draft_worker = StandaloneDraftWorker(
            server_args, gpu_id, tp_rank, dp_rank, moe_ep_rank, nccl_port, target_worker
        )

@attack204 changed the title from feat(SpecEagleV2): add standalone_worker_v2 to feat(SpecEagleV2): add standalone_worker_v2(WIP) on Nov 4, 2025