feat(SpecEagleV2): add standalone_worker_v2(WIP) #12625
base: main
Conversation
Summary of Changes: Hello @attack204, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new, more modular version of the 'standalone' speculative decoding worker.
Code Review
This pull request introduces `standalone_worker_v2` to support the standalone speculative decoding algorithm with overlap scheduling (SpecEagleV2). The changes primarily involve extending existing logic for the 'EAGLE' algorithm to also cover the 'STANDALONE' algorithm across various manager and utility files. A new file, `standalone_worker_v2.py`, is added, which defines the v2 worker.
The implementation looks solid and follows the existing patterns. My main feedback is around improving maintainability by reducing code duplication. I've pointed out a couple of repeated conditional checks that could be refactored into helper properties or methods. Additionally, there's a `FIXME` in the new worker related to pipeline parallelism that should be addressed to ensure full feature compatibility.
```diff
 def resolve_future(self, model_worker_batch: ModelWorkerBatch):
-    if self.spec_algo.is_eagle():
+    if self.spec_algo.is_eagle() or self.spec_algo.is_standalone():
```
The condition `self.spec_algo.is_eagle() or self.spec_algo.is_standalone()` is repeated in this file and also in scheduler_output_processor_mixin.py. To improve maintainability, consider adding a helper method to the `SpeculativeAlgorithm` class in spec_info.py. For example:

```python
def is_eagle_or_standalone(self):
    return self.is_eagle() or self.is_standalone()
```

This would centralize the logic and make the code cleaner.
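For context, a minimal self-contained sketch of how the helper could slot into an enum-style `SpeculativeAlgorithm`; the actual class in spec_info.py (its members and `from_string` parsing) may differ:

```python
from enum import IntEnum, auto

class SpeculativeAlgorithm(IntEnum):
    # Hypothetical members; the real enum in spec_info.py may define more.
    NONE = auto()
    EAGLE = auto()
    STANDALONE = auto()

    def is_eagle(self) -> bool:
        return self == SpeculativeAlgorithm.EAGLE

    def is_standalone(self) -> bool:
        return self == SpeculativeAlgorithm.STANDALONE

    def is_eagle_or_standalone(self) -> bool:
        # Single source of truth for "EAGLE-style draft/verify" checks.
        return self.is_eagle() or self.is_standalone()
```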
```diff
     bs = len(self.reqs)

-    if self.is_v2_eagle:
+    if self.is_v2_eagle or self.is_v2_standalone:
```
The condition `self.is_v2_eagle or self.is_v2_standalone` is repeated in several places (here, `maybe_wait_verify_done`, and in scheduler.py and scheduler_output_processor_mixin.py). To improve maintainability and reduce code duplication, consider adding a new property to the `ScheduleBatch` class that encapsulates this logic. For example:

```python
@property
def is_v2_spec(self):
    return self.is_v2_eagle or self.is_v2_standalone
```

Then you can simplify this condition to `if self.is_v2_spec:`. This would make the code cleaner and easier to modify in the future if more v2 speculative algorithms are added.
```python
server_args=server_args,
gpu_id=gpu_id,
tp_rank=tp_rank,
pp_rank=0,  # FIXME
```
Code Review
This pull request introduces `standalone_worker_v2` to implement a new speculative decoding algorithm. The changes are mostly about integrating this new algorithm into the existing logic paths. The implementation is functional, but there are several opportunities to improve maintainability by reducing code duplication and repeated conditional logic. My review focuses on refactoring these areas for better code clarity and easier future extensions.
```python
def __init__(
    self,
    server_args: ServerArgs,
    gpu_id: int,
    tp_rank: int,
    dp_rank: int,
    moe_ep_rank: int,
    nccl_port: int,
    target_worker: TpModelWorker,
):
    # copy args
    self.server_args = server_args
    self.gpu_id = gpu_id
    self.tp_rank = tp_rank
    self.dp_rank = dp_rank
    self.moe_ep_rank = moe_ep_rank
    self.nccl_port = nccl_port
    self.target_worker = target_worker

    # Args for easy access
    self.device = server_args.device
    self.topk = server_args.speculative_eagle_topk
    self.speculative_num_steps = server_args.speculative_num_steps
    self.speculative_num_draft_tokens = server_args.speculative_num_draft_tokens
    self.speculative_algorithm = SpeculativeAlgorithm.from_string(
        server_args.speculative_algorithm
    )

    # Set constant
    from sglang.srt.speculative.eagle_info import EagleDraftInput

    EagleDraftInput.ALLOC_LEN_PER_DECODE = max(
        self.speculative_num_steps * self.topk, self.speculative_num_draft_tokens
    )

    # Do not capture cuda graph in `TpModelWorker` init,
    # will capture later with init_cuda_graphs()
    backup_disable_cuda_graph = server_args.disable_cuda_graph
    server_args.disable_cuda_graph = True

    # Share the allocator with a target worker.
    # Draft and target worker own their own KV cache pools.
    self.req_to_token_pool, self.token_to_kv_pool_allocator = (
        target_worker.get_memory_pool()
    )
    with empty_context():
        # Init draft worker
        self.draft_worker = TpModelWorker(
            server_args=server_args,
            gpu_id=gpu_id,
            tp_rank=tp_rank,
            pp_rank=0,  # FIXME
            dp_rank=dp_rank,
            moe_ep_rank=moe_ep_rank,
            nccl_port=nccl_port,
            is_draft_worker=True,
            req_to_token_pool=self.req_to_token_pool,
            token_to_kv_pool_allocator=self.token_to_kv_pool_allocator,
        )

    # Alias for better readability
    self.draft_runner = self.draft_worker.model_runner

    self.init_token_map()
    self.init_lm_head()

    # Init attention backend and cuda graphs
    self.draft_runner.server_args.disable_cuda_graph = backup_disable_cuda_graph
    self.draft_tp_context = (
        draft_tp_context if server_args.enable_dp_attention else empty_context
    )
    with self.draft_tp_context(self.draft_runner.tp_group):
        self.init_attention_backend()
        self.init_cuda_graphs()

    from sglang.srt.speculative.eagle_utils import TreeMaskMode

    self.tree_mask_mode = TreeMaskMode.FULL_MASK

    self.plan_stream, self.plan_stream_ctx = _get_plan_stream(self.device)
```
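One detail in this hunk worth a sanity check: `ALLOC_LEN_PER_DECODE` is sized as the larger of the draft expansion (`speculative_num_steps * topk` candidate tokens) and the verification budget (`speculative_num_draft_tokens`). That reading is an interpretation of the code above, not something stated in the PR; a quick worked example:

```python
# Worked example with hypothetical settings: 3 draft steps, top-4 branching,
# 8 draft tokens kept for verification.
speculative_num_steps, topk, speculative_num_draft_tokens = 3, 4, 8

# Each decode slot must hold whichever phase needs more room.
alloc_len_per_decode = max(
    speculative_num_steps * topk,  # 12 candidate tokens during drafting
    speculative_num_draft_tokens,  # 8 tokens during verification
)
assert alloc_len_per_decode == 12
```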
The `__init__` method of `StandaloneDraftWorker` is an exact copy of `EagleDraftWorker.__init__`. This code duplication can be avoided by removing the `__init__` method from `StandaloneDraftWorker` and letting it inherit from `EagleDraftWorker`. Since `EagleDraftWorker.__init__` calls `self.init_lm_head()`, your override of `init_lm_head` will be correctly used, achieving the same goal with much cleaner code.
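A minimal sketch of that refactor, assuming (per the comment above) that `init_lm_head` is the only behavioral difference; `EagleDraftWorker` here is the existing class from this PR's module:

```python
class StandaloneDraftWorker(EagleDraftWorker):
    # No __init__ override: EagleDraftWorker.__init__ runs unchanged, and
    # its call to self.init_lm_head() dispatches to the override below.

    def init_lm_head(self):
        # Hypothetical body: a standalone draft model loads its own
        # embeddings/LM head rather than sharing the target model's.
        ...
```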
```diff
 def _lazy_init_buf(self, draft_input: EagleDraftInput):
-    if self.buf_initialized or not self.spec_algo.is_eagle():
+    if self.buf_initialized or (not self.spec_algo.is_eagle() and not self.spec_algo.is_standalone()):
```
```diff
 def resolve_future(self, model_worker_batch: ModelWorkerBatch):
-    if self.spec_algo.is_eagle():
+    if self.spec_algo.is_eagle() or self.spec_algo.is_standalone():
```
```diff
 ):
     intv = future_indices.interval
-    if self.spec_algo.is_eagle():
+    if self.spec_algo.is_eagle() or self.spec_algo.is_standalone():
```
```python
@property
def is_v2_standalone(self):
    return self.enable_overlap and self.spec_algorithm.is_standalone()
```
Following the suggestion to add `is_v2_algo` to `SpeculativeAlgorithm`, you can introduce a corresponding `is_v2_spec` property here. This will centralize the logic for checking v2 speculative algorithms and simplify conditions in this and other files.

```python
@property
def is_v2_spec(self):
    return self.enable_overlap and self.spec_algorithm.is_v2_algo()
```
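The `is_v2_algo` helper referenced above is not shown in this thread; presumably it mirrors the earlier `is_eagle_or_standalone` suggestion, something like:

```python
# Hypothetical definition on SpeculativeAlgorithm (the name comes from the
# review comment; the actual suggestion is not visible in this capture).
def is_v2_algo(self):
    return self.is_eagle() or self.is_standalone()
```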
```diff
 future_indices_or_next_token_ids = -future_indices.indices

-if batch.is_v2_eagle:
+if batch.is_v2_eagle or batch.is_v2_standalone:
```
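As background on the `-future_indices.indices` pattern in this hunk: under overlap scheduling, next-token ids may not be known yet when the following batch is assembled, so negated buffer indices stand in as placeholders and are resolved once sampling completes. A minimal sketch of the idea; the buffer, `store_future`, and the free-function `resolve_future` below are simplified stand-ins, not sglang's real `FutureIndices` machinery:

```python
import torch

# Hypothetical ring buffer of resolved token ids (slot 0 left unused so
# that negated indices stay distinguishable from real token ids >= 0).
token_buf = torch.zeros(4096, dtype=torch.int64)

def store_future(interval: slice, sampled: torch.Tensor) -> torch.Tensor:
    # Record sampled tokens and hand out negated indices as placeholders.
    token_buf[interval] = sampled
    return -torch.arange(interval.start, interval.stop, dtype=torch.int64)

def resolve_future(input_ids: torch.Tensor) -> torch.Tensor:
    # Replace placeholder (negative) ids with the tokens they point to.
    mask = input_ids < 0
    input_ids[mask] = token_buf[-input_ids[mask]]
    return input_ids

# Usage: tokens sampled for batch N become placeholders consumed by batch N+1.
ids = store_future(slice(1, 3), torch.tensor([101, 102]))
assert resolve_future(ids.clone()).tolist() == [101, 102]
```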
```diff
 if batch.return_logprob:
     next_token_logprobs = logits_output.next_token_logprobs.tolist()
-elif batch.is_v2_eagle:
+elif batch.is_v2_eagle or batch.is_v2_standalone:
```
```diff
 if self.enable_overlap and (req.finished() or req.is_retracted):
     indices_to_free = None
-    if batch.spec_algorithm.is_eagle():
+    if batch.spec_algorithm.is_eagle() or batch.spec_algorithm.is_standalone():
```
```diff
 if batch.spec_algorithm.is_none():
     req.output_ids.append(next_token_id)
-elif batch.is_v2_eagle:
+elif batch.is_v2_eagle or batch.is_v2_standalone:
```
```diff
 if req.finished():
-    if batch.is_v2_eagle and self.cur_batch.forward_mode.is_extend():
+    if (batch.is_v2_eagle or batch.is_v2_standalone) and self.cur_batch.forward_mode.is_extend():
```
Code Review
This pull request introduces `standalone_worker_v2` to enable speculative decoding with standalone draft models in an overlapped execution mode. The changes primarily involve extending existing logic for the 'EAGLE' algorithm to also support the new 'STANDALONE' algorithm. The main implementation resides in the new `standalone_worker_v2.py` file.
My review focuses on improving code maintainability by reducing duplication. I've identified several areas where repeated conditional logic can be refactored into helper properties or methods. Specifically, I've suggested creating a new property in schedule_batch.py to simplify checks for V2 speculative overlap. In the new `standalone_worker_v2.py`, I've recommended using `super().__init__` to reduce significant code duplication in the `__init__` methods of `StandaloneDraftWorker` and `StandaloneWorkerV2`. Additionally, I've included a minor style suggestion in server_args.py to improve readability. These changes will make the codebase cleaner and easier to maintain.
```diff
 def is_v2_standalone(self):
     return self.enable_overlap and self.spec_algorithm.is_standalone()

 def prepare_for_decode(self):
     self.forward_mode = ForwardMode.DECODE
     bs = len(self.reqs)

-    if self.is_v2_eagle:
+    if self.is_v2_eagle or self.is_v2_standalone:
```
To improve maintainability and reduce code duplication, you can introduce a new property `is_v2_speculative_overlap` that combines the checks for `is_v2_eagle` and `is_v2_standalone`. This new property can then be used here and in other places like `maybe_wait_verify_done` in this file, as well as in scheduler_output_processor_mixin.py, to simplify the conditional logic.
Suggested change:

```diff
 def is_v2_standalone(self):
     return self.enable_overlap and self.spec_algorithm.is_standalone()

+@property
+def is_v2_speculative_overlap(self):
+    return self.is_v2_eagle or self.is_v2_standalone
+
 def prepare_for_decode(self):
     self.forward_mode = ForwardMode.DECODE
     bs = len(self.reqs)

-    if self.is_v2_eagle or self.is_v2_standalone:
+    if self.is_v2_speculative_overlap:
```
```python
(self.speculative_algorithm == "EAGLE"
 or self.speculative_algorithm == "STANDALONE")
```
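The body of the style comment on this hunk did not survive the page capture. A common readability tweak for chained equality checks like this (my assumption about the suggestion, not its confirmed content) is a membership test:

```python
def _is_eagle_style(speculative_algorithm: str) -> bool:
    # Hypothetical helper; equivalent to the chained == checks above.
    return speculative_algorithm in ("EAGLE", "STANDALONE")
```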
(The same `StandaloneDraftWorker.__init__` hunk shown earlier is quoted again for this comment.)
The `__init__` method of `StandaloneDraftWorker` is almost a complete copy of `EagleDraftWorker.__init__`. To avoid this large code duplication and improve maintainability, you can call `super().__init__` and rely on the parent class's implementation. The only functional difference (not sharing the LM head) is already correctly handled by overriding `init_lm_head`.
Suggested change (replacing the duplicated `__init__` body with a call to the parent implementation):

```python
def __init__(
    self,
    server_args: ServerArgs,
    gpu_id: int,
    tp_rank: int,
    dp_rank: int,
    moe_ep_rank: int,
    nccl_port: int,
    target_worker: TpModelWorker,
):
    super().__init__(
        server_args,
        gpu_id,
        tp_rank,
        dp_rank,
        moe_ep_rank,
        nccl_port,
        target_worker,
    )
```
```python
def __init__(
    self,
    server_args: ServerArgs,
    gpu_id: int,
    tp_rank: int,
    dp_rank: Optional[int],
    moe_ep_rank: int,
    nccl_port: int,
    target_worker: TpModelWorker,
):
    # Parse arguments
    self.server_args = server_args
    self.topk = server_args.speculative_eagle_topk
    self.speculative_num_steps = server_args.speculative_num_steps
    self.speculative_num_draft_tokens = server_args.speculative_num_draft_tokens
    self.enable_nan_detection = server_args.enable_nan_detection
    self.gpu_id = gpu_id
    self.device = server_args.device
    self._target_worker = target_worker
    self.page_size = server_args.page_size
    self.speculative_algorithm = SpeculativeAlgorithm.from_string(
        server_args.speculative_algorithm
    )

    self.req_to_token_pool, self.token_to_kv_pool_allocator = (
        target_worker.get_memory_pool()
    )

    # Override the context length of the draft model to be the same as the target model.
    server_args.context_length = target_worker.model_runner.model_config.context_len

    # Create our custom draft worker that doesn't share embeddings/lm_head
    self._draft_worker = StandaloneDraftWorker(
        server_args, gpu_id, tp_rank, dp_rank, moe_ep_rank, nccl_port, target_worker
    )

    # Some dummy tensors
    self.num_new_pages_per_topk = torch.empty(
        (), dtype=torch.int64, device=self.device
    )
    self.extend_lens = torch.empty((), dtype=torch.int64, device=self.device)

    self.plan_stream, self.plan_stream_ctx = _get_plan_stream(self.device)
```
The `__init__` method of `StandaloneWorkerV2` is a near-exact copy of `EAGLEWorkerV2.__init__`, with the only difference being the instantiation of `StandaloneDraftWorker` instead of `EagleDraftWorker`. To reduce code duplication and make the code more maintainable, you can call `super().__init__` and then just override the `self._draft_worker` attribute.
```python
def __init__(
    self,
    server_args: ServerArgs,
    gpu_id: int,
    tp_rank: int,
    dp_rank: Optional[int],
    moe_ep_rank: int,
    nccl_port: int,
    target_worker: TpModelWorker,
):
    super().__init__(
        server_args,
        gpu_id,
        tp_rank,
        dp_rank,
        moe_ep_rank,
        nccl_port,
        target_worker,
    )
    # Create our custom draft worker that doesn't share embeddings/lm_head
    self._draft_worker = StandaloneDraftWorker(
        server_args, gpu_id, tp_rank, dp_rank, moe_ep_rank, nccl_port, target_worker
    )
```
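One caveat with this suggestion, assuming `EAGLEWorkerV2.__init__` itself constructs an `EagleDraftWorker` (as the near-exact copy above implies): calling `super().__init__` builds a full draft worker only to immediately replace it. A hypothetical factory-method variant avoids that double construction:

```python
# Sketch only: _create_draft_worker is a hypothetical hook, not an existing
# method on EAGLEWorkerV2; the draft-worker classes are stubbed for brevity.
class EagleDraftWorker: ...
class StandaloneDraftWorker(EagleDraftWorker): ...

class EAGLEWorkerV2:
    def __init__(self, *args, **kwargs):
        # ... shared setup elided ...
        self._draft_worker = self._create_draft_worker(*args, **kwargs)

    def _create_draft_worker(self, *args, **kwargs):
        return EagleDraftWorker(*args, **kwargs)

class StandaloneWorkerV2(EAGLEWorkerV2):
    def _create_draft_worker(self, *args, **kwargs):
        # Standalone variant keeps its own embeddings/LM head.
        return StandaloneDraftWorker(*args, **kwargs)
```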
Motivation
Test And Benchmark
Env: 1*H200
standaloneV2
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist