
Conversation

@gty111 (Contributor) commented Oct 28, 2025

Collaboration with @liusy58, @ZhengWG and @ShangmingCai

Motivation

Related issues: #8223, #11355

Modifications

  1. Add arguments for launching a language-only model on the prefill instance (--language-only) and a vision-only encode instance (--encoder-only).
  2. Add a separate encode server for processing MM items when --encoder-only is used (sglang/srt/disaggregation/encode_server.py).
  3. Skip loading the language-part weights when --language-only is used (sglang/srt/models/qwen2_5_vl.py).
  4. Add an argument (--encoder-urls) for the language-only prefill instance to specify the list of encode-server URLs.
  5. Add an MMReceiver (sglang/srt/disaggregation/encode_receiver) to either TokenizerManager._tokenize_one_request or Scheduler.process_input_requests (sglang/srt/managers/scheduler) so that embeddings can be received at either level.
  6. Add the transfer backends zmq_to_scheduler, zmq_to_tokenizer, and mooncake (--encoder-transfer-backend). zmq_to_scheduler receives embeddings at the scheduler level; zmq_to_tokenizer and mooncake receive them at the tokenizer level. A conceptual sketch of this embedding hand-off is shown below.
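
To make the embedding hand-off concrete, below is a minimal, hypothetical sketch (not the actual encode_server / encode_receiver code; function names such as send_embedding and recv_embedding are made up for illustration) of how an encode instance could push a precomputed multimodal embedding to a language-only instance over ZeroMQ, which is the basic idea behind the zmq_to_tokenizer / zmq_to_scheduler backends.

# Hypothetical sketch of the ZMQ embedding hand-off; not the sglang implementation.
import io

import numpy as np
import zmq


def send_embedding(sock: zmq.Socket, req_id: str, emb: np.ndarray) -> None:
    # Encoder side: ship a precomputed MM embedding keyed by request id.
    buf = io.BytesIO()
    np.save(buf, emb, allow_pickle=False)  # shape and dtype travel with the payload
    sock.send_multipart([req_id.encode(), buf.getvalue()])


def recv_embedding(sock: zmq.Socket) -> tuple[str, np.ndarray]:
    # Prefill side: reconstruct the embedding before tokenization/scheduling.
    req_id, payload = sock.recv_multipart()
    return req_id.decode(), np.load(io.BytesIO(payload), allow_pickle=False)


if __name__ == "__main__":
    ctx = zmq.Context.instance()
    rx = ctx.socket(zmq.PULL)
    port = rx.bind_to_random_port("tcp://127.0.0.1")
    tx = ctx.socket(zmq.PUSH)
    tx.connect(f"tcp://127.0.0.1:{port}")

    # e.g. a fake [num_mm_tokens, hidden_size] fp16 embedding
    send_embedding(tx, "req-0", np.zeros((1989, 3584), dtype=np.float16))
    print(recv_embedding(rx)[1].shape)  # (1989, 3584)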

Accuracy Tests

Qwen/Qwen2.5-VL-7B-Instruct

  • Evaluation script
python benchmark/mmmu/bench_sglang.py \
    --port 8000 \
    --concurrency 128
answers saved to: ./answer_sglang.json
{'Accounting': {'acc': 0.433, 'num': 30},
 'Agriculture': {'acc': 0.533, 'num': 30},
 'Architecture_and_Engineering': {'acc': 0.467, 'num': 30},
 'Art': {'acc': 0.667, 'num': 30},
 'Art_Theory': {'acc': 0.833, 'num': 30},
 'Basic_Medical_Science': {'acc': 0.6, 'num': 30},
 'Biology': {'acc': 0.367, 'num': 30},
 'Chemistry': {'acc': 0.367, 'num': 30},
 'Clinical_Medicine': {'acc': 0.633, 'num': 30},
 'Computer_Science': {'acc': 0.433, 'num': 30},
 'Design': {'acc': 0.7, 'num': 30},
 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.433, 'num': 30},
 'Economics': {'acc': 0.467, 'num': 30},
 'Electronics': {'acc': 0.267, 'num': 30},
 'Energy_and_Power': {'acc': 0.267, 'num': 30},
 'Finance': {'acc': 0.367, 'num': 30},
 'Geography': {'acc': 0.433, 'num': 30},
 'History': {'acc': 0.667, 'num': 30},
 'Literature': {'acc': 0.8, 'num': 30},
 'Manage': {'acc': 0.367, 'num': 30},
 'Marketing': {'acc': 0.467, 'num': 30},
 'Materials': {'acc': 0.433, 'num': 30},
 'Math': {'acc': 0.5, 'num': 30},
 'Mechanical_Engineering': {'acc': 0.5, 'num': 30},
 'Music': {'acc': 0.433, 'num': 30},
 'Overall': {'acc': 0.512, 'num': 900},
 'Overall-Art and Design': {'acc': 0.658, 'num': 120},
 'Overall-Business': {'acc': 0.42, 'num': 150},
 'Overall-Health and Medicine': {'acc': 0.587, 'num': 150},
 'Overall-Humanities and Social Science': {'acc': 0.675, 'num': 120},
 'Overall-Science': {'acc': 0.42, 'num': 150},
 'Overall-Tech and Engineering': {'acc': 0.414, 'num': 210},
 'Pharmacy': {'acc': 0.633, 'num': 30},
 'Physics': {'acc': 0.433, 'num': 30},
 'Psychology': {'acc': 0.7, 'num': 30},
 'Public_Health': {'acc': 0.633, 'num': 30},
 'Sociology': {'acc': 0.533, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.512

Benchmarking and Profiling

Qwen/Qwen2.5-VL-7B-Instruct, 32 requests at 0.1 req/s, MMMU dataset.
Each prefill/decode/encode instance uses one GPU.
The workload is extended from one image per request to ten images per request.

  • To launch the prefill instance
MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30000
TP=1
MEM_FRACTION=0.5
CHUNK_SIZE=8192

SGLANG_VLM_CACHE_SIZE_MB=0 CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
    --model-path $MODEL \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend nixl \
    --tp $TP \
    --mem-fraction-static $MEM_FRACTION \
    --disable-radix-cache \
    --chunked-prefill-size $CHUNK_SIZE \
    --language-only \
    --encoder-urls http://127.0.0.1:30002 http://127.0.0.1:30003 http://127.0.0.1:30004 http://127.0.0.1:30005 http://127.0.0.1:30006 http://127.0.0.1:30007 \
    --port $PORT
  • To launch the decode instance
MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30001
TP=1

CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server \
    --model-path $MODEL \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend nixl \
    --tp $TP \
    --port $PORT
  • To launch an encode instance (repeated with different ports/GPUs for the six encoders at 30002-30007)
MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30002

CUDA_VISIBLE_DEVICES=2 taskset -c $1 python -m sglang.launch_server \
    --model-path $MODEL \
    --encoder-only \
    --port $PORT
  • To launch the mini load balancer (mini-lb)
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --mini-lb \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --port 8000
  • Benchmark script
NUM_PROMPTS=32
DATASET=mmmu
REQUEST_RATE=0.1
PORT=8000

python -m sglang.bench_serving \
    --num-prompts $NUM_PROMPTS \
    --dataset-name $DATASET \
    --port $PORT \
    --backend vllm-chat \
    --request-rate $REQUEST_RATE 

Original resolution

Config     Mean TTFT (ms)   Median TTFT (ms)   P99 TTFT (ms)
1P1D       525.53           405.72             1783.71
1P1D6E     482.62           381.58             1522.27
1P1D6E*    461.77           376.47             1522.07
1P1D6E**   381.42           342.35             1227.27

Resize to 1920 x 1080

Config     Mean TTFT (ms)   Median TTFT (ms)   P99 TTFT (ms)
1P1D       12666.24         10827.31           25316.57
1P1D6E     5620.66          5247.78            9378.84
1P1D6E*    4028.82          3683.43            6061.41
1P1D6E**   3651.36          3344.39            5315.62

*: further eliminates the preprocessing overhead.
**: after refactoring to MMReceiver.

Current benchmark results (random 1-8 images per request), updated on Dec 4:

Qwen3-VL-235B-A22B (FP8) H20

  • colocate
============ Serving Benchmark Result ============
Backend:                                 vllm-chat
Traffic request rate:                    0.5
Max request concurrency:                 not set
Successful requests:                     64
Benchmark duration (s):                  244.33
Total input tokens:                      719621
Total input text tokens:                 69587
Total input vision tokens:               650034
Total generated tokens:                  19200
Total generated tokens (retokenized):    18587
Request throughput (req/s):              0.26
Input token throughput (tok/s):          2945.26
Output token throughput (tok/s):         78.58
Total token throughput (tok/s):          3023.85
Concurrency:                             30.55
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   116643.41
Median E2E Latency (ms):                 120769.48
---------------Time to First Token----------------
Mean TTFT (ms):                          80213.08
Median TTFT (ms):                        101649.60
P99 TTFT (ms):                           124614.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          121.84
Median TPOT (ms):                        85.75
P99 TPOT (ms):                           373.92
---------------Inter-Token Latency----------------
Mean ITL (ms):                           128.84
Median ITL (ms):                         30.29
P95 ITL (ms):                            35.22
P99 ITL (ms):                            2190.96
Max ITL (ms):                            58096.81
==================================================
  • Ex2 + PDx1
Backend:                                 vllm-chat
Traffic request rate:                    0.5
Max request concurrency:                 not set
Successful requests:                     64
Benchmark duration (s):                  134.63
Total input tokens:                      719614
Total input text tokens:                 69580
Total input vision tokens:               650034
Total generated tokens:                  19200
Total generated tokens (retokenized):    19086
Request throughput (req/s):              0.48
Input token throughput (tok/s):          5345.26
Output token throughput (tok/s):         142.62
Total token throughput (tok/s):          5487.88
Concurrency:                             12.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   26580.97
Median E2E Latency (ms):                 26594.85
---------------Time to First Token----------------
Mean TTFT (ms):                          6442.68
Median TTFT (ms):                        5528.03
P99 TTFT (ms):                           13877.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.35
Median TPOT (ms):                        71.22
P99 TPOT (ms):                           87.47
---------------Inter-Token Latency----------------
Mean ITL (ms):                           68.45
Median ITL (ms):                         28.70
P95 ITL (ms):                            60.81
P99 ITL (ms):                            576.76
Max ITL (ms):                            9590.50
==================================================

Qwen3-VL-30B-A3B H100

  • colocate
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.10     
Total input tokens:                      379190    
Total input text tokens:                 72502     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.89   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5333.17   
Concurrency:                             1.29      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2865.13   
Median E2E Latency (ms):                 2530.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          1769.44   
Median TTFT (ms):                        1699.01   
P99 TTFT (ms):                           6265.33   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
  • Ex4 + PDx1
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.12     
Total input tokens:                      379258    
Total input text tokens:                 72570     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.67   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5332.95   
Concurrency:                             0.66      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1468.98   
Median E2E Latency (ms):                 1404.18   
---------------Time to First Token----------------
Mean TTFT (ms):                          960.22    
Median TTFT (ms):                        1052.19   
P99 TTFT (ms):                           2907.76   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Qwen2.5-VL-7B H100

  • colocate
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.81     
Total input tokens:                      244968    
Total input text tokens:                 67664     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.79   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.98   
Concurrency:                             0.40      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1191.23   
Median E2E Latency (ms):                 1187.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          718.54    
Median TTFT (ms):                        717.21    
P99 TTFT (ms):                           2108.17   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
  • Ex3 + PDx1
#Input tokens: 244991
#Output tokens: 18
#Total images: 148
#Images per request: min=1, max=8, mean=4.62

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.82     
Total input tokens:                      244991    
Total input text tokens:                 67687     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.73   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.91   
Concurrency:                             0.20      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   609.72    
Median E2E Latency (ms):                 611.81    
---------------Time to First Token----------------
Mean TTFT (ms):                          378.07    
Median TTFT (ms):                        465.47    
P99 TTFT (ms):                           993.11    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @gty111, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural change by implementing Encode Prefill Disaggregation (EPD). This disaggregation separates the computationally intensive multimodal encoding process from the language model inference, allowing for specialized servers to handle image processing. The system now supports dedicated 'encode servers' that process visual inputs and transmit the resulting embeddings to 'language-only' prefill servers, enhancing efficiency and scalability for multimodal large language models (MLLMs).

Highlights

  • New Encode Server Entrypoint: A new entrypoint encode_server.py has been added to launch a dedicated server for multimodal encoding, activated by the --mm-only flag.
  • Model Configuration Flags: New mm_only and language_only flags are introduced in ModelConfig to allow loading only the multimodal encoder or only the language model components, respectively, enabling specialized server roles.
  • Image Encoding Logic: A new ImageEncoder class is implemented to handle image preprocessing (using transformers.AutoImageProcessor), resizing, and feature extraction, sending the resulting embeddings to other services via ZeroMQ.
  • Multimodal Embedding Handling: The _get_precomputed_embedding function in mm_utils.py is enhanced to support more precise slicing and handling of precomputed multimodal embeddings based on sequence lengths and offsets.
  • Tokenizer Manager Integration: The TokenizerManager now includes logic to receive precomputed embeddings from the encoding server via ZeroMQ when operating in a disaggregated PREFILL mode with language_only enabled, ensuring these embeddings are available for tokenization.
  • Conditional Model Loading: The Qwen2_5_VL model's initialization is updated to conditionally load its visual and language components based on the mm_only and language_only configuration flags, optimizing resource usage for specialized servers (a conceptual sketch follows after this list).
  • Server Argument Extensions: New command-line arguments --mm-only, --language-only, and --embedding-port are added to server_args.py to configure the disaggregated multimodal encoding setup.
  • Router Support for Encode Servers: The sgl-router now supports encode_urls to manage and distribute image encoding requests across multiple dedicated encode servers, improving scalability for multimodal inputs.
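
As an illustration of the conditional model loading highlighted above, here is a minimal, hypothetical sketch (not the actual Qwen2_5_VL code; the class, flag, and module names are assumptions) of gating which sub-modules are built and which checkpoint weights are kept:

# Hypothetical sketch of encoder-only / language-only construction; names are illustrative.
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class VLMConfig:
    encoder_only: bool = False   # build only the vision tower
    language_only: bool = False  # build only the language model
    hidden_size: int = 3584


class DisaggregatedVLM(nn.Module):
    def __init__(self, cfg: VLMConfig):
        super().__init__()
        self.cfg = cfg
        # Language-only (prefill) instances skip the vision tower entirely.
        self.visual = None if cfg.language_only else nn.Linear(1176, cfg.hidden_size)
        # Encoder-only instances skip the language model entirely.
        self.model = None if cfg.encoder_only else nn.Linear(cfg.hidden_size, cfg.hidden_size)

    def load_weights(self, named_weights):
        # Drop checkpoint tensors that belong to a sub-module this instance did not build.
        for name, tensor in named_weights:
            if self.visual is None and name.startswith("visual."):
                continue
            if self.model is None and name.startswith("model."):
                continue
            # ... hand the remaining tensors to the usual weight loader ...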

@gemini-code-assist bot left a comment

Code Review

This pull request implements an Encode-Prefill-Decode (EPD) disaggregation strategy by introducing a separate server for multimodal encoding. The changes are extensive, touching server launch logic, configuration, model loading, and adding a new encoder server. While the overall approach is sound, there are a few critical issues to address. A key problem is the EmbeddingData class definition, which isn't shared between the new encoder server and the tokenizer manager and will cause runtime failures. Additionally, the encoder server has hardcoded model-specific logic, limiting its extensibility. I've provided detailed comments on these points and a suggestion to improve code readability.

@hzh0425 self-assigned this on Oct 28, 2025
@mickqian changed the title from "Implement EPD" to "feat: support EPD" on Oct 28, 2025
@mickqian changed the title from "feat: support EPD" to "feat: support EPD disaggregation" on Oct 28, 2025
@ShangmingCai (Collaborator) left a comment

Changes look clean. Will finish the first round of review this week.

@ShangmingCai (Collaborator) left a comment

QQ: So we assume PD Disaggregation is enabled by default in this version? I thought we discussed that it is basically an implementation of Encoder DP, which I think should also work when Encoder is disaggregated while Prefill and Decode are not.

@gty111 (Contributor, Author) commented Oct 31, 2025

QQ: So we assume PD Disaggregation is enabled by default in this version? I thought we discussed that it is basically an implementation of Encoder DP, which I think should also work when Encoder is disaggregated while Prefill and Decode are not.

Now we support E + PD colocation on top of EPD disaggregation. The usage is similar.

  • For prefill+decode server
MODEL=Qwen/Qwen2.5-VL-7B-Instruct

SGLANG_VLM_CACHE_SIZE_MB=0 CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
    --model-path $MODEL \
    --language-only \
    --encoder-urls http://127.0.0.1:30001 \
    --port 30000
  • For the encode server, the script is the same as in EPD disaggregation.
MODEL=Qwen/Qwen2.5-VL-7B-Instruct

CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server \
    --model-path $MODEL \
    --encoder-only \
    --port 30001

@gty111 force-pushed the epd_rebase branch 2 times, most recently from 27bc02b to af35caf on November 3, 2025 02:12
@ZhengWG (Contributor) commented Nov 3, 2025

Have you tested with larger model sizes? I noticed that the embedding data is transmitted via TCP, which could be time-consuming if the embedding data is relatively large.

@QiuMike (Contributor) commented Nov 3, 2025

From your benchmark, 1P1D uses 2 cards while 1P1D6E uses 8 cards, yet TTFT only decreased by about 50 ms. Am I right?
And how about the QPS improvement at a given SLO?

@gty111 (Contributor, Author) commented Nov 3, 2025

Have you tested with larger model sizes? I noticed that the embedding data is transmitted via TCP, which could be time-consuming if the embedding data is relatively large.

Embedding shape (rows, cols)   Transport time (s)   Speed (MB/s)
[13641, 3584]                  0.213                438
[11652, 3584]                  0.208                374
[9663, 3584]                   0.141                310
[8142, 3584]                   0.125                261
[7302, 3584]                   0.115                234
[5313, 3584]                   0.077                170
[4017, 3584]                   0.078                129
[1989, 3584]                   0.038                 64

The TCP transport is an initial workaround; next we can use nixl or mooncake to transmit the embeddings. A rough payload-sizing sketch is given below.
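
For rough sizing only (a back-of-envelope sketch assuming fp16/bf16 embeddings; it is not meant to reproduce the measured speeds above, which include serialization and other overheads):

# Hypothetical payload-size estimate for a [rows, hidden] embedding tensor.
def embedding_payload_mb(rows: int, cols: int, bytes_per_elem: int = 2) -> float:
    # fp16/bf16 assumed (2 bytes per element)
    return rows * cols * bytes_per_elem / 1e6

# A [13641, 3584] fp16 embedding is ~98 MB of raw payload, which is why moving
# from plain TCP to a nixl/mooncake-style backend is attractive for large inputs.
print(round(embedding_payload_mb(13641, 3584), 1))  # 97.8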

@gty111 (Contributor, Author) commented Nov 3, 2025

From your benchmark, 1P1D uses 2 cards while 1P1D6E uses 8 cards, yet TTFT only decreased by about 50 ms. Am I right? And how about the QPS improvement at a given SLO?

The effectiveness of encoder disaggregation depends on the number of images per request and the number of tokens generated per image. By enabling multiple encoders to process images in parallel, the encoding latency can be reduced compared to the colocated setup. The improvement in QPS, however, depends on the test dataset and the configuration described above.
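
As a purely illustrative, back-of-the-envelope model of why parallel encoders help TTFT (hypothetical numbers; it ignores queueing, embedding transfer, and scheduler effects):

# Hypothetical TTFT model for encoder disaggregation; all inputs are illustrative.
import math


def approx_ttft_ms(encode_ms_per_image: float, images_per_req: int,
                   num_encoders: int, prefill_ms: float) -> float:
    # Images of one request are spread across encoders and encoded in waves,
    # then the prefill runs over the text tokens plus precomputed embeddings.
    waves = math.ceil(images_per_req / max(num_encoders, 1))
    return waves * encode_ms_per_image + prefill_ms


# e.g. 10 images at 300 ms each, 200 ms prefill:
print(approx_ttft_ms(300, 10, 1, 200))  # 3200 -> roughly the colocated case
print(approx_ttft_ms(300, 10, 6, 200))  # 800  -> six dedicated encoders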

@mickqian (Collaborator) commented Dec 9, 2025

/rerun-failed-ci

@ShangmingCai (Collaborator) commented Dec 9, 2025

/rerun-failed-ci 2

@mickqian (Collaborator) commented

Please stop merging new commits here; let's see what happens.

@ShangmingCai (Collaborator) commented

/rerun-failed-ci

Copilot AI review requested due to automatic review settings December 12, 2025 06:02
Copilot AI left a comment

Pull request overview

This PR introduces support for Encoder-Prefill-Decode (EPD) disaggregation, enabling separate servers for vision encoding, prefill, and decode operations in multimodal language models. This architecture allows for better resource utilization and performance scaling for vision-language models.

  • Adds --encoder-only and --language-only flags to launch dedicated encoder and language-model-only servers
  • Implements three transfer backends (zmq_to_scheduler, zmq_to_tokenizer, mooncake) for embedding communication
  • Introduces MMReceiver component for handling multimodal embeddings across disaggregated instances

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 23 comments.

Show a summary per file
File Description
test/srt/test_epd_disaggregation.py Adds comprehensive tests for EPD disaggregation with single and multiple encoder configurations
test/srt/run_suite.py Registers new EPD disaggregation test with 600s timeout
python/sglang/srt/server_args.py Adds new CLI arguments for encoder disaggregation (encoder-only, language-only, encoder-urls, encoder-transfer-backend) with validation
python/sglang/srt/multimodal/processors/qwen_vl.py Adds get_mm_data method to build multimodal data from precomputed embeddings
python/sglang/srt/multimodal/processors/dots_vlm.py Renames token ID attributes from IM_START_ID/IM_END_ID to IM_START_TOKEN_ID/IM_END_TOKEN_ID for consistency
python/sglang/srt/multimodal/processors/base_processor.py Adds base methods for building input IDs and processing multimodal data from embeddings
python/sglang/srt/models/qwen3_vl_moe.py Adds encoder-only/language-only weight loading support to skip unnecessary model components
python/sglang/srt/models/qwen3_vl.py Conditionally initializes language model components based on encoder-only mode
python/sglang/srt/models/qwen2_5_vl.py Reorders initialization to handle encoder-only mode without loading language model weights
python/sglang/srt/models/dots_vlm.py Adds encoder-only mode support to skip language model initialization
python/sglang/srt/managers/tokenizer_manager.py Integrates MMReceiver for handling embeddings in zmq_to_tokenizer/mooncake backends
python/sglang/srt/managers/scheduler.py Integrates MMReceiver for handling embeddings in zmq_to_scheduler backend
python/sglang/srt/managers/mm_utils.py Enhances precomputed embedding handling with chunked prefill support
python/sglang/srt/managers/io_struct.py Adds fields for tracking embedding ports and image waiting status
python/sglang/srt/disaggregation/encode_server.py Implements dedicated encode server with FastAPI endpoints for encoding and sending embeddings
python/sglang/srt/disaggregation/encode_receiver.py Implements MMReceiver class for receiving embeddings from encode servers
python/sglang/srt/configs/model_config.py Adds encoder_only and language_only configuration parameters
python/sglang/launch_server.py Routes to encode_server when encoder-only mode is specified
python/sglang/bench_serving.py Adds random image count feature for benchmarking variable multimodal workloads
Comments suppressed due to low confidence (1)

python/sglang/srt/models/qwen2_5_vl.py:680

  • The condition on line 676 checks hasattr(self, "model") before accessing self.model.start_layer. However, in encoder-only mode, self.model is not initialized (as seen in qwen2_5_vl.py lines 481-503). While this check prevents the AttributeError, the logic should be clearer about when model exists. Consider restructuring to check self.config.encoder_only first.
                if (
                    layer_id is not None
                    and hasattr(self, "model")
                    and hasattr(self.model, "start_layer")
                    and (
                        layer_id < self.model.start_layer
                        or layer_id >= self.model.end_layer


@yhyang201 (Collaborator) commented

/tag-and-rerun-ci

@yhyang201 (Collaborator) commented

LGTM.
This PR contains a relatively large amount of code, so we also ran the Nightly Test: https://github.com/sgl-project/sglang/actions/runs/20193317653

I will continue to monitor the CI and Nightly Test results; if any issues arise, I will rerun them and keep an eye on the outcomes.

@yhyang201 (Collaborator) commented

PR CI:

This PR includes two key tests:

In the second-to-last commit, test_epd_disaggregation.py encountered an issue, while all other tests passed:
https://github.com/sgl-project/sglang/actions/runs/20193135324/job/57995148367

The final commit fixes test/srt/test_epd_disaggregation.py, involving only a parameter-loading–related change. After this fix, all tests passed except for unit-test-backend-8-gpu-h200. Notably, this test passed in the second-to-last commit, so I believe the recent change should not affect its result. The current failure is more likely due to CI environment instability:
https://github.com/sgl-project/sglang/actions/runs/20202349810/job/58003156153?pr=12263

Nightly CI:

nightly-test-vlm-accuracy-2-gpu-runner: Passed
nightly-test-vlm-perf-2-gpu-runner: Passed

https://github.com/sgl-project/sglang/actions/runs/20193317653/job/57995164803

nightly-test-multimodal-server-2-gpu (1): failed, but this PR is unrelated to SGL-Diffusion, so the failure should not be related to this PR.
nightly-test-general-1-gpu-runner: nightly/test_lora_eviction_policy.py failed. This test has been failing in several recent nightly CI runs, so the failure is considered unrelated to this PR.


In summary, the remaining CI failures are most likely due to CI instability or factors unrelated to this PR. The code changes and relevant tests for this PR are in good shape, and I believe the PR meets the criteria for merging.

@mickqian merged commit 9acb21a into sgl-project:main on Dec 14, 2025 (344 of 373 checks passed)
@ShangmingCai (Collaborator) commented

Thanks for the review. If any bug is reported by our production team, we will fix it with @gty111 @liusy58 @ZhengWG ASAP.
