
Conversation

@gty111 (Contributor) commented Oct 28, 2025

Collaboration with @liusy58, @ZhengWG and @ShangmingCai

Motivation

Related issues: #8223, #11355

Modifications

  1. Add arguments for launching a language-only model on the prefill instance (--language-only) and a vision-only encode instance (--encoder-only).
  2. Add a separate encode server for processing MM items when --encoder-only is used (sglang/srt/disaggregation/encode_server.py).
  3. Skip loading the language-part weights when --language-only is used (sglang/srt/models/qwen2_5_vl.py).
  4. Add an argument (--encoder-urls) for the language-only prefill instance to specify the list of encode-server URLs.
  5. Add an MMReceiver (sglang/srt/disaggregation/encode_receiver) to either TokenizerManager._tokenize_one_request or Scheduler.process_input_requests (sglang/srt/managers/scheduler) so that embeddings can be received at either level.
  6. Add the transfer backends zmq_to_scheduler, zmq_to_tokenizer, and mooncake (--encoder-transfer-backend). zmq_to_scheduler receives embeddings at the scheduler level; zmq_to_tokenizer and mooncake receive them at the tokenizer level. A conceptual sketch of this embedding hand-off is shown below.
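
To make the embedding hand-off concrete, below is a minimal, hypothetical sketch (not the actual encode_server / encode_receiver code; function names such as send_embedding and recv_embedding are made up for illustration) of how an encode instance could push a precomputed multimodal embedding to a language-only instance over ZeroMQ, which is the basic idea behind the zmq_to_tokenizer / zmq_to_scheduler backends.

# Hypothetical sketch of the ZMQ embedding hand-off; not the sglang implementation.
import io

import numpy as np
import zmq


def send_embedding(sock: zmq.Socket, req_id: str, emb: np.ndarray) -> None:
    # Encoder side: ship a precomputed MM embedding keyed by request id.
    buf = io.BytesIO()
    np.save(buf, emb, allow_pickle=False)  # shape and dtype travel with the payload
    sock.send_multipart([req_id.encode(), buf.getvalue()])


def recv_embedding(sock: zmq.Socket) -> tuple[str, np.ndarray]:
    # Prefill side: reconstruct the embedding before tokenization/scheduling.
    req_id, payload = sock.recv_multipart()
    return req_id.decode(), np.load(io.BytesIO(payload), allow_pickle=False)


if __name__ == "__main__":
    ctx = zmq.Context.instance()
    rx = ctx.socket(zmq.PULL)
    port = rx.bind_to_random_port("tcp://127.0.0.1")
    tx = ctx.socket(zmq.PUSH)
    tx.connect(f"tcp://127.0.0.1:{port}")

    # e.g. a fake [num_mm_tokens, hidden_size] fp16 embedding
    send_embedding(tx, "req-0", np.zeros((1989, 3584), dtype=np.float16))
    print(recv_embedding(rx)[1].shape)  # (1989, 3584)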

Accuracy Tests

Qwen/Qwen2.5-VL-7B-Instruct

  • Evaluation script
python benchmark/mmmu/bench_sglang.py \
    --port 8000 \
    --concurrency 128
answers saved to: ./answer_sglang.json
{'Accounting': {'acc': 0.433, 'num': 30},
 'Agriculture': {'acc': 0.533, 'num': 30},
 'Architecture_and_Engineering': {'acc': 0.467, 'num': 30},
 'Art': {'acc': 0.667, 'num': 30},
 'Art_Theory': {'acc': 0.833, 'num': 30},
 'Basic_Medical_Science': {'acc': 0.6, 'num': 30},
 'Biology': {'acc': 0.367, 'num': 30},
 'Chemistry': {'acc': 0.367, 'num': 30},
 'Clinical_Medicine': {'acc': 0.633, 'num': 30},
 'Computer_Science': {'acc': 0.433, 'num': 30},
 'Design': {'acc': 0.7, 'num': 30},
 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.433, 'num': 30},
 'Economics': {'acc': 0.467, 'num': 30},
 'Electronics': {'acc': 0.267, 'num': 30},
 'Energy_and_Power': {'acc': 0.267, 'num': 30},
 'Finance': {'acc': 0.367, 'num': 30},
 'Geography': {'acc': 0.433, 'num': 30},
 'History': {'acc': 0.667, 'num': 30},
 'Literature': {'acc': 0.8, 'num': 30},
 'Manage': {'acc': 0.367, 'num': 30},
 'Marketing': {'acc': 0.467, 'num': 30},
 'Materials': {'acc': 0.433, 'num': 30},
 'Math': {'acc': 0.5, 'num': 30},
 'Mechanical_Engineering': {'acc': 0.5, 'num': 30},
 'Music': {'acc': 0.433, 'num': 30},
 'Overall': {'acc': 0.512, 'num': 900},
 'Overall-Art and Design': {'acc': 0.658, 'num': 120},
 'Overall-Business': {'acc': 0.42, 'num': 150},
 'Overall-Health and Medicine': {'acc': 0.587, 'num': 150},
 'Overall-Humanities and Social Science': {'acc': 0.675, 'num': 120},
 'Overall-Science': {'acc': 0.42, 'num': 150},
 'Overall-Tech and Engineering': {'acc': 0.414, 'num': 210},
 'Pharmacy': {'acc': 0.633, 'num': 30},
 'Physics': {'acc': 0.433, 'num': 30},
 'Psychology': {'acc': 0.7, 'num': 30},
 'Public_Health': {'acc': 0.633, 'num': 30},
 'Sociology': {'acc': 0.533, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.512

Benchmarking and Profiling

Qwen/Qwen2.5-VL-7B-Instruct, 32 requests at 0.1 req/s, MMMU dataset.
Each prefill/decode/encode instance uses one GPU.
The workload is extended from one image per request to ten images per request.

  • To launch the prefill instance
MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30000
TP=1
MEM_FRACTION=0.5
CHUNK_SIZE=8192

SGLANG_VLM_CACHE_SIZE_MB=0 CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
    --model-path $MODEL \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend nixl \
    --tp $TP \
    --mem-fraction-static $MEM_FRACTION \
    --disable-radix-cache \
    --chunked-prefill-size $CHUNK_SIZE \
    --language-only \
    --encoder-urls http://127.0.0.1:30002 http://127.0.0.1:30003 http://127.0.0.1:30004 http://127.0.0.1:30005 http://127.0.0.1:30006 http://127.0.0.1:30007 \
    --port $PORT
  • To launch the decode instance
MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30001
TP=1

CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server \
    --model-path $MODEL \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend nixl \
    --tp $TP \
    --port $PORT
  • To launch an encode instance (repeated with different ports/GPUs for the six encoders at 30002-30007)
MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30002

CUDA_VISIBLE_DEVICES=2 taskset -c $1 python -m sglang.launch_server \
    --model-path $MODEL \
    --encoder-only \
    --port $PORT
  • To launch the mini load balancer (mini-lb)
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --mini-lb \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --port 8000
  • Benchmark script
NUM_PROMPTS=32
DATASET=mmmu
REQUEST_RATE=0.1
PORT=8000

python -m sglang.bench_serving \
    --num-prompts $NUM_PROMPTS \
    --dataset-name $DATASET \
    --port $PORT \
    --backend vllm-chat \
    --request-rate $REQUEST_RATE 

Original resolution

Config     Mean TTFT (ms)   Median TTFT (ms)   P99 TTFT (ms)
1P1D       525.53           405.72             1783.71
1P1D6E     482.62           381.58             1522.27
1P1D6E*    461.77           376.47             1522.07
1P1D6E**   381.42           342.35             1227.27

Resize to 1920 x 1080

Config     Mean TTFT (ms)   Median TTFT (ms)   P99 TTFT (ms)
1P1D       12666.24         10827.31           25316.57
1P1D6E     5620.66          5247.78            9378.84
1P1D6E*    4028.82          3683.43            6061.41
1P1D6E**   3651.36          3344.39            5315.62

*: further eliminates the preprocessing overhead.
**: after refactoring to MMReceiver.

Current benchmark results (random 1-8 images per request), updated on Dec 4:

Qwen3-VL-235B-A22B (FP8) H20

  • colocate
============ Serving Benchmark Result ============
Backend:                                 vllm-chat
Traffic request rate:                    0.5
Max request concurrency:                 not set
Successful requests:                     64
Benchmark duration (s):                  244.33
Total input tokens:                      719621
Total input text tokens:                 69587
Total input vision tokens:               650034
Total generated tokens:                  19200
Total generated tokens (retokenized):    18587
Request throughput (req/s):              0.26
Input token throughput (tok/s):          2945.26
Output token throughput (tok/s):         78.58
Total token throughput (tok/s):          3023.85
Concurrency:                             30.55
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   116643.41
Median E2E Latency (ms):                 120769.48
---------------Time to First Token----------------
Mean TTFT (ms):                          80213.08
Median TTFT (ms):                        101649.60
P99 TTFT (ms):                           124614.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          121.84
Median TPOT (ms):                        85.75
P99 TPOT (ms):                           373.92
---------------Inter-Token Latency----------------
Mean ITL (ms):                           128.84
Median ITL (ms):                         30.29
P95 ITL (ms):                            35.22
P99 ITL (ms):                            2190.96
Max ITL (ms):                            58096.81
==================================================
  • Ex2 + PDx1
Backend:                                 vllm-chat
Traffic request rate:                    0.5
Max request concurrency:                 not set
Successful requests:                     64
Benchmark duration (s):                  134.63
Total input tokens:                      719614
Total input text tokens:                 69580
Total input vision tokens:               650034
Total generated tokens:                  19200
Total generated tokens (retokenized):    19086
Request throughput (req/s):              0.48
Input token throughput (tok/s):          5345.26
Output token throughput (tok/s):         142.62
Total token throughput (tok/s):          5487.88
Concurrency:                             12.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   26580.97
Median E2E Latency (ms):                 26594.85
---------------Time to First Token----------------
Mean TTFT (ms):                          6442.68
Median TTFT (ms):                        5528.03
P99 TTFT (ms):                           13877.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.35
Median TPOT (ms):                        71.22
P99 TPOT (ms):                           87.47
---------------Inter-Token Latency----------------
Mean ITL (ms):                           68.45
Median ITL (ms):                         28.70
P95 ITL (ms):                            60.81
P99 ITL (ms):                            576.76
Max ITL (ms):                            9590.50
==================================================

Qwen3-VL-30B-A3B H100

  • colocate
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.10     
Total input tokens:                      379190    
Total input text tokens:                 72502     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.89   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5333.17   
Concurrency:                             1.29      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2865.13   
Median E2E Latency (ms):                 2530.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          1769.44   
Median TTFT (ms):                        1699.01   
P99 TTFT (ms):                           6265.33   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
  • Ex4 + PDx1
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.12     
Total input tokens:                      379258    
Total input text tokens:                 72570     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.67   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5332.95   
Concurrency:                             0.66      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1468.98   
Median E2E Latency (ms):                 1404.18   
---------------Time to First Token----------------
Mean TTFT (ms):                          960.22    
Median TTFT (ms):                        1052.19   
P99 TTFT (ms):                           2907.76   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Qwen2.5-VL-7B H100

  • colocate
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.81     
Total input tokens:                      244968    
Total input text tokens:                 67664     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.79   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.98   
Concurrency:                             0.40      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1191.23   
Median E2E Latency (ms):                 1187.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          718.54    
Median TTFT (ms):                        717.21    
P99 TTFT (ms):                           2108.17   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
  • Ex3 + PDx1
#Input tokens: 244991
#Output tokens: 18
#Total images: 148
#Images per request: min=1, max=8, mean=4.62

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.82     
Total input tokens:                      244991    
Total input text tokens:                 67687     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.73   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.91   
Concurrency:                             0.20      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   609.72    
Median E2E Latency (ms):                 611.81    
---------------Time to First Token----------------
Mean TTFT (ms):                          378.07    
Median TTFT (ms):                        465.47    
P99 TTFT (ms):                           993.11    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @gty111, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural change by implementing Encode Prefill Disaggregation (EPD). This disaggregation separates the computationally intensive multimodal encoding process from the language model inference, allowing for specialized servers to handle image processing. The system now supports dedicated 'encode servers' that process visual inputs and transmit the resulting embeddings to 'language-only' prefill servers, enhancing efficiency and scalability for multimodal large language models (MLLMs).

Highlights

  • New Encode Server Entrypoint: A new entrypoint encode_server.py has been added to launch a dedicated server for multimodal encoding, activated by the --mm-only flag.
  • Model Configuration Flags: New mm_only and language_only flags are introduced in ModelConfig to allow loading only the multimodal encoder or only the language model components, respectively, enabling specialized server roles.
  • Image Encoding Logic: A new ImageEncoder class is implemented to handle image preprocessing (using transformers.AutoImageProcessor), resizing, and feature extraction, sending the resulting embeddings to other services via ZeroMQ.
  • Multimodal Embedding Handling: The _get_precomputed_embedding function in mm_utils.py is enhanced to support more precise slicing and handling of precomputed multimodal embeddings based on sequence lengths and offsets.
  • Tokenizer Manager Integration: The TokenizerManager now includes logic to receive precomputed embeddings from the encoding server via ZeroMQ when operating in a disaggregated PREFILL mode with language_only enabled, ensuring these embeddings are available for tokenization.
  • Conditional Model Loading: The Qwen2_5_VL model's initialization is updated to conditionally load its visual and language components based on the mm_only and language_only configuration flags, optimizing resource usage for specialized servers (a conceptual sketch follows after this list).
  • Server Argument Extensions: New command-line arguments --mm-only, --language-only, and --embedding-port are added to server_args.py to configure the disaggregated multimodal encoding setup.
  • Router Support for Encode Servers: The sgl-router now supports encode_urls to manage and distribute image encoding requests across multiple dedicated encode servers, improving scalability for multimodal inputs.
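
As an illustration of the conditional model loading highlighted above, here is a minimal, hypothetical sketch (not the actual Qwen2_5_VL code; the class, flag, and module names are assumptions) of gating which sub-modules are built and which checkpoint weights are kept:

# Hypothetical sketch of encoder-only / language-only construction; names are illustrative.
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class VLMConfig:
    encoder_only: bool = False   # build only the vision tower
    language_only: bool = False  # build only the language model
    hidden_size: int = 3584


class DisaggregatedVLM(nn.Module):
    def __init__(self, cfg: VLMConfig):
        super().__init__()
        self.cfg = cfg
        # Language-only (prefill) instances skip the vision tower entirely.
        self.visual = None if cfg.language_only else nn.Linear(1176, cfg.hidden_size)
        # Encoder-only instances skip the language model entirely.
        self.model = None if cfg.encoder_only else nn.Linear(cfg.hidden_size, cfg.hidden_size)

    def load_weights(self, named_weights):
        # Drop checkpoint tensors that belong to a sub-module this instance did not build.
        for name, tensor in named_weights:
            if self.visual is None and name.startswith("visual."):
                continue
            if self.model is None and name.startswith("model."):
                continue
            # ... hand the remaining tensors to the usual weight loader ...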

@gemini-code-assist bot left a comment

Code Review

This pull request implements an Encode-Prefill-Decode (EPD) disaggregation strategy by introducing a separate server for multimodal encoding. The changes are extensive, touching server launch logic, configuration, model loading, and adding a new encoder server. While the overall approach is sound, there are a few critical issues to address. A key problem is the EmbeddingData class definition, which isn't shared between the new encoder server and the tokenizer manager and will cause runtime failures. Additionally, the encoder server has hardcoded model-specific logic, limiting its extensibility. I've provided detailed comments on these points and a suggestion to improve code readability.

@hzh0425 self-assigned this on Oct 28, 2025
@mickqian changed the title from "Implement EPD" to "feat: support EPD" on Oct 28, 2025
@mickqian changed the title from "feat: support EPD" to "feat: support EPD disaggregation" on Oct 28, 2025
@ShangmingCai (Collaborator) left a comment

Changes look clean. Will finish the first round of review this week.

@ShangmingCai (Collaborator) left a comment

QQ: So we assume PD Disaggregation is enabled by default in this version? I thought we discussed that it is basically an implementation of Encoder DP, which I think should also work when Encoder is disaggregated while Prefill and Decode are not.

@gty111 (Contributor, Author) commented Oct 31, 2025

QQ: So we assume PD Disaggregation is enabled by default in this version? I thought we discussed that it is basically an implementation of Encoder DP, which I think should also work when Encoder is disaggregated while Prefill and Decode are not.

Now we support E + PD colocation on top of EPD disaggregation. The usage is similar.

  • For prefill+decode server
MODEL=Qwen/Qwen2.5-VL-7B-Instruct

SGLANG_VLM_CACHE_SIZE_MB=0 CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
    --model-path $MODEL \
    --language-only \
    --encoder-urls http://127.0.0.1:30001 \
    --port 30000
  • For the encode server, the script is the same as in EPD disaggregation.
MODEL=Qwen/Qwen2.5-VL-7B-Instruct

CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server \
    --model-path $MODEL \
    --encoder-only \
    --port 30001

@gty111 force-pushed the epd_rebase branch 2 times, most recently from 27bc02b to af35caf on November 3, 2025 02:12
@ZhengWG (Contributor) commented Nov 3, 2025

Have you tested with larger model sizes? I noticed that the embedding data is transmitted via TCP, which could be time-consuming if the embedding data is relatively large.

@QiuMike (Contributor) commented Nov 3, 2025

From your benchmark, 1P1D uses 2 cards while 1P1D6E uses 8 cards, yet TTFT only decreased by about 50 ms. Am I right?
And how about the QPS improvement at a given SLO?

@gty111 (Contributor, Author) commented Nov 3, 2025

Have you tested with larger model sizes? I noticed that the embedding data is transmitted via TCP, which could be time-consuming if the embedding data is relatively large.

Embedding shape (rows, cols)   Transport time (s)   Speed (MB/s)
[13641, 3584]                  0.213                438
[11652, 3584]                  0.208                374
[9663, 3584]                   0.141                310
[8142, 3584]                   0.125                261
[7302, 3584]                   0.115                234
[5313, 3584]                   0.077                170
[4017, 3584]                   0.078                129
[1989, 3584]                   0.038                 64

The TCP transport is an initial workaround; next we can use nixl or mooncake to transmit the embeddings. A rough payload-sizing sketch is given below.
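
For rough sizing only (a back-of-envelope sketch assuming fp16/bf16 embeddings; it is not meant to reproduce the measured speeds above, which include serialization and other overheads):

# Hypothetical payload-size estimate for a [rows, hidden] embedding tensor.
def embedding_payload_mb(rows: int, cols: int, bytes_per_elem: int = 2) -> float:
    # fp16/bf16 assumed (2 bytes per element)
    return rows * cols * bytes_per_elem / 1e6

# A [13641, 3584] fp16 embedding is ~98 MB of raw payload, which is why moving
# from plain TCP to a nixl/mooncake-style backend is attractive for large inputs.
print(round(embedding_payload_mb(13641, 3584), 1))  # 97.8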

@gty111 (Contributor, Author) commented Nov 3, 2025

From your benchmark, 1P1D uses 2 cards while 1P1D6E uses 8 cards, yet TTFT only decreased by about 50 ms. Am I right? And how about the QPS improvement at a given SLO?

The effectiveness of encoder disaggregation depends on the number of images per request and the number of tokens generated per image. By enabling multiple encoders to process images in parallel, the encoding latency can be reduced compared to the colocated setup. The improvement in QPS, however, depends on the test dataset and the configuration described above.
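
As a purely illustrative, back-of-the-envelope model of why parallel encoders help TTFT (hypothetical numbers; it ignores queueing, embedding transfer, and scheduler effects):

# Hypothetical TTFT model for encoder disaggregation; all inputs are illustrative.
import math


def approx_ttft_ms(encode_ms_per_image: float, images_per_req: int,
                   num_encoders: int, prefill_ms: float) -> float:
    # Images of one request are spread across encoders and encoded in waves,
    # then the prefill runs over the text tokens plus precomputed embeddings.
    waves = math.ceil(images_per_req / max(num_encoders, 1))
    return waves * encode_ms_per_image + prefill_ms


# e.g. 10 images at 300 ms each, 200 ms prefill:
print(approx_ttft_ms(300, 10, 1, 200))  # 3200 -> roughly the colocated case
print(approx_ttft_ms(300, 10, 6, 200))  # 800  -> six dedicated encoders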

@mickqian (Collaborator) commented Dec 9, 2025

/rerun-failed-ci

@ShangmingCai (Collaborator) commented Dec 9, 2025

/rerun-failed-ci 2

@mickqian (Collaborator) commented

Please stop merging new commits here; let's see what happens.

@ShangmingCai (Collaborator) commented

/rerun-failed-ci

Copilot AI review requested due to automatic review settings December 12, 2025 06:02
Copilot AI left a comment

Pull request overview

This PR introduces support for Encoder-Prefill-Decode (EPD) disaggregation, enabling separate servers for vision encoding, prefill, and decode operations in multimodal language models. This architecture allows for better resource utilization and performance scaling for vision-language models.

  • Adds --encoder-only and --language-only flags to launch dedicated encoder and language-model-only servers
  • Implements three transfer backends (zmq_to_scheduler, zmq_to_tokenizer, mooncake) for embedding communication
  • Introduces MMReceiver component for handling multimodal embeddings across disaggregated instances

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 23 comments.

Show a summary per file
File Description
test/srt/test_epd_disaggregation.py Adds comprehensive tests for EPD disaggregation with single and multiple encoder configurations
test/srt/run_suite.py Registers new EPD disaggregation test with 600s timeout
python/sglang/srt/server_args.py Adds new CLI arguments for encoder disaggregation (encoder-only, language-only, encoder-urls, encoder-transfer-backend) with validation
python/sglang/srt/multimodal/processors/qwen_vl.py Adds get_mm_data method to build multimodal data from precomputed embeddings
python/sglang/srt/multimodal/processors/dots_vlm.py Renames token ID attributes from IM_START_ID/IM_END_ID to IM_START_TOKEN_ID/IM_END_TOKEN_ID for consistency
python/sglang/srt/multimodal/processors/base_processor.py Adds base methods for building input IDs and processing multimodal data from embeddings
python/sglang/srt/models/qwen3_vl_moe.py Adds encoder-only/language-only weight loading support to skip unnecessary model components
python/sglang/srt/models/qwen3_vl.py Conditionally initializes language model components based on encoder-only mode
python/sglang/srt/models/qwen2_5_vl.py Reorders initialization to handle encoder-only mode without loading language model weights
python/sglang/srt/models/dots_vlm.py Adds encoder-only mode support to skip language model initialization
python/sglang/srt/managers/tokenizer_manager.py Integrates MMReceiver for handling embeddings in zmq_to_tokenizer/mooncake backends
python/sglang/srt/managers/scheduler.py Integrates MMReceiver for handling embeddings in zmq_to_scheduler backend
python/sglang/srt/managers/mm_utils.py Enhances precomputed embedding handling with chunked prefill support
python/sglang/srt/managers/io_struct.py Adds fields for tracking embedding ports and image waiting status
python/sglang/srt/disaggregation/encode_server.py Implements dedicated encode server with FastAPI endpoints for encoding and sending embeddings
python/sglang/srt/disaggregation/encode_receiver.py Implements MMReceiver class for receiving embeddings from encode servers
python/sglang/srt/configs/model_config.py Adds encoder_only and language_only configuration parameters
python/sglang/launch_server.py Routes to encode_server when encoder-only mode is specified
python/sglang/bench_serving.py Adds random image count feature for benchmarking variable multimodal workloads
Comments suppressed due to low confidence (1)

python/sglang/srt/models/qwen2_5_vl.py:680

  • The condition on line 676 checks hasattr(self, "model") before accessing self.model.start_layer. However, in encoder-only mode, self.model is not initialized (as seen in qwen2_5_vl.py lines 481-503). While this check prevents the AttributeError, the logic should be clearer about when model exists. Consider restructuring to check self.config.encoder_only first.
                if (
                    layer_id is not None
                    and hasattr(self, "model")
                    and hasattr(self.model, "start_layer")
                    and (
                        layer_id < self.model.start_layer
                        or layer_id >= self.model.end_layer


@yhyang201 (Collaborator) commented

/tag-and-rerun-ci

@yhyang201 (Collaborator) commented

LGTM.
This PR contains a relatively large amount of code, so we also ran the Nightly Test: https://github.com/sgl-project/sglang/actions/runs/20193317653

I will continue to monitor the CI and Nightly Test results; if any issues arise, I will rerun them and keep an eye on the outcomes.

@yhyang201 (Collaborator) commented

PR CI:

This PR includes two key tests:

In the second-to-last commit, test_epd_disaggregation.py encountered an issue, while all other tests passed:
https://github.com/sgl-project/sglang/actions/runs/20193135324/job/57995148367

The final commit fixes test/srt/test_epd_disaggregation.py, involving only a parameter-loading–related change. After this fix, all tests passed except for unit-test-backend-8-gpu-h200. Notably, this test passed in the second-to-last commit, so I believe the recent change should not affect its result. The current failure is more likely due to CI environment instability:
https://github.com/sgl-project/sglang/actions/runs/20202349810/job/58003156153?pr=12263

Nightly CI:

nightly-test-vlm-accuracy-2-gpu-runner: Passed
nightly-test-vlm-perf-2-gpu-runner: Passed

https://github.com/sgl-project/sglang/actions/runs/20193317653/job/57995164803

nightly-test-multimodal-server-2-gpu (1): failed, but this PR is unrelated to SGL-Diffusion, so the failure should not be related to this PR.
nightly-test-general-1-gpu-runner: nightly/test_lora_eviction_policy.py failed. This test has been failing in several recent nightly CI runs, so the failure is considered unrelated to this PR.


In summary, the remaining CI failures are most likely due to CI instability or factors unrelated to this PR. The code changes and relevant tests for this PR are in good shape, and I believe the PR meets the criteria for merging.

@mickqian merged commit 9acb21a into sgl-project:main on Dec 14, 2025 (344 of 373 checks passed)
@ShangmingCai (Collaborator) commented

Thanks for the review. If any bug is reported by our production team, we will fix it with @gty111 @liusy58 @ZhengWG ASAP.
