feat: support EPD disaggregation #12263
Conversation
Summary of Changes

Hello @gty111, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant architectural change by implementing Encode Prefill Disaggregation (EPD), which separates the computationally intensive multimodal encoding process from language model inference so that specialized servers can handle image processing. The system now supports dedicated encode servers that process visual inputs and transmit the resulting embeddings to language-only prefill servers, improving efficiency and scalability for multimodal large language models (MLLMs).
Code Review
This pull request implements an Encode-Prefill-Decode (EPD) disaggregation strategy by introducing a separate server for multimodal encoding. The changes are extensive, touching server launch logic, configuration, model loading, and adding a new encoder server. While the overall approach is sound, there are a few critical issues to address. A key problem is that the `EmbeddingData` class is defined separately in the new encoder server and the tokenizer manager rather than shared, which will cause runtime failures. Additionally, the encoder server has hardcoded model-specific logic, limiting its extensibility. I've provided detailed comments on these points and a suggestion to improve code readability.
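For illustration, a minimal sketch of the shared-definition fix the review points at; the module path and field names here are hypothetical, not the PR's actual schema:

```python
# Hypothetical shared module, e.g. sglang/srt/disaggregation/embedding_data.py.
# Both the encode server and the tokenizer manager would import this single
# definition instead of each declaring their own EmbeddingData, so serialized
# instances deserialize to the same type on both sides.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class EmbeddingData:
    req_id: str                    # request the embedding belongs to
    embedding: torch.Tensor        # precomputed multimodal embedding
    mm_hash: Optional[str] = None  # optional content hash for caching/dedup
```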
ShangmingCai
left a comment
Changes look clean. Will finish the first round of review this week.
ShangmingCai
left a comment
QQ: So we assume PD Disaggregation is enabled by default in this version? I thought we discussed that it is basically an implementation of Encoder DP, which I think should also work when Encoder is disaggregated while Prefill and Decode are not.
Now we support E + PD colocate based on EPD disaggregation. The usage is similar; a launch sketch follows below.
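For illustration, a hedged sketch of launching one vision-only encode instance plus a language-only instance that consumes its embeddings, using the flags this PR adds. The model path, ports, `--encoder-urls` format, and backend choice are assumptions:

```python
# Hypothetical E + PD colocate launch: one encode server plus one
# language-only server fed by it. Ports and URL format are illustrative.
import subprocess

MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"

encode = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--encoder-only",          # vision-only encode instance
    "--port", "30100",
])

pd = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--language-only",         # skip loading the vision tower
    "--encoder-urls", "http://127.0.0.1:30100",
    "--encoder-transfer-backend", "zmq_to_scheduler",
    "--port", "30000",
])
```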
Force-pushed from 27bc02b to af35caf
Have you tested with larger model sizes? I noticed that the embedding data is transmitted via TCP, which could be time-consuming if the embedding data is relatively large.

I see in your benchmark that 1P1D uses 2 cards while 1P1D6E uses 8 cards, yet TTFT only decreased by about 50 ms. Am I right?
The TCP transport is the initial workaround. Next we can use NIXL or Mooncake to transmit the embeddings.
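For context, a minimal sketch of what TCP-based embedding transfer looks like with pyzmq; the PUSH/PULL pattern, message schema, and port are assumptions rather than the PR's actual wire format. Pickling a large tensor via `send_pyobj` is exactly where the cost discussed above comes from:

```python
import torch
import zmq

ctx = zmq.Context.instance()

# Receiver side (language-only instance): bind and wait for embeddings.
pull = ctx.socket(zmq.PULL)
pull.bind("tcp://127.0.0.1:5555")

# Sender side (encode server): push the embedding tensor over TCP.
# send_pyobj pickles the payload, so a large embedding means a large message.
push = ctx.socket(zmq.PUSH)
push.connect("tcp://127.0.0.1:5555")
embedding = torch.randn(1024, 3584)  # e.g. image tokens x hidden size
push.send_pyobj({"req_id": "req-0", "embedding": embedding})

msg = pull.recv_pyobj()
print(msg["req_id"], tuple(msg["embedding"].shape))
```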
The effectiveness of encoder disaggregation depends on the number of images per request and the number of tokens generated per image. By enabling multiple encoders to process images in parallel, the encoding latency can be reduced compared to the colocated setup. The improvement in QPS, however, depends on the test dataset and the configuration described above.
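As a back-of-the-envelope illustration of that dependence (all numbers purely hypothetical):

```python
# Toy latency model: with k images per request, per-image encode time t_img,
# and n parallel encoders, encode latency drops roughly from k * t_img to
# ceil(k / n) * t_img (ignoring transfer, batching, and queueing effects).
import math

t_img = 0.05  # hypothetical seconds to encode one image
for k in (1, 4, 10):
    colocated = k * t_img
    for n in (1, 6):
        disagg = math.ceil(k / n) * t_img
        print(f"images={k} encoders={n}: "
              f"colocated={colocated:.2f}s disaggregated={disagg:.2f}s")
```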
/rerun-failed-ci

/rerun-failed-ci 2

Please stop merging new commits here; let's see what happens.
/rerun-failed-ci |
Pull request overview
This PR introduces support for Encoder-Prefill-Decode (EPD) disaggregation, enabling separate servers for vision encoding, prefill, and decode operations in multimodal language models. This architecture allows for better resource utilization and performance scaling for vision-language models.
- Adds `--encoder-only` and `--language-only` flags to launch dedicated encoder and language-model-only servers
- Implements three transfer backends (`zmq_to_scheduler`, `zmq_to_tokenizer`, `mooncake`) for embedding communication
- Introduces `MMReceiver` component for handling multimodal embeddings across disaggregated instances (see the dispatch sketch after this list)
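To make the tokenizer-level versus scheduler-level split concrete, a hypothetical sketch of how the receive side might be dispatched by backend. The backend names come from the PR; the wiring and the `MMReceiver` stub are assumptions, not the PR's code:

```python
class MMReceiver:
    """Stub standing in for sglang/srt/disaggregation/encode_receiver.MMReceiver."""

    def __init__(self, backend: str):
        self.backend = backend


def attach_mm_receiver(backend: str, tokenizer_manager, scheduler) -> None:
    # zmq_to_scheduler delivers embeddings while requests wait in the
    # scheduler; the other two backends deliver them during tokenization.
    if backend == "zmq_to_scheduler":
        scheduler.mm_receiver = MMReceiver(backend)
    elif backend in ("zmq_to_tokenizer", "mooncake"):
        tokenizer_manager.mm_receiver = MMReceiver(backend)
    else:
        raise ValueError(f"unknown encoder transfer backend: {backend}")
```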
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| test/srt/test_epd_disaggregation.py | Adds comprehensive tests for EPD disaggregation with single and multiple encoder configurations |
| test/srt/run_suite.py | Registers new EPD disaggregation test with 600s timeout |
| python/sglang/srt/server_args.py | Adds new CLI arguments for encoder disaggregation (encoder-only, language-only, encoder-urls, encoder-transfer-backend) with validation |
| python/sglang/srt/multimodal/processors/qwen_vl.py | Adds get_mm_data method to build multimodal data from precomputed embeddings |
| python/sglang/srt/multimodal/processors/dots_vlm.py | Renames token ID attributes from IM_START_ID/IM_END_ID to IM_START_TOKEN_ID/IM_END_TOKEN_ID for consistency |
| python/sglang/srt/multimodal/processors/base_processor.py | Adds base methods for building input IDs and processing multimodal data from embeddings |
| python/sglang/srt/models/qwen3_vl_moe.py | Adds encoder-only/language-only weight loading support to skip unnecessary model components |
| python/sglang/srt/models/qwen3_vl.py | Conditionally initializes language model components based on encoder-only mode |
| python/sglang/srt/models/qwen2_5_vl.py | Reorders initialization to handle encoder-only mode without loading language model weights |
| python/sglang/srt/models/dots_vlm.py | Adds encoder-only mode support to skip language model initialization |
| python/sglang/srt/managers/tokenizer_manager.py | Integrates MMReceiver for handling embeddings in zmq_to_tokenizer/mooncake backends |
| python/sglang/srt/managers/scheduler.py | Integrates MMReceiver for handling embeddings in zmq_to_scheduler backend |
| python/sglang/srt/managers/mm_utils.py | Enhances precomputed embedding handling with chunked prefill support |
| python/sglang/srt/managers/io_struct.py | Adds fields for tracking embedding ports and image waiting status |
| python/sglang/srt/disaggregation/encode_server.py | Implements dedicated encode server with FastAPI endpoints for encoding and sending embeddings |
| python/sglang/srt/disaggregation/encode_receiver.py | Implements MMReceiver class for receiving embeddings from encode servers |
| python/sglang/srt/configs/model_config.py | Adds encoder_only and language_only configuration parameters |
| python/sglang/launch_server.py | Routes to encode_server when encoder-only mode is specified |
| python/sglang/bench_serving.py | Adds random image count feature for benchmarking variable multimodal workloads |
Comments suppressed due to low confidence (1)
python/sglang/srt/models/qwen2_5_vl.py:680
- The condition on line 676 checks `hasattr(self, "model")` before accessing `self.model.start_layer`. However, in encoder-only mode, `self.model` is not initialized (as seen in qwen2_5_vl.py lines 481-503). While this check prevents the AttributeError, the logic should be clearer about when `model` exists. Consider restructuring to check `self.config.encoder_only` first.
```python
if (
    layer_id is not None
    and hasattr(self, "model")
    and hasattr(self.model, "start_layer")
    and (
        layer_id < self.model.start_layer
        or layer_id >= self.model.end_layer
    )
):
    ...  # body truncated in the review snippet
```
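One way to make that explicit, roughly along the lines the review suggests (a sketch, not the PR's code; `encoder_only` on the config is the attribute the review names):

```python
def layer_out_of_range(module, layer_id) -> bool:
    # Sketch of the restructured check. `module` stands in for the model
    # object (`self` in the PR's code). Testing encoder_only first documents
    # why self.model may not exist instead of hiding it behind hasattr.
    if layer_id is None or getattr(module.config, "encoder_only", False):
        # Encoder-only mode never builds the language model.
        return False
    return (
        layer_id < module.model.start_layer
        or layer_id >= module.model.end_layer
    )
```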
/tag-and-rerun-ci

LGTM. I will continue to monitor the CI and Nightly Test results; if any issues arise, I will rerun them and keep an eye on the outcomes.
PR CI: This PR includes two key tests: […]. In the second-to-last commit, […]. The final commit fixes […].

Nightly CI:
https://github.com/sgl-project/sglang/actions/runs/20193317653/job/57995164803

In summary, the remaining CI failures are most likely due to CI instability or factors unrelated to this PR. The code changes and relevant tests for this PR are in good shape, and I believe the PR meets the criteria for merging.
Collaboration with @liusy58, @ZhengWG and @ShangmingCai
Motivation
Related issues: #8223, #11355
Modifications
- Add a language-only model prefill instance (`--language-only`) and a vision-only model encode instance (`--encoder-only`).
- Implement a dedicated encode server for `--encoder-only` (`sglang/srt/disaggregation/encode_server.py`).
- Adapt model initialization and weight loading for `--language-only` (`sglang/srt/models/qwen2_5_vl.py`).
- Add an argument (`--encoder-urls`) for launching the language-only prefill instance with a list of encode server URLs.
- Wire `MMReceiver` (`sglang/srt/disaggregation/encode_receiver`) into either `TokenizerManager._tokenize_one_request` or `Scheduler.process_input_requests` (`sglang/srt/managers/scheduler`) so it can receive embeddings.
- Support three embedding transfer backends (`--encoder-transfer-backend`): `zmq_to_scheduler`, `zmq_to_tokenizer`, and `mooncake`.
  - `zmq_to_scheduler` receives embeddings at the scheduler level.
  - `zmq_to_tokenizer` and `mooncake` receive embeddings at the tokenizer level.

Accuracy Tests
Qwen/Qwen2.5-VL-7B-Instruct
Benchmarking and Profiling
Qwen/Qwen2.5-VL-7B-Instruct, 32 reqs, 0.1 reqs/s, MMMU
Each prefill/decode/encode instance uses one GPU.
Extended from one image per request to ten images per request.
Original resolution
Resized to 1920 × 1080
*: Further eliminates the overhead of preprocessing.
**: After refactoring to `MMReceiver`.

Current benchmark results (random 1-8 images per request), updated on 12.4:
Qwen3-VL-235B-A22B (FP8) H20
Qwen3-VL-30B-A3B H100
Qwen2.5-VL-7B H100