Conversation

@ZhengWG (Contributor) commented Nov 5, 2025

Motivation

Introduce the Encode/Language disaggregation architecture so multimodal workloads can scale vision encoders and language decoders independently, unblocking high-QPS deployments.

Modifications

  1. Adapted Qwen2_5_VLForConditionalGeneration to support a vision-only mode, and adapted the Qwen2/3 models and related configs so the encode plane can emit multimodal embeddings while the language plane loads only the LM weights.
  2. Implemented mooncake-based connectors for the multimodal data flow between the disaggregated vision and language nodes, largely following the existing PD-disaggregation design.
  3. Implemented dedicated event loops for the encode and language scheduler processes (a minimal sketch is shown below).

The inference flow is as follows:

(inference flow diagram)

For detailed design, please refer to: vlm_disaggregation.md
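
As a rough illustration of item 3 above, the sketch below shows how the encode and language planes could each run their own scheduling loop, with the encode plane pushing multimodal embeddings through a connector and the language plane consuming them before prefill. All names here (EmbeddingConnector, encode_event_loop, language_event_loop) are hypothetical and heavily simplified; the real implementation is the one in this PR and in vlm_disaggregation.md.

# Hypothetical, simplified sketch of the encode/language scheduling split.
# Names and interfaces are illustrative only and do not mirror the PR code.
from dataclasses import dataclass
from queue import Queue

import torch


@dataclass
class EncodedRequest:
    request_id: str
    mm_embedding: torch.Tensor  # multimodal embedding from the vision encoder
    text_prompt: str


class EmbeddingConnector:
    """Stand-in for the mooncake-based connector that ships embeddings
    from the encode plane to the language plane."""

    def __init__(self):
        self._channel = Queue()

    def send(self, item: EncodedRequest) -> None:
        self._channel.put(item)

    def recv(self) -> EncodedRequest:
        return self._channel.get()


def encode_event_loop(requests, vision_encoder, connector: EmbeddingConnector):
    # Encode plane: run only the vision encoder, then hand off embeddings.
    for req_id, image, prompt in requests:
        embedding = vision_encoder(image)
        connector.send(EncodedRequest(req_id, embedding, prompt))


def language_event_loop(connector: EmbeddingConnector, language_model, num_requests: int):
    # Language plane: consume embeddings and run prefill/decode with LM weights only.
    outputs = {}
    for _ in range(num_requests):
        item = connector.recv()
        outputs[item.request_id] = language_model.generate(
            prompt=item.text_prompt, mm_embedding=item.mm_embedding
        )
    return outputs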

Accuracy Tests

Benchmarking and Profiling

Launch the Bootstrap Server

# Launch the bootstrap server on a control node
python3 -m sglang.srt.disaggregation.mini_lb --host $HOST_IP \
    --port $SERVER_PORT --vision http://${EMBEDDING_IP}:${EMBEDDING_PORT} \
    --prefill http://${LANGUAGE_IP_LIST[0]}:${LANGUAGE_PORT} \
    --enable-multimodal-disagg

Launch the Encode Service

export PORT=8001
export SGLANG_VLM_CACHE_SIZE_MB=40960
export TENSOR_PARALLEL_SIZE=2
export CHUNKED_PREFILL_SIZE=81920
export MAX_RUNNING_REQUESTS=128
export MEM_FRACTION_STATIC=0.85
export SGLANG_EMBEDDING_CACHE_BUFFER_SIZE=128
export SGLANG_EMBEDDING_CACHE_BLOCK_SIZE=16384
# Encode: vision encoding
python3 -m sglang.launch_server --model-path ${MODEL_PATH} --enable-torch-compile --max-prefill-tokens $CHUNKED_PREFILL_SIZE \
        --host $HOST_IP --port $PORT --trust-remote-code --tp-size ${TENSOR_PARALLEL_SIZE} --mem-fraction-static ${MEM_FRACTION_STATIC} \
        --enable-cache-report --log-level info --max-running-requests ${MAX_RUNNING_REQUESTS} \
        --chunked-prefill-size ${CHUNKED_PREFILL_SIZE} --attention-backend fa3 --json-model-override-args '{"is_multimodal_embedding": true}' \
        --mm-attention-backend fa3 --disaggregation-mode encode

Launch the Language Service

export PORT=8002
export TENSOR_PARALLEL_SIZE=8
export MAX_RUNNING_REQUESTS=128
export SGLANG_EMBEDDING_CACHE_BUFFER_SIZE=128
export SGLANG_EMBEDDING_CACHE_BLOCK_SIZE=16384
export MEM_FRACTION_STATIC=0.85
export CHUNKED_PREFILL_SIZE=8192
# Configure the default buffer allocation
export SGLANG_EMBEDDING_DEFAULT_ALLOCATE_BUFFER_SIZE=16384

# Language: text generation
# For Qwen2.5-VL, override with '{"architectures": ["Qwen2ForCausalLM"]}' instead
python3 -m sglang.launch_server --model-path ${MODEL_PATH} --enable-torch-compile --disable-radix-cache \
        --host $HOST_IP --port $PORT --trust-remote-code --tp-size ${TENSOR_PARALLEL_SIZE} --served-model-name "qwen3-vl" \
        --enable-cache-report --log-level info --max-running-requests ${MAX_RUNNING_REQUESTS} --json-model-override-args '{"architectures": ["Qwen3MoeForCausalLM"]}' \
        --mem-fraction-static ${MEM_FRACTION_STATIC} --chunked-prefill-size ${CHUNKED_PREFILL_SIZE} --attention-backend fa3 \
        --disaggregation-mode language
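
Once the bootstrap server and both planes are up, a quick smoke test can be sent through the OpenAI-compatible chat endpoint before running the full benchmark. The host, port, and image URL below are placeholders and must be adjusted to the actual deployment; this is only a sanity-check sketch, not part of the PR.

# Hypothetical smoke test against the OpenAI-compatible chat endpoint.
import requests

url = "http://<HOST_IP>:<SERVER_PORT>/v1/chat/completions"  # entry point of this deployment (assumption)
payload = {
    "model": "qwen3-vl",  # matches --served-model-name on the language plane
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 128,
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])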

Benchmark Requests

python3 -m sglang.bench_serving \
                    --host ${HOST_IP} \
                    --port ${PORT} \
                    --model $MODEL_PATH \
                    --backend sglang-oai-chat \
                    --dataset-name "image" \
                    --random-input-len $input_len \
                    --random-output-len $output_len \
                    --random-range-ratio 1 \
                    --num-prompts $num_prompt \
                    --warmup-requests 0 \
                    --flush-cache \
                    --image-count 1 \
                    --image-resolution $image_size \
                    --image-format "jpeg" \
                    --image-content "random" \
                    --request-rate $qps \
                    --output-file $result_file \
                    --max-concurrency 128

Performance Metrics

SLA: mean TTFT < 4 s; mean TPOT < 100 ms.

Single Node (TP=8)

| Model | Scenario | qps/gpu | TTFT (Mean) | TPOT (Mean) |
| --- | --- | --- | --- | --- |
| qwen2.5-vl-72B | Single image + text (2000×2000 + 1k), output 300, qps: 0.52 | 0.06625 req/s/gpu | 3826.07 ms | 78.63 ms |
| qwen3-vl-235B-A22B | Single image + text (2000×2000 + 1k), output 300, qps: 0.85 | 0.1075 req/s/gpu | 3738.24 ms | 91.24 ms |

Disaggregated Deployment

| Configuration | Encode Plane | Language Plane |
| --- | --- | --- |
| GPU model | NVIDIA H20 96GB | NVIDIA H20 96GB |
| GPU count | 2 (TP = 2) | 8 (TP = 8) |

| Model | Scenario | qps/gpu | TTFT (Mean) | TPOT (Mean) |
| --- | --- | --- | --- | --- |
| qwen2.5-vl-72B | Single image + text (2000×2000 + 1k), output 300, qps: 0.78 | 0.07420 (+12%) req/s/gpu | 3632.70 ms | 95.61 ms |
| qwen3-vl-235B-A22B | Single image + text (2000×2000 + 1k), output 300, qps: 1.40 | 0.141 (+31.2%) req/s/gpu | 3831.58 ms | 95.34 ms |

Checklist


@ZhengWG (Contributor, Author) commented Nov 5, 2025

@yuan-luo Can you help review it?

yuan-luo self-assigned this Nov 5, 2025

@yuan-luo (Collaborator) commented Nov 5, 2025

Refer to the architecture of EPD:
(EPD architecture diagram)

EPDServe implements the Encode-Prefill-Decode (EPD) Disaggregation architecture proposed in the EPD paper. It is designed to serve large multimodal models (LMMs) efficiently by splitting the inference pipeline into three independent stages:
🎨 Encoding Stage: Processes multimodal inputs (images, audio, video)
🧠 Context (Prefill) Stage: Handles prompt token prefill
🔁 Decoding Stage: Performs autoregressive token generation

https://github.com/vbdi/epdserve
