Conversation

@ZhengWG (Contributor) commented Nov 5, 2025

Motivation

Introduce the Encode/Language disaggregation architecture so multimodal workloads can scale vision encoders and language decoders independently, unblocking high-QPS deployments.

Modifications

  1. Adapted Qwen2_5_VLForConditionalGeneration to support a vision-only mode, and adapted the Qwen2/3 models and related configs so the encode plane can emit multimodal embeddings while the language plane loads only the LM weights.
  2. Implemented mooncake-based connectors for the multimodal data flow between the disaggregated vision and language nodes, largely following the existing PD-disaggregation design.
  3. Implemented dedicated event loops for the encode and language scheduler processes (a minimal sketch is shown below).

The inference flow is as follows:

(inference flow diagram)

For detailed design, please refer to: vlm_disaggregation.md
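
As a rough illustration of item 3 above, the sketch below shows how the encode and language planes could each run their own scheduling loop, with the encode plane pushing multimodal embeddings through a connector and the language plane consuming them before prefill. All names here (EmbeddingConnector, encode_event_loop, language_event_loop) are hypothetical and heavily simplified; the real implementation is the one in this PR and in vlm_disaggregation.md.

# Hypothetical, simplified sketch of the encode/language scheduling split.
# Names and interfaces are illustrative only and do not mirror the PR code.
from dataclasses import dataclass
from queue import Queue

import torch


@dataclass
class EncodedRequest:
    request_id: str
    mm_embedding: torch.Tensor  # multimodal embedding from the vision encoder
    text_prompt: str


class EmbeddingConnector:
    """Stand-in for the mooncake-based connector that ships embeddings
    from the encode plane to the language plane."""

    def __init__(self):
        self._channel = Queue()

    def send(self, item: EncodedRequest) -> None:
        self._channel.put(item)

    def recv(self) -> EncodedRequest:
        return self._channel.get()


def encode_event_loop(requests, vision_encoder, connector: EmbeddingConnector):
    # Encode plane: run only the vision encoder, then hand off embeddings.
    for req_id, image, prompt in requests:
        embedding = vision_encoder(image)
        connector.send(EncodedRequest(req_id, embedding, prompt))


def language_event_loop(connector: EmbeddingConnector, language_model, num_requests: int):
    # Language plane: consume embeddings and run prefill/decode with LM weights only.
    outputs = {}
    for _ in range(num_requests):
        item = connector.recv()
        outputs[item.request_id] = language_model.generate(
            prompt=item.text_prompt, mm_embedding=item.mm_embedding
        )
    return outputs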

Accuracy Tests

Benchmarking and Profiling

Launch the Bootstrap Server

# Launch the bootstrap server on a control node
python3 -m sglang.srt.disaggregation.mini_lb --host $HOST_IP \
    --port $SERVER_PORT --vision http://${EMBEDDING_IP}:${EMBEDDING_PORT} \
    --prefill http://${LANGUAGE_IP_LIST[0]}:${LANGUAGE_PORT} \
    --enable-multimodal-disagg

Launch the Encode Service

export PORT=8001
export SGLANG_VLM_CACHE_SIZE_MB=40960
export TENSOR_PARALLEL_SIZE=2
export CHUNKED_PREFILL_SIZE=81920
export MAX_RUNNING_REQUESTS=128
export MEM_FRACTION_STATIC=0.85
export SGLANG_EMBEDDING_CACHE_BUFFER_SIZE=128
export SGLANG_EMBEDDING_CACHE_BLOCK_SIZE=16384
# Encode: vision encoding
python3 -m sglang.launch_server --model-path ${MODEL_PATH} --enable-torch-compile --max-prefill-tokens $CHUNKED_PREFILL_SIZE \
        --host $HOST_IP --port $PORT --trust-remote-code --tp-size ${TENSOR_PARALLEL_SIZE} --mem-fraction-static ${MEM_FRACTION_STATIC} \
        --enable-cache-report --log-level info --max-running-requests ${MAX_RUNNING_REQUESTS} \
        --chunked-prefill-size ${CHUNKED_PREFILL_SIZE} --attention-backend fa3 --json-model-override-args '{"is_multimodal_embedding": true}' \
        --mm-attention-backend fa3 --disaggregation-mode encode

Launch the Language Service

export PORT=8002
export TENSOR_PARALLEL_SIZE=8
export MAX_RUNNING_REQUESTS=128
export SGLANG_EMBEDDING_CACHE_BUFFER_SIZE=128
export SGLANG_EMBEDDING_CACHE_BLOCK_SIZE=16384
export MEM_FRACTION_STATIC=0.85
export CHUNKED_PREFILL_SIZE=8192
# Configure the default buffer allocation
export SGLANG_EMBEDDING_DEFAULT_ALLOCATE_BUFFER_SIZE=16384

# Language: text generation
# For Qwen2.5-VL, override with '{"architectures": ["Qwen2ForCausalLM"]}' instead
python3 -m sglang.launch_server --model-path ${MODEL_PATH} --enable-torch-compile --disable-radix-cache \
        --host $HOST_IP --port $PORT --trust-remote-code --tp-size ${TENSOR_PARALLEL_SIZE} --served-model-name "qwen3-vl" \
        --enable-cache-report --log-level info --max-running-requests ${MAX_RUNNING_REQUESTS} --json-model-override-args '{"architectures": ["Qwen3MoeForCausalLM"]}' \
        --mem-fraction-static ${MEM_FRACTION_STATIC} --chunked-prefill-size ${CHUNKED_PREFILL_SIZE} --attention-backend fa3 \
        --disaggregation-mode language
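
Once the bootstrap server and both planes are up, a quick smoke test can be sent through the OpenAI-compatible chat endpoint before running the full benchmark. The host, port, and image URL below are placeholders and must be adjusted to the actual deployment; this is only a sanity-check sketch, not part of the PR.

# Hypothetical smoke test against the OpenAI-compatible chat endpoint.
import requests

url = "http://<HOST_IP>:<SERVER_PORT>/v1/chat/completions"  # entry point of this deployment (assumption)
payload = {
    "model": "qwen3-vl",  # matches --served-model-name on the language plane
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 128,
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])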

Benchmark Requests

python3 -m sglang.bench_serving \
                    --host ${HOST_IP} \
                    --port ${PORT} \
                    --model $MODEL_PATH \
                    --backend sglang-oai-chat \
                    --dataset-name "image" \
                    --random-input-len $input_len \
                    --random-output-len $output_len \
                    --random-range-ratio 1 \
                    --num-prompts $num_prompt \
                    --warmup-requests 0 \
                    --flush-cache \
                    --image-count 1 \
                    --image-resolution $image_size \
                    --image-format "jpeg" \
                    --image-content "random" \
                    --request-rate $qps \
                    --output-file $result_file \
                    --max-concurrency 128

Performance Metrics

SLA: mean TTFT < 4 s; mean TPOT < 100 ms.

Single Node (TP=8)

| Model | Scenario | qps/gpu | TTFT (Mean) | TPOT (Mean) |
| --- | --- | --- | --- | --- |
| qwen2.5-vl-72B | Single image + text (2000×2000 + 1k), output 300, qps: 0.52 | 0.06625 req/s/gpu | 3826.07 ms | 78.63 ms |
| qwen3-vl-235B-A22B | Single image + text (2000×2000 + 1k), output 300, qps: 0.85 | 0.1075 req/s/gpu | 3738.24 ms | 91.24 ms |

Disaggregated Deployment

| Configuration | Encode Plane | Language Plane |
| --- | --- | --- |
| GPU model | NVIDIA H20 96GB | NVIDIA H20 96GB |
| GPU count | 2 (TP = 2) | 8 (TP = 8) |

| Model | Scenario | qps/gpu | TTFT (Mean) | TPOT (Mean) |
| --- | --- | --- | --- | --- |
| qwen2.5-vl-72B | Single image + text (2000×2000 + 1k), output 300, qps: 0.78 | 0.07420 (+12%) req/s/gpu | 3632.70 ms | 95.61 ms |
| qwen3-vl-235B-A22B | Single image + text (2000×2000 + 1k), output 300, qps: 1.40 | 0.141 (+31.2%) req/s/gpu | 3831.58 ms | 95.34 ms |

Checklist


@ZhengWG (Contributor, Author) commented Nov 5, 2025

@yuan-luo Can you help review it?

yuan-luo self-assigned this Nov 5, 2025

@yuan-luo (Collaborator) commented Nov 5, 2025

Refer to the architecture of EPD:
(EPD architecture diagram)

EPDServe implements the Encode-Prefill-Decode (EPD) Disaggregation architecture proposed in the EPD paper. It is designed to serve large multimodal models (LMMs) efficiently by splitting the inference pipeline into three independent stages:
🎨 Encoding Stage: Processes multimodal inputs (images, audio, video)
🧠 Context (Prefill) Stage: Handles prompt token prefill
🔁 Decoding Stage: Performs autoregressive token generation

https://github.com/vbdi/epdserve
