
Conversation

@b8zhong (Collaborator) commented Nov 5, 2025

Before:

```bash
CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path moonshotai/Kimi-Linear-48B-A3B-Instruct --tp 2 --trust-remote-code
python3 -m sglang.test.send_one
```

```
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    2.366    |  459   |   1.000    |     193.98      |
+-------------+--------+------------+-----------------+
```

After:

```
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    2.230    |  459   |   1.000    |     205.84      |
+-------------+--------+------------+-----------------+
```

Around a 6% decode-throughput improvement (193.98 → 205.84 token/s; 205.84 / 193.98 ≈ 1.06).

```bash
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319
```

```
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:52<00:00, 25.15it/s]
Accuracy: 0.895
Invalid: 0.000
Latency: 52.663 s
Output throughput: 2460.908 token/s
```


@b8zhong requested a review from Copilot November 5, 2025 01:52
@b8zhong added the run-ci label Nov 5, 2025
Copilot AI left a comment

Pull Request Overview

This PR implements dual-stream CUDA execution for the Kimi Linear model's MoE (Mixture of Experts) layer to improve performance by overlapping computation of shared experts and routed experts on separate CUDA streams. The implementation mirrors the pattern used in other MoE models like DeepSeek V2, GLM4 MoE, Qwen2 MoE, and Bailing MoE.

  • Adds dual-stream support with an alternative CUDA stream for parallelizing shared expert and routed expert computations
  • Applies dual-stream optimization only during CUDA graph capture mode for small batches (≤1024 tokens)
  • Creates a single shared alternative stream at the model level, passed down to the decoder layers and MoE modules (a minimal sketch of this pattern follows the list)
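
For readers who want to see the shape of the pattern, here is a minimal sketch of a fork/join overlap on two CUDA streams in PyTorch. All names (`DualStreamMoE`, `shared_experts`, `routed_experts`, `alt_stream`) are illustrative assumptions, not the identifiers actually used in this PR:

```python
# Minimal sketch of the dual-stream MoE overlap described above; names are
# hypothetical, not the PR's actual identifiers.
import torch
import torch.nn as nn


class DualStreamMoE(nn.Module):
    def __init__(self, shared_experts: nn.Module, routed_experts: nn.Module,
                 alt_stream: torch.cuda.Stream):
        super().__init__()
        self.shared_experts = shared_experts
        self.routed_experts = routed_experts
        # One alternative stream, created once at the model level and shared
        # by every decoder layer.
        self.alt_stream = alt_stream

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Overlap only while capturing a CUDA graph and only for small
        # batches, where neither branch saturates the GPU by itself.
        if (torch.cuda.is_current_stream_capturing()
                and hidden_states.shape[0] <= 1024):
            current = torch.cuda.current_stream()
            # Fork: alt_stream must wait until hidden_states is ready.
            self.alt_stream.wait_stream(current)
            with torch.cuda.stream(self.alt_stream):
                shared_out = self.shared_experts(hidden_states)
            routed_out = self.routed_experts(hidden_states)
            # Join: the default stream waits for the shared-expert branch
            # before the two outputs are combined.
            current.wait_stream(self.alt_stream)
        else:
            # Fallback: run both branches sequentially on the default stream.
            shared_out = self.shared_experts(hidden_states)
            routed_out = self.routed_experts(hidden_states)
        return shared_out + routed_out
```

A real implementation also has to manage cross-stream tensor lifetimes (e.g. via `Tensor.record_stream`) so the caching allocator does not reuse the memory of `hidden_states` while the alternative stream is still reading it; the sketch above omits that detail.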


