
Conversation

@b8zhong (Collaborator) commented Nov 5, 2025

Before:

```bash
CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path moonshotai/Kimi-Linear-48B-A3B-Instruct --tp 2 --trust-remote-code
python3 -m sglang.test.send_one
```

```
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    2.366    |  459   |   1.000    |     193.98      |
+-------------+--------+------------+-----------------+
```

After:

```
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    2.230    |  459   |   1.000    |     205.84      |
+-------------+--------+------------+-----------------+
```

Around a 6% decode-throughput improvement (193.98 → 205.84 token/s; 205.84 / 193.98 ≈ 1.06).

```bash
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319
```

```
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:52<00:00, 25.15it/s]
Accuracy: 0.895
Invalid: 0.000
Latency: 52.663 s
Output throughput: 2460.908 token/s
```


@b8zhong requested a review from Copilot November 5, 2025 01:52
@b8zhong added the run-ci label Nov 5, 2025
Copilot AI left a comment

Pull Request Overview

This PR implements dual-stream CUDA execution for the Kimi Linear model's MoE (Mixture of Experts) layer to improve performance by overlapping computation of shared experts and routed experts on separate CUDA streams. The implementation mirrors the pattern used in other MoE models like DeepSeek V2, GLM4 MoE, Qwen2 MoE, and Bailing MoE.

  • Adds dual-stream support with an alternative CUDA stream for parallelizing shared expert and routed expert computations
  • Applies dual-stream optimization only during CUDA graph capture mode for small batches (≤1024 tokens)
  • Creates a single shared alternative stream at the model level, passed down to the decoder layers and MoE modules (a minimal sketch of this pattern follows the list)
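
For readers who want to see the shape of the pattern, here is a minimal sketch of a fork/join overlap on two CUDA streams in PyTorch. All names (`DualStreamMoE`, `shared_experts`, `routed_experts`, `alt_stream`) are illustrative assumptions, not the identifiers actually used in this PR:

```python
# Minimal sketch of the dual-stream MoE overlap described above; names are
# hypothetical, not the PR's actual identifiers.
import torch
import torch.nn as nn


class DualStreamMoE(nn.Module):
    def __init__(self, shared_experts: nn.Module, routed_experts: nn.Module,
                 alt_stream: torch.cuda.Stream):
        super().__init__()
        self.shared_experts = shared_experts
        self.routed_experts = routed_experts
        # One alternative stream, created once at the model level and shared
        # by every decoder layer.
        self.alt_stream = alt_stream

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Overlap only while capturing a CUDA graph and only for small
        # batches, where neither branch saturates the GPU by itself.
        if (torch.cuda.is_current_stream_capturing()
                and hidden_states.shape[0] <= 1024):
            current = torch.cuda.current_stream()
            # Fork: alt_stream must wait until hidden_states is ready.
            self.alt_stream.wait_stream(current)
            with torch.cuda.stream(self.alt_stream):
                shared_out = self.shared_experts(hidden_states)
            routed_out = self.routed_experts(hidden_states)
            # Join: the default stream waits for the shared-expert branch
            # before the two outputs are combined.
            current.wait_stream(self.alt_stream)
        else:
            # Fallback: run both branches sequentially on the default stream.
            shared_out = self.shared_experts(hidden_states)
            routed_out = self.routed_experts(hidden_states)
        return shared_out + routed_out
```

A real implementation also has to manage cross-stream tensor lifetimes (e.g. via `Tensor.record_stream`) so the caching allocator does not reuse the memory of `hidden_states` while the alternative stream is still reading it; the sketch above omits that detail.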


