@b8zhong b8zhong commented Dec 20, 2025

Motivation

After flashinfer-ai/flashinfer#2131, FlashInfer supports SwapAB, which swaps the order of the GEMM inputs. This helps when the M dimension is < 32 (e.g., when BS < 32 in decoding). For larger M, there is no benefit.
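As a rough illustration of the idea (not FlashInfer's or DeepGEMM's actual kernel code), the sketch below shows the identity that input swapping relies on: `A @ B` equals the transpose of `B^T @ A^T`, so a GEMM with a small M can be computed with the operands exchanged and the result transposed back. The function names and the `SWAP_AB_M_THRESHOLD` constant are hypothetical; only the threshold value of 32 comes from the description above.

```python
# Hypothetical constant name; the value 32 is the M threshold quoted in this PR.
SWAP_AB_M_THRESHOLD = 32


def matmul(a, b):
    """Naive (M x K) @ (K x N) product on nested lists."""
    k = len(b)
    assert all(len(row) == k for row in a), "inner dimensions must match"
    n = len(b[0])
    return [[sum(row[i] * b[i][j] for i in range(k)) for j in range(n)] for row in a]


def transpose(x):
    """Transpose a rectangular nested list."""
    return [list(col) for col in zip(*x)]


def gemm_maybe_swapped(a, b):
    """Return (a @ b, used_swap), taking the swapped path only when M is small."""
    m = len(a)
    if m < SWAP_AB_M_THRESHOLD:
        # Swapped path: compute b^T @ a^T, which equals (a @ b)^T,
        # then transpose back so the caller sees the usual (M x N) layout.
        swapped = matmul(transpose(b), transpose(a))
        return transpose(swapped), True
    # Large M: keep the original operand order.
    return matmul(a, b), False
```

Both paths return the same matrix; in a real kernel the choice only changes which operand's dimension is mapped onto which tile dimension of the hardware instruction.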

Modifications

(Requires FlashInfer nightly; the backend currently supports only SM90.)
Note that FlashInfer compiles its own DeepGEMM, so it is separate from the DeepGEMM built in the Docker container.

Accuracy Tests

Benchmarking and Profiling

for ((N=1; N<=128; N*=2)); do
  python3 -m sglang.bench_serving \
    --backend sglang \
    --flush-cache \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --random-range-ratio 1.0 \
    --num-prompts $((6*N)) \
    --max-concurrency $N \
    --output-file res.jsonl
done
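To compare two such sweeps, the results file can be post-processed with a few lines of Python. This is a minimal sketch that assumes each line of `res.jsonl` is one JSON object per run and that the objects carry `max_concurrency` and `output_throughput` fields; those field names are assumptions and should be checked against the actual file. The helper names are hypothetical.

```python
import json


def parse_jsonl(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def metric_by_concurrency(rows, key="max_concurrency", metric="output_throughput"):
    """Map concurrency -> metric. The default field names are assumptions."""
    return {row[key]: row[metric] for row in rows if key in row and metric in row}


def relative_gain(baseline, candidate):
    """Percent change of candidate over baseline for each shared concurrency."""
    return {
        c: 100.0 * (candidate[c] - baseline[c]) / baseline[c]
        for c in baseline
        if c in candidate
    }
```

For example, after saving one sweep without the change and one with it to two separate files (the loop above always writes to `res.jsonl`, so the file would need to be renamed between sweeps), pass each open file to `parse_jsonl`, feed the rows to `metric_by_concurrency`, and hand both dictionaries to `relative_gain`.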

We can see that when the M dimension is small, there is around a 5-8% E2E benefit.

[Figures: two benchmark result screenshots]

Checklist
