[AMD] Add MTP support for DeepSeek R1 FP8 MI355X SGLang #628
benenzhu wants to merge 6 commits into InferenceMAX:main
- Add benchmark script benchmarks/dsr1_fp8_mi355x_mtp.sh with EAGLE speculative decoding
- Add dsr1-fp8-mi355x-sglang-mtp config entry to .github/configs/amd-master.yaml
- Update runners/launch_mi355x-amds.sh to support SPEC_SUFFIX for MTP script selection
- Add perf-changelog entry documenting the changes

Configurations: TP=8, concurrency 4-128 for 1k1k, 1k8k, and 8k1k sequence lengths
Image: lmsysorg/sglang:v0.5.8-rocm700-mi35x

Co-authored-by: Todd zhenchen@amd.com

/sweep test-config dsr1-fp8-mi355x-sglang-mtp --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml

@cquil11 Kicking off a sweep. Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/21689411588

/sweep test-config --config-keys dsr1-fp8-mi355x-sglang-mtp --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml

@cquil11 Kicking off a sweep. Run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/21689430926

Oh, it failed with some weird error at the graph capture stage. I will convert this to a draft and try to fix it locally. The use-chat-template config also needs some tuning.
Summary
This PR adds MTP (Multi-Token Prediction) support for DeepSeek R1 FP8 on MI355X using SGLang with EAGLE speculative decoding.
SGLang's EAGLE implementation uses DeepSeek R1's native MTP weights to speculatively decode 3 additional tokens per forward pass. This PR uses the default EAGLE configurations.
Changes
- Add benchmarks/dsr1_fp8_mi355x_mtp.sh with EAGLE speculative decoding configuration
- Add dsr1-fp8-mi355x-sglang-mtp config entry to .github/configs/amd-master.yaml
- Update runners/launch_mi355x-amds.sh to support SPEC_SUFFIX for MTP script selection
Configuration
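As a rough illustration of SPEC_SUFFIX-based script selection, the sketch below shows one way a runner could map a base name plus an optional suffix to a benchmark script path. The function name `select_benchmark_script`, its arguments, and the non-MTP script name are hypothetical. Only SPEC_SUFFIX and benchmarks/dsr1_fp8_mi355x_mtp.sh come from this PR, and the actual logic in runners/launch_mi355x-amds.sh may differ.

```shell
# Hypothetical sketch: pick the benchmark script from a base name and an
# optional speculative-decoding suffix. Not the actual runner code.
select_benchmark_script() {
  base="$1"                               # e.g. dsr1_fp8_mi355x
  spec_suffix="${2:-${SPEC_SUFFIX:-}}"    # e.g. _mtp, or empty for the non-MTP script
  printf 'benchmarks/%s%s.sh\n' "$base" "$spec_suffix"
}

# With the suffix set, the MTP script is selected.
select_benchmark_script dsr1_fp8_mi355x _mtp
```

Leaving the suffix empty falls back to the base script name, so existing non-MTP configs would not need to change under this scheme.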
New SGLang Arguments
- --speculative-algorithm
- --max-running-requests
- --cuda-graph-max-bs
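For orientation, here is a minimal sketch of how these three arguments might appear on an SGLang server command line. The helper `build_server_cmd`, the model path, and the choice to set both batch-size limits to the same value are assumptions for illustration, not the values in benchmarks/dsr1_fp8_mi355x_mtp.sh. The sketch only prints the command rather than launching a server.

```shell
# Hypothetical helper: assemble (but do not run) an SGLang launch command.
# All values are illustrative assumptions, not the PR's actual settings.
build_server_cmd() {
  model="$1"    # model path or Hugging Face repo id (assumed)
  tp="$2"       # tensor parallel size; the PR benchmarks TP=8
  max_bs="$3"   # assumed cap used for both request and CUDA-graph batch size
  printf '%s\n' "python3 -m sglang.launch_server --model-path $model --tp $tp --speculative-algorithm EAGLE --max-running-requests $max_bs --cuda-graph-max-bs $max_bs"
}

# Print the command for inspection.
build_server_cmd deepseek-ai/DeepSeek-R1 8 64
```

No additional speculative tuning flags are shown because the PR states it uses the default EAGLE configurations.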