
[WIP] [NV] update DSR1 SGLang MTP configs on single node B200#631

Open
zbpatel wants to merge 13 commits into main from zpatel/sglang_b200

@zbpatel
Collaborator

@zbpatel zbpatel commented Feb 4, 2026

This PR updates the NVIDIA B200 SGLang DSR1 agg configs and is a follow-up to #626. I branched from yunzhoul/update-sglang-mtp-configs because this PR needs some of the changes on that branch; I will clean up the git branching once that branch is merged.

@functionstackx
Contributor

@claude review plz

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 5, 2026

Claude finished @functionstackx's task in 1m 21s


PR Review Complete

  • Gather context and read changed files
  • Review .github/configs/nvidia-master.yaml changes
  • Review benchmarks/dsr1_fp8_b200_mtp.sh (new file)
  • Review perf-changelog.yaml changes
  • Review runners/launch_b200-dgxc.sh changes
  • Provide comprehensive review feedback

Summary

This PR adds MTP (Multi-Token Prediction) support for DeepSeek R1 FP8 on B200 using SGLang's EAGLE speculative decoding. It includes a new benchmark script, config updates, and launcher modifications.

Changes Overview

| File | Change |
|---|---|
| nvidia-master.yaml | Bump sglang image to v0.5.8-cu130, add new dsr1-fp8-b200-sglang-mtp config |
| dsr1_fp8_b200_mtp.sh | New benchmark script with EAGLE speculative decoding |
| perf-changelog.yaml | Document the new MTP config |
| launch_b200-dgxc.sh | Add SPEC_SUFFIX for MTP script selection |

Issues Found

🔴 High Priority

1. Incorrect PR link in perf-changelog.yaml

The pr-link references PR #626 but this is PR #631:

pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/626  # Should be #631


2. Inconsistency: --use-chat-template added to MTP script but not in non-MTP version

dsr1_fp8_b200_mtp.sh:93 adds the --use-chat-template flag, but the non-MTP dsr1_fp8_b200.sh does not use it. This could lead to inconsistent benchmark results between MTP and non-MTP runs. Is this intentional?

🟡 Medium Priority

3. Higher MAX_RUNNING_REQUESTS / CUDA_GRAPH_MAX_BATCH_SIZE values

The MTP script uses 512, versus 128 in the non-MTP script:

  • dsr1_fp8_b200_mtp.sh:44-45: MAX_RUNNING_REQUESTS=512, CUDA_GRAPH_MAX_BATCH_SIZE=512
  • dsr1_fp8_b200.sh:38-39: MAX_RUNNING_REQUESTS=128, CUDA_GRAPH_MAX_BATCH_SIZE=128

These values are consistent with the higher conc-end: 512 in the config (versus conc-end: 64 for non-MTP), and the comment says they should match the maximum concurrency. It is worth confirming that the memory requirements at these values have been validated.
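The consistency check described above can be sketched in a few lines of bash. This is a minimal sketch, not code from the PR: the function name `limits_cover_concurrency` is hypothetical, and the rule that both limits should be at least `conc-end` is an assumption drawn from the comment mentioned in the review.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): succeed only when both server-side
# limits are at least as large as the highest concurrency in the sweep.
limits_cover_concurrency() {
  local conc_end=$1 max_running=$2 graph_bs=$3
  [[ $max_running -ge $conc_end && $graph_bs -ge $conc_end ]]
}

# Values quoted in the review: MTP script uses 512/512 with conc-end 512.
if limits_cover_concurrency 512 512 512; then
  echo "limits of 512 cover conc-end=512"
fi

# The non-MTP limits (128/128) would not cover a sweep up to 512.
if ! limits_cover_concurrency 512 128 128; then
  echo "limits of 128 do not cover conc-end=512"
fi
```

A check like this only confirms that the numbers agree with each other; it says nothing about whether 512 concurrent requests fit in GPU memory, which still needs a real run.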

4. Changelog description says "concurrency 4-64" but config has "conc-end: 512"

perf-changelog.yaml states:

"Configurations: TP=8, EP=1, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"

But the actual config shows conc-start: 4, conc-end: 512:

- { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
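A mismatch like this one could be caught mechanically. The following is a rough sketch under stated assumptions: the helper `changelog_conc_range` is hypothetical, the description string and the `conc-start`/`conc-end` values are copied by hand from the lines quoted above, and a real check would read them from perf-changelog.yaml and nvidia-master.yaml with a YAML parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print "START-END" from the first
# "concurrency START-END" phrase in a changelog description.
changelog_conc_range() {
  local desc=$1
  if [[ $desc =~ concurrency[[:space:]]+([0-9]+)-([0-9]+) ]]; then
    echo "${BASH_REMATCH[1]}-${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

# Values copied from the review; hard-coded here for illustration only.
desc="Configurations: TP=8, EP=1, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
conc_start=4
conc_end=512

documented=$(changelog_conc_range "$desc")
if [[ $documented != "${conc_start}-${conc_end}" ]]; then
  echo "changelog says ${documented}, config says ${conc_start}-${conc_end}"
fi
```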


🟢 Low Priority / Suggestions

5. MTP script only supports TP=8, but this is consistent with the config

The script has a guard at lines 31-34:

if [[ $TP -ne 8 ]]; then
  echo "MTP only supports TP=8, got TP=$TP!"
  exit 1
fi

This is good and matches the config which only has TP=8 configurations. Just noting for awareness.

6. Non-MTP FP4 config also updated to v0.5.8

The dsr1-fp4-b200-sglang config was also updated from v0.5.6 to v0.5.8. This is expected for consistency, just noting it's included.


Code Quality

Good practices observed:

  • Clear variable naming and documentation in the benchmark script
  • Appropriate use of SPEC_SUFFIX pattern in launcher (consistent with existing FRAMEWORK_SUFFIX)
  • Environment variables properly checked at script start
  • Server readiness wait before benchmarking
  • Cleanup with graceful docker stop
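For readers unfamiliar with the suffix pattern mentioned in the list above, here is a minimal sketch of how a launcher might pick the benchmark script from the spec-decoding setting. The function names `spec_suffix` and `benchmark_script` are hypothetical, and the assumption that both scripts live under `benchmarks/` comes from the file list in this review; the actual logic in launch_b200-dgxc.sh may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the PR's launcher code): map the spec-decoding
# setting to a script-name suffix. Only "mtp" gets a suffix here.
spec_suffix() {
  local spec_decoding=${1:-none}
  if [[ $spec_decoding == "mtp" ]]; then
    echo "_mtp"
  else
    echo ""
  fi
}

# Compose the benchmark script path from a base name and the setting.
benchmark_script() {
  local base=$1 spec_decoding=${2:-none}
  echo "benchmarks/${base}$(spec_suffix "$spec_decoding").sh"
}

benchmark_script dsr1_fp8_b200 mtp    # prints benchmarks/dsr1_fp8_b200_mtp.sh
benchmark_script dsr1_fp8_b200 none   # prints benchmarks/dsr1_fp8_b200.sh
```

Keeping the suffix logic in one small function means a new speculative-decoding mode would need only one extra branch rather than changes at every call site.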

EAGLE speculative decoding config looks reasonable:

  • --speculative-num-steps 2
  • --speculative-num-draft-tokens 3
  • --speculative-eagle-topk 1
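One way to sanity-check these three values together is sketched below. It rests on an assumption that is not stated in the PR: with topk set to 1 the draft forms a single chain, so the number of draft tokens is expected to equal the number of steps plus one. The helper name `expected_draft_tokens` is hypothetical, and the sketch deliberately declines to compute anything for topk greater than 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR). Assumption: with topk=1 the draft
# is a single chain, so expected draft tokens = num_steps + 1.
expected_draft_tokens() {
  local num_steps=$1 topk=$2
  if [[ $topk -ne 1 ]]; then
    return 1   # tree drafting: no single expected value in this sketch
  fi
  echo $(( num_steps + 1 ))
}

# Flag values from the review: steps=2, draft tokens=3, topk=1.
num_steps=2
num_draft_tokens=3
topk=1

if [[ $(expected_draft_tokens "$num_steps" "$topk") -eq $num_draft_tokens ]]; then
  echo "draft-token count matches steps + 1"
fi
```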

Questions for the Author

  1. Is the --use-chat-template flag intentionally added only to the MTP version? If so, should it be backported to the non-MTP script for consistency?

  2. Have the memory requirements been validated for 512 concurrent requests with EAGLE speculative decoding?

  3. The PR description mentions this is branched from yunzhoul/update-sglang-mtp-configs - is that PR ([NV] add DSR1 SGLang MTP configs on single node B200 #626) now merged, or does this PR include changes from both?


Recommendation

Request changes - Please fix the pr-link and concurrency description in perf-changelog.yaml before merging.
Branch: zpatel/sglang_b200

@functionstackx functionstackx changed the title from "[WIP] [NV] Update B200 DSR1 Sglang Agg Configs" to "[WIP] [NV] update DSR1 SGLang MTP configs on single node B200" on Feb 5, 2026