
Conversation


@nv-guomingz nv-guomingz commented Sep 22, 2025

It's a mirror PR of #7848

Summary by CodeRabbit

  • Documentation
    • Added RTX 6000 Pro Blackwell Server Edition benchmarks, with per-network and per-hardware throughput tables.
    • Expanded FP4/FP8 model listings, including Llama 4 Scout, Qwen, and DeepSeek variants; replaced prior Llama 4 Maverick FP8 sections with Scout data.
    • Introduced ep_size flag and --ep usage in benchmark reproduction and options.
    • Updated examples to specify TP/PP and EP for MoE workflows.
  • Added MoE guidance and YAML snippets for attention data parallelism and padding.
    • Enhanced troubleshooting for KV cache tuning and memory considerations.
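The KV cache tuning mentioned above usually comes down to how much GPU memory the cache consumes. As a rough illustration (a back-of-the-envelope sketch using the standard KV-cache sizing formula, not code from the docs under review), the footprint can be estimated as:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: 2 tensors (K and V) per layer, each of
    shape [batch, kv_heads, seq_len, head_dim], at bytes_per_elem precision."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Example: a Llama-8B-like config (32 layers, 8 KV heads, head_dim 128)
# at 8K context, batch 1, FP16 KV cache:
size = kv_cache_bytes(32, 8, 128, 8192, 1, 2)
print(f"{size / 2**30:.2f} GiB")  # → 1.00 GiB
```

Scaling the batch size or context length scales this linearly, which is why the KV cache memory fraction is the first knob to tune when runs go out of memory.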

Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
@nv-guomingz nv-guomingz requested a review from a team as a code owner September 22, 2025 04:46

coderabbitai bot commented Sep 22, 2025

📝 Walkthrough

Walkthrough

Adds new RTX 6000 Pro Blackwell Server Edition benchmarks and tables; expands FP4/FP8 model listings; replaces Llama 4 Maverick FP8 with Llama 4 Scout data; documents new ep_size/--ep expert-parallelism flag; updates example commands (TP/PP/EP); adds MoE YAML snippets; updates troubleshooting for KV cache and memory.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Benchmark data additions (Blackwell)**<br>`docs/source/performance/perf-overview.md` | Added RTX 6000 Pro Blackwell Server Edition throughput benchmarks with PP/TP combinations; introduced dedicated tables per network/hardware. |
| **Model listings updates (FP4/FP8)**<br>`docs/source/performance/perf-overview.md` | Expanded FP4/FP8 model entries: added Qwen3-235B-A22B/30B, DeepSeek-R1-0528 (FP4/FP8), and multiple Llama/Qwen/DeepSeek variants. |
| **Llama 4 Scout migration**<br>`docs/source/performance/perf-overview.md` | Replaced Llama 4 Maverick FP8 sections with Llama 4 Scout; consolidated FP4/FP8 tables under Scout branding. |
| **Command syntax and flags**<br>`docs/source/performance/perf-overview.md` | Introduced `ep_size` and the `--ep` flag for expert parallelism; updated the Reproducing Results and Variables sections with EP usage. |
| **Example workflows (TP/PP/EP)**<br>`docs/source/performance/perf-overview.md` | Updated `trtllm-bench` examples to specify TP/PP and EP for MoE; included MoE-specific usage. |
| **MoE configuration snippets**<br>`docs/source/performance/perf-overview.md` | Added YAML snippets for attention data parallelism and padding in `llm_options.yml` for MoE. |
| **Troubleshooting updates**<br>`docs/source/performance/perf-overview.md` | Expanded guidance on KV cache percentage tuning and memory considerations for MoE and dense models. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant U as User
  participant D as Perf Docs
  participant B as trtllm-bench
  participant R as TRT-LLM Runtime
  participant G as GPU(s)

  U->>D: Read benchmark guidance and flags (TP/PP/EP)
  Note over D: New flag ep_size / --ep for expert parallelism

  U->>B: Run with --tp, --pp [--ep]
  B->>R: Configure execution plan (TP/PP[/EP])
  alt Dense model
    R->>G: Launch kernels with TP/PP
  else MoE model (EP enabled)
    R->>G: Route tokens to experts (EP), apply dropout/padding
  end
  R-->>B: Throughput/latency results
  B-->>U: Output benchmark tables/logs
  Note over U,R: Troubleshoot KV cache % / memory if needed
```
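The TP/PP/EP composition in the diagram can be illustrated with a small sanity check. This is a hypothetical helper, not a TRT-LLM API; the exact rule for how `ep_size` composes with TP/PP should be taken from the updated docs, and the constraints below are common-case assumptions:

```python
def check_parallel_mapping(world_size: int, tp: int, pp: int, ep: int = 1) -> bool:
    """Hypothetical sanity check for a TP x PP (x EP for MoE) GPU mapping.

    Assumes the common layout where TP and PP tile the full GPU count,
    and EP (expert parallelism, MoE only) must evenly divide it.
    """
    if tp * pp != world_size:
        raise ValueError(f"tp*pp={tp * pp} must equal world_size={world_size}")
    if world_size % ep != 0:
        raise ValueError(f"ep={ep} must evenly divide world_size={world_size}")
    return True

# An 8-GPU MoE run with TP4, PP2 and 4-way expert parallelism:
check_parallel_mapping(world_size=8, tp=4, pp=2, ep=4)
```

Dense models simply leave `ep` at its default of 1, matching the `alt` branch in the diagram.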

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

- **Title Check** ⚠️ Warning — The PR title "add initial v1.0 data without rtx 6000 blackwell se" is misleading: the changeset (`docs/source/performance/perf-overview.md`) explicitly adds RTX 6000 Pro Blackwell Server Edition benchmarks and related updates, so the "without rtx 6000 blackwell se" clause contradicts the actual changes. The title does reference adding v1.0 data but inaccurately describes a major aspect of the change. *Resolution:* Update the PR title to accurately reflect the main change and follow the repository title template, for example "[None][doc] Add initial v1.0 performance data including RTX 6000 Pro Blackwell SE"; remove the contradictory "without" clause and include a ticket/issue reference if applicable.
- **Description Check** ⚠️ Warning — The PR description contains only "It's a mirror PR of #7848" and does not follow the repository's required template; it is missing a Description detailing what and why, Test Coverage, and the PR Checklist, so reviewers cannot assess tests or impact from the body of this PR alone. *Resolution:* Replace the PR body with the repository template content: include a concise Description summarizing the changes and rationale, list relevant Test Coverage items, and complete the PR Checklist (or explain omissions); also note the original PR (#7848) and any relevant ticket IDs to aid review.
✅ Passed checks (1 passed)

- **Docstring Coverage** ✅ Passed — No functions found in the changes; docstring coverage check skipped.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/source/performance/perf-overview.md (3)

75-85: Fix MD058: add blank lines around tables.

markdownlint flags these tables; add a blank line before and after each to satisfy MD058.

Apply diffs like (pattern shown for each block):

```diff
 RTX 6000 Pro Blackwell Server Edition
+
 | Sequence Length (ISL/OSL) | **4 GPUs**<br/>TP2,PP2 (FP4) | **8 GPUs**<br/>TP4,PP2 (FP4) |
 ...
 | 5000/500 | 2,662 | |
+
```

Repeat the same before/after the tables starting at the listed ranges.

Also applies to: 100-112, 127-138, 153-165, 191-202, 217-225


101-101: Grammar: use singular “1 GPU”.

```diff
-| Sequence Length (ISL/OSL) | **1 GPUs**<br/>TP1,PP1 (FP4) | **2 GPUs**<br/>TP1,PP2 (FP4) | **4 GPUs**<br/>TP1,PP4 (FP4) | **8 GPUs**<br/>TP1,PP8 (FP4) |
+| Sequence Length (ISL/OSL) | **1 GPU**<br/>TP1,PP1 (FP4) | **2 GPUs**<br/>TP1,PP2 (FP4) | **4 GPUs**<br/>TP1,PP4 (FP4) | **8 GPUs**<br/>TP1,PP8 (FP4) |
```

63-63: Define acronyms (ISL/OSL/TP/PP/EP) and clarify ep_size scope.

  • Add a one‑liner early to decode acronyms; many tables use “ISL/OSL” before Variables are introduced.
  • For ep_size, briefly clarify it applies to MoE models only and how it composes with TP/PP.

Proposed insertion (place near the top, after “Throughput Measurements” intro paragraph):

Note: ISL = Input Sequence Length, OSL = Output Sequence Length, TP = Tensor Parallelism, PP = Pipeline Parallelism, EP = Expert Parallelism (for MoE models).

Also applies to: 268-269, 315-319, 343-347

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af3ea37 and 2f066da.

📒 Files selected for processing (1)
  • docs/source/performance/perf-overview.md (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/performance/perf-overview.md
🪛 markdownlint-cli2 (0.18.1)
docs/source/performance/perf-overview.md

- 76-76: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 101-101: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 128-128: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 154-154: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 192-192: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 218-218: Tables should be surrounded by blank lines (MD058, blanks-around-tables)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
docs/source/performance/perf-overview.md (4)

26-30: Confirm product naming and PR title intent for “RTX 6000 Pro Blackwell Server Edition”.

  • The doc adds “RTX 6000 Pro Blackwell Server Edition” data, while the PR title says “without rtx 6000 blackwell se”. Please confirm whether RTX 6000 Blackwell SE content should be included in v1.0 and align the PR title and section headings accordingly.
  • Also verify the official branding string (e.g., “NVIDIA RTX 6000 Blackwell”, “RTX 6000 Pro Blackwell”, “Server Edition”) and use it consistently across the Hardware list and tables.

Also applies to: 37-38, 75-75, 100-100, 127-127, 153-153, 191-191, 217-217


177-190: Scout vs Maverick: confirm coexistence or replacement.

Doc adds “Llama 4 Scout” and still contains “Llama 4 Maverick.” If Scout supersedes Maverick for FP8, either:

  • keep both with a short note explaining differences/use‑cases, or
  • remove Maverick to avoid confusion.

Also applies to: 191-202


46-49: Model lists vs tables: ensure consistency.

The FP4/FP8 model lists add Qwen3/DeepSeek entries; double‑check every listed model has corresponding tables (and vice versa), or add a brief note if some are intentionally omitted.

Also applies to: 58-59


351-351: Confirm YAML key enable_attention_dp — verified.

LLM args declare enable_attention_dp (tensorrt_llm/llmapi/llm_args.py: ~218 and Field at ~1169 with description "Enable attention data parallel") and the flag is referenced across CLI, tests and runtime code — keep the doc key as-is.
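For illustration, a fragment of such an `llm_options.yml` might look like the following. This is a sketch: only the `enable_attention_dp` key is confirmed by the verification above; the comment and surrounding layout are assumptions, and the actual snippet in the doc under review may differ.

```
# llm_options.yml — MoE-oriented extra options (illustrative layout)
# enable_attention_dp is declared in tensorrt_llm/llmapi/llm_args.py
# ("Enable attention data parallel").
enable_attention_dp: true
```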

@nv-guomingz nv-guomingz deleted the 1.0_perf branch September 22, 2025 04:55
