
Conversation


@nv-guomingz nv-guomingz commented Sep 22, 2025

It's a mirror PR of #7848

Summary by CodeRabbit

  • Documentation
    • Added RTX 6000 Pro Blackwell Server Edition benchmarks, with per-network and per-hardware throughput tables.
    • Expanded FP4/FP8 model listings, including Llama 4 Scout, Qwen, and DeepSeek variants; replaced prior Llama 4 Maverick FP8 sections with Scout data.
    • Introduced ep_size flag and --ep usage in benchmark reproduction and options.
    • Updated examples to specify TP/PP and EP for MoE workflows.
  • Added MoE guidance and YAML snippets for attention data parallelism and padding.
    • Enhanced troubleshooting for KV cache tuning and memory considerations.
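The KV cache tuning mentioned above usually comes down to how much GPU memory the cache consumes. As a rough illustration (a back-of-the-envelope sketch using the standard KV-cache sizing formula, not code from the docs under review), the footprint can be estimated as:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: 2 tensors (K and V) per layer, each of
    shape [batch, kv_heads, seq_len, head_dim], at bytes_per_elem precision."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Example: a Llama-8B-like config (32 layers, 8 KV heads, head_dim 128)
# at 8K context, batch 1, FP16 KV cache:
size = kv_cache_bytes(32, 8, 128, 8192, 1, 2)
print(f"{size / 2**30:.2f} GiB")  # → 1.00 GiB
```

Scaling the batch size or context length scales this linearly, which is why the KV cache memory fraction is the first knob to tune when runs go out of memory.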

Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
@nv-guomingz nv-guomingz requested a review from a team as a code owner September 22, 2025 04:46

coderabbitai bot commented Sep 22, 2025

📝 Walkthrough

Walkthrough

Adds new RTX 6000 Pro Blackwell Server Edition benchmarks and tables; expands FP4/FP8 model listings; replaces Llama 4 Maverick FP8 with Llama 4 Scout data; documents new ep_size/--ep expert-parallelism flag; updates example commands (TP/PP/EP); adds MoE YAML snippets; updates troubleshooting for KV cache and memory.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Benchmark data additions (Blackwell)**<br>`docs/source/performance/perf-overview.md` | Added RTX 6000 Pro Blackwell Server Edition throughput benchmarks with PP/TP combinations; introduced dedicated tables per network/hardware. |
| **Model listings updates (FP4/FP8)**<br>`docs/source/performance/perf-overview.md` | Expanded FP4/FP8 model entries: added Qwen3-235B-A22B/30B, DeepSeek-R1-0528 (FP4/FP8), and multiple Llama/Qwen/DeepSeek variants. |
| **Llama 4 Scout migration**<br>`docs/source/performance/perf-overview.md` | Replaced Llama 4 Maverick FP8 sections with Llama 4 Scout; consolidated FP4/FP8 tables under Scout branding. |
| **Command syntax and flags**<br>`docs/source/performance/perf-overview.md` | Introduced `ep_size` and the `--ep` flag for expert parallelism; updated the Reproducing Results and Variables sections with EP usage. |
| **Example workflows (TP/PP/EP)**<br>`docs/source/performance/perf-overview.md` | Updated `trtllm-bench` examples to specify TP/PP and EP for MoE; included MoE-specific usage. |
| **MoE configuration snippets**<br>`docs/source/performance/perf-overview.md` | Added YAML snippets for attention data parallelism and padding in `llm_options.yml` for MoE. |
| **Troubleshooting updates**<br>`docs/source/performance/perf-overview.md` | Expanded guidance on KV cache percentage tuning and memory considerations for MoE and dense models. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant U as User
  participant D as Perf Docs
  participant B as trtllm-bench
  participant R as TRT-LLM Runtime
  participant G as GPU(s)

  U->>D: Read benchmark guidance and flags (TP/PP/EP)
  Note over D: New flag ep_size / --ep for expert parallelism

  U->>B: Run with --tp, --pp [--ep]
  B->>R: Configure execution plan (TP/PP[/EP])
  alt Dense model
    R->>G: Launch kernels with TP/PP
  else MoE model (EP enabled)
    R->>G: Route tokens to experts (EP), apply dropout/padding
  end
  R-->>B: Throughput/latency results
  B-->>U: Output benchmark tables/logs
  Note over U,R: Troubleshoot KV cache % / memory if needed
```
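The TP/PP/EP composition in the diagram can be illustrated with a small sanity check. This is a hypothetical helper, not a TRT-LLM API; the exact rule for how `ep_size` composes with TP/PP should be taken from the updated docs, and the constraints below are common-case assumptions:

```python
def check_parallel_mapping(world_size: int, tp: int, pp: int, ep: int = 1) -> bool:
    """Hypothetical sanity check for a TP x PP (x EP for MoE) GPU mapping.

    Assumes the common layout where TP and PP tile the full GPU count,
    and EP (expert parallelism, MoE only) must evenly divide it.
    """
    if tp * pp != world_size:
        raise ValueError(f"tp*pp={tp * pp} must equal world_size={world_size}")
    if world_size % ep != 0:
        raise ValueError(f"ep={ep} must evenly divide world_size={world_size}")
    return True

# An 8-GPU MoE run with TP4, PP2 and 4-way expert parallelism:
check_parallel_mapping(world_size=8, tp=4, pp=2, ep=4)
```

Dense models simply leave `ep` at its default of 1, matching the `alt` branch in the diagram.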

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

- **Title Check** ⚠️ Warning — The PR title "add initial v1.0 data without rtx 6000 blackwell se" is misleading: the changeset (`docs/source/performance/perf-overview.md`) explicitly adds RTX 6000 Pro Blackwell Server Edition benchmarks and related updates, so the "without rtx 6000 blackwell se" clause contradicts the actual changes. The title does reference adding v1.0 data but inaccurately describes a major aspect of the change. *Resolution:* Update the PR title to accurately reflect the main change and follow the repository title template, for example "[None][doc] Add initial v1.0 performance data including RTX 6000 Pro Blackwell SE"; remove the contradictory "without" clause and include a ticket/issue reference if applicable.
- **Description Check** ⚠️ Warning — The PR description contains only "It's a mirror PR of #7848" and does not follow the repository's required template; it is missing a Description detailing what and why, Test Coverage, and the PR Checklist, so reviewers cannot assess tests or impact from the body of this PR alone. *Resolution:* Replace the PR body with the repository template content: include a concise Description summarizing the changes and rationale, list relevant Test Coverage items, and complete the PR Checklist (or explain omissions); also note the original PR (#7848) and any relevant ticket IDs to aid review.
✅ Passed checks (1 passed)

- **Docstring Coverage** ✅ Passed — No functions found in the changes; docstring coverage check skipped.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/source/performance/perf-overview.md (3)

75-85: Fix MD058: add blank lines around tables.

markdownlint flags these tables; add a blank line before and after each to satisfy MD058.

Apply diffs like (pattern shown for each block):

```diff
 RTX 6000 Pro Blackwell Server Edition
+
 | Sequence Length (ISL/OSL) | **4 GPUs**<br/>TP2,PP2 (FP4) | **8 GPUs**<br/>TP4,PP2 (FP4) |
 ...
 | 5000/500 | 2,662 | |
+
```

Repeat the same before/after the tables starting at the listed ranges.

Also applies to: 100-112, 127-138, 153-165, 191-202, 217-225


101-101: Grammar: use singular “1 GPU”.

```diff
-| Sequence Length (ISL/OSL) | **1 GPUs**<br/>TP1,PP1 (FP4) | **2 GPUs**<br/>TP1,PP2 (FP4) | **4 GPUs**<br/>TP1,PP4 (FP4) | **8 GPUs**<br/>TP1,PP8 (FP4) |
+| Sequence Length (ISL/OSL) | **1 GPU**<br/>TP1,PP1 (FP4) | **2 GPUs**<br/>TP1,PP2 (FP4) | **4 GPUs**<br/>TP1,PP4 (FP4) | **8 GPUs**<br/>TP1,PP8 (FP4) |
```

63-63: Define acronyms (ISL/OSL/TP/PP/EP) and clarify ep_size scope.

  • Add a one‑liner early to decode acronyms; many tables use “ISL/OSL” before Variables are introduced.
  • For ep_size, briefly clarify it applies to MoE models only and how it composes with TP/PP.

Proposed insertion (place near the top, after “Throughput Measurements” intro paragraph):

Note: ISL = Input Sequence Length, OSL = Output Sequence Length, TP = Tensor Parallelism, PP = Pipeline Parallelism, EP = Expert Parallelism (for MoE models).

Also applies to: 268-269, 315-319, 343-347

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af3ea37 and 2f066da.

📒 Files selected for processing (1)
  • docs/source/performance/perf-overview.md (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/performance/perf-overview.md
🪛 markdownlint-cli2 (0.18.1)
docs/source/performance/perf-overview.md

- 76-76: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 101-101: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 128-128: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 154-154: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 192-192: Tables should be surrounded by blank lines (MD058, blanks-around-tables)
- 218-218: Tables should be surrounded by blank lines (MD058, blanks-around-tables)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
docs/source/performance/perf-overview.md (4)

26-30: Confirm product naming and PR title intent for “RTX 6000 Pro Blackwell Server Edition”.

  • The doc adds “RTX 6000 Pro Blackwell Server Edition” data, while the PR title says “without rtx 6000 blackwell se”. Please confirm whether RTX 6000 Blackwell SE content should be included in v1.0 and align the PR title and section headings accordingly.
  • Also verify the official branding string (e.g., “NVIDIA RTX 6000 Blackwell”, “RTX 6000 Pro Blackwell”, “Server Edition”) and use it consistently across the Hardware list and tables.

Also applies to: 37-38, 75-75, 100-100, 127-127, 153-153, 191-191, 217-217


177-190: Scout vs Maverick: confirm coexistence or replacement.

Doc adds “Llama 4 Scout” and still contains “Llama 4 Maverick.” If Scout supersedes Maverick for FP8, either:

  • keep both with a short note explaining differences/use‑cases, or
  • remove Maverick to avoid confusion.

Also applies to: 191-202


46-49: Model lists vs tables: ensure consistency.

The FP4/FP8 model lists add Qwen3/DeepSeek entries; double‑check every listed model has corresponding tables (and vice versa), or add a brief note if some are intentionally omitted.

Also applies to: 58-59


351-351: Confirm YAML key enable_attention_dp — verified.

LLM args declare enable_attention_dp (tensorrt_llm/llmapi/llm_args.py: ~218 and Field at ~1169 with description "Enable attention data parallel") and the flag is referenced across CLI, tests and runtime code — keep the doc key as-is.
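For illustration, a fragment of such an `llm_options.yml` might look like the following. This is a sketch: only the `enable_attention_dp` key is confirmed by the verification above; the comment and surrounding layout are assumptions, and the actual snippet in the doc under review may differ.

```
# llm_options.yml — MoE-oriented extra options (illustrative layout)
# enable_attention_dp is declared in tensorrt_llm/llmapi/llm_args.py
# ("Enable attention data parallel").
enable_attention_dp: true
```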

@nv-guomingz nv-guomingz deleted the 1.0_perf branch September 22, 2025 04:55
