Conversation

@jiahanc jiahanc (Collaborator) commented Nov 17, 2025

Summary by CodeRabbit

  • Documentation

    • Updated DeepSeek v3 model configuration examples and documentation with revised deployment settings.
  • Chores

    • Adjusted CUDA graph configuration parameters and increased batch size limits in the OpenAI-compatible API server examples to optimize performance.

Description

Update the DS-R1 example guide to disable KV cache reuse and reset the CUDA graph sizes.
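
For orientation, a minimal sketch of the kind of extra-llm-api-config.yml settings involved — field names follow TensorRT-LLM's LLM API options, but the placement and values here are illustrative rather than copied from the README diff:

kv_cache_config:
  enable_block_reuse: false   # illustrative: turn off KV cache block reuse
cuda_graph_config:
  enable_padding: true        # illustrative: padding-based CUDA graphs instead of an explicit batch_sizes list
  max_batch_size: 1024        # illustrative value; see the README for the per-example settings

Refer to the updated README sections for the exact blocks used by each example.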

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
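
For example (flag values are placeholders taken from the descriptions above; combine options as needed for your change):

/bot run --disable-fail-fast
/bot run --stage-list "A10-PyTorch-1"
/bot run --gpu-type "A30, H100_PCIe"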

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@jiahanc jiahanc requested review from a team as code owners November 17, 2025 17:33
@coderabbitai coderabbitai bot (Contributor) commented Nov 17, 2025

📝 Walkthrough

Updated the DeepSeek V3 README with configuration modifications to disable KV cache block reuse and increase the maximum batch size from 1024 to 2048 across multiple example configurations.

Changes

Cohort: Configuration updates in DeepSeek V3 docs
File(s): examples/models/core/deepseek_v3/README.md
Change Summary: Added enable_block_reuse: false under cuda_graph_config in multiple extra-llm-api-config.yml example blocks; replaced explicit batch_sizes entries with the disable flag; increased --max_batch_size from 1024 to 2048 in the trtllm-serve server configuration.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

  • Changes are homogeneous configuration updates (same pattern repeated across multiple examples)
  • No logic modifications or structural changes
  • All edits are cosmetic/configuration-based within a documentation file

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Description check — ❓ Inconclusive
    Explanation: The PR description is minimal and lacks context about why changes are needed. It references disabling KV reuse and resetting CUDA graph size but does not explain the rationale, impact, or specific test coverage.
    Resolution: Expand the description to explain why KV reuse is being disabled and why CUDA graph size is being reset. Add specific details about test coverage and which test cases validate these configuration changes.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: The title '[None][doc] Update DS-R1 example doc' follows the required format and accurately describes the changes: documentation updates to the DeepSeek-R1 example.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/models/core/deepseek_v3/README.md (2)

245-245: Add a brief comment explaining enable_block_reuse: false for clarity.

The enable_block_reuse: false setting aligns with the PR objective to disable KV reuse (documented at line 838). However, users reading these example configurations may benefit from an inline comment explaining why this is disabled and when they should consider using it. Consider adding a comment like:

cuda_graph_config:
  enable_padding: true
  enable_block_reuse: false  # Disabled for benchmarking stability; see KV Cache Reuse section
  max_batch_size: 1024

309-309: Align trtllm-serve default with configuration examples.

Increasing --max_batch_size from 1024 to 2048 aligns with the FP4 max-throughput config change (line 260). However, this is now the default for the API server example, which may surprise users expecting lower default batch sizes for latency-sensitive workloads. Consider clarifying in nearby documentation that users should adjust --max_batch_size based on their use case (e.g., lower for latency, higher for throughput). Alternatively, add a comment in the example explaining this is a throughput-oriented default.
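
A sketch of what such an inline note could look like — only --max_batch_size comes from the diff summary; the model argument and comment wording are placeholders:

# throughput-oriented default; lower --max_batch_size for latency-sensitive workloads
trtllm-serve <model> --max_batch_size 2048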

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df41f22 and 94c0610.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/README.md (5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • examples/models/core/deepseek_v3/README.md
📚 Learning: 2025-09-04T07:33:10.618Z
Learnt from: MrGeva
Repo: NVIDIA/TensorRT-LLM PR: 7219
File: tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py:162-168
Timestamp: 2025-09-04T07:33:10.618Z
Learning: When users explicitly provide cuda_graph_batch_sizes in TorchCudagraphCompiler, respect their choices and only sanitize the values (clamp, dedupe, sort) without forcing additional batch sizes like 1 or max_batch_size. Only add commonly-used batch sizes when falling back to the heuristic.

Applied to files:

  • examples/models/core/deepseek_v3/README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
examples/models/core/deepseek_v3/README.md (4)

259-260: Verify increased batch size doesn't trigger OOM in practical use.

The max_batch_size is increased from 1024 to 2048 for the max-throughput configuration. While this supports higher throughput, the documentation at line 317 already warns of possible OOM issues with max-throughput configs. Ensure this batch size increase has been validated on B200 hardware and update the OOM mitigation guidance if needed.


273-274: Configuration is consistent with FP4 min-latency counterpart.

The enable_block_reuse: false addition here mirrors line 245, maintaining consistency across precision variants. The lower max_batch_size (1024) for min-latency is appropriate.


289-292: Verify FP8 max-throughput batch size configuration.

This section sets max_batch_size: 512, which is lower than the FP4 max-throughput setting (2048 at line 260). Verify this difference is intentional and reflects FP8's higher memory requirements. The comment at line 317 about reducing max_num_tokens to 3072 should be validated for FP8 configs as well.


150-151: Confirm intentional differences between configuration sections.

The Long context support section (ISL-64k and ISL-128k) retains explicit batch_sizes lists, while the serving configurations now use enable_block_reuse: false. This is appropriate since these sections target different use cases (benchmarking vs. serving), but it's worth confirming this split design is intentional. The consistency with the learning from PR 7219 about respecting explicit batch sizes is good here.

Also applies to: 177-177
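
To make the contrast concrete, the two styles referenced here look roughly like the following two separate example configs (field values are illustrative, not copied from the README):

# (1) Long-context benchmarking examples keep an explicit CUDA graph batch-size list
cuda_graph_config:
  batch_sizes: [1, 2, 4, 8, 16]

# (2) Serving examples switch to padding plus a cap, with enable_block_reuse: false added alongside
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024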

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@QiJune QiJune (Collaborator) left a comment

LGTM

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@kaiyux kaiyux (Member) commented Nov 19, 2025

/bot skip --comment "doc changes"

@kaiyux kaiyux enabled auto-merge (squash) November 19, 2025 04:32
@tensorrt-cicd (Collaborator)

PR_Github #24980 [ skip ] triggered by Bot. Commit: 3c8c0ad

@tensorrt-cicd (Collaborator)

PR_Github #24980 [ skip ] completed with state SUCCESS. Commit: 3c8c0ad
Skipping testing for commit 3c8c0ad

@kaiyux kaiyux merged commit 255e4ea into NVIDIA:main Nov 19, 2025
5 checks passed
lkomali pushed a commit to lkomali/TensorRT-LLM that referenced this pull request Nov 19, 2025
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: lkomali <lkomali@nvidia.com>
