
Conversation

@dominicshanshan
Collaborator

@dominicshanshan dominicshanshan commented Sep 9, 2025

Summary by CodeRabbit

  • New Features

    • Multi-GPU-aware model download and distributed build for faster, more reliable multi-node setups.
  • Bug Fixes

    • Improved stability by clearing internal buffers before reuse.
    • More robust shared-memory handling and configurable name via environment variable to avoid collisions.
  • Documentation

    • Added FP4 support overview and updated precision support for Blackwell GPUs.
    • Improved LLM API troubleshooting, including guidance for hangs with docker --net=host and refined tips structure.
  • Tests

    • Expanded multi-node coverage, added/updated MOE and BF16 cases, adjusted timeouts, and enabled/disabled specific tests as appropriate.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
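
For example, commenting /bot run --disable-fail-fast --gpu-type "A30" launches the pre-merge pipeline with fail-fast disabled and limits test stages to A30 GPUs.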

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@coderabbitai
Contributor

coderabbitai bot commented Sep 9, 2025

📝 Walkthrough


Documentation updates add FP4 and Blackwell details and troubleshooting fixes. CI/test lists adjusted for GB200 and RTX PRO 6000. Core changes include buffer zeroing in attention, SHM/env handling in MoE load balancer, lazy dummy-KV initialization and warning tweak in resource management, distributed HF download/build in LLM API, Gemma normalization params update, and several test additions/skips.

Changes

Cohort | File(s) | Summary
  • Docs: LLM API and Troubleshooting | docs/source/llm-api/index.md | Fix quickstart path; promote bullets to H3; add docker --net=host hanging section with mpi4py notes and workarounds.
  • Docs: FP4/Blackwell Support | docs/source/overview.md, docs/source/reference/support-matrix.md | Add FP4 support note in overview; add Blackwell (SM100/SM120) with FP32/FP16/BF16/FP8/FP4/INT8/INT4 to precision entries and list-table.
  • Attention backend buffer init | tensorrt_llm/_torch/attention_backend/trtllm.py | Zero kv_block_ids_per_seq and block_ids_per_seq before copying slices in prepare_flash_mla.
  • MoE load balancer SHM/env and signature | tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py | Derive SHM base name from arg or TRTLLM_EPLB_SHM_NAME; on FileExistsError, unlink and recreate SHM; constructor signature adjusted to handle env-driven base name (a minimal sketch of this pattern follows the list).
  • PyExecutor: lazy dummy KV + SWA sync | tensorrt_llm/_torch/pyexecutor/_util.py | Defer dummy request creation; lazy init based on current max seq len; regenerate on SWA-induced decreases; enforce min length 1.
  • Resource manager warning detail | tensorrt_llm/_torch/pyexecutor/resource_manager.py | Warning uses computed max_tokens and appends memory context (free_mem/total_mem) when both fraction and max_tokens are set.
  • LLM API distributed HF ops | tensorrt_llm/llmapi/llm_utils.py | Add _submit_to_all_workers; add _node_download_hf_model and _node_build_task; use MPI-aware download/build for HF models; keep single-GPU behavior unchanged.
  • Gemma normalization params | tensorrt_llm/models/gemma/model.py | Pass norm_pre_residual_weight to attention; make eps conditional on inter-layer norms.
  • CI: Jenkins GB200 multi-node expansion | jenkins/L0_Test.groovy | Replace 5 static Post-Merge entries with dynamic generation of 7; adjust config 4th element from 5 to 7; include in fullSet.
  • Integration tests config YAMLs (KV cache reuse) | tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml, .../disagg_config_ctxpp2_gentp2.yaml, .../disagg_config_ctxpp4_genpp4.yaml, .../disagg_config_ctxtp2pp2_gentp2pp2.yaml | Add enable_block_reuse: False under kv_cache_config in context/generation sections as applicable.
  • Integration tests list updates | tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml, tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml | GB200: increase MOE test timeouts to 180; add latency_moe_trtllm_eagle3 and bf16[multi_gpus_no_cache]. RTX PRO 6000: remove Llama3.1-70B FP8 test.
  • Integration test: Qwen3-8B caching param | tests/integration/defs/accuracy/test_llm_api_pytorch.py | Parametrize test_bf16 with is_cached; add multi-GPU no-cache variant selecting uncached HF path.
  • Unit tests toggles | tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py, tests/unittest/_torch/speculative/test_eagle3.py, tests/unittest/llmapi/test_mpi_session.py | Enable DeepSeek streaming test (remove skip); skip Eagle3 test with bug ref; enable MPI session test, resolve path dynamically, stream subprocess I/O, raise on non-zero exit.
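
A minimal sketch of the unlink-and-recreate shared-memory handling summarized above for moe_load_balancer.py, using Python's multiprocessing.shared_memory; the function name and the "moe_shared" default base name are illustrative rather than the actual implementation:

import os
from multiprocessing import shared_memory
from typing import Optional


def create_eplb_shm(size: int, base_name: Optional[str] = None) -> shared_memory.SharedMemory:
    """Create the EPLB shared-memory segment, replacing a stale one left by a previous run."""
    # Base name comes from the argument or the TRTLLM_EPLB_SHM_NAME env var.
    name = base_name or os.environ.get("TRTLLM_EPLB_SHM_NAME", "moe_shared")
    try:
        return shared_memory.SharedMemory(name=name, create=True, size=size)
    except FileExistsError:
        # A previous process exited without cleanup: unlink the stale segment and recreate it.
        stale = shared_memory.SharedMemory(name=name, create=False)
        stale.close()
        stale.unlink()
        return shared_memory.SharedMemory(name=name, create=True, size=size)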

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User Process
  participant C as CachedModelLoader
  participant MPI as mpi_session
  participant W as Worker Nodes

  rect rgb(240,248,255)
  note over C: HF model load (multi-GPU aware)
  U->>C: load_hf_model(model, revision)
  alt parallel_config.is_multi_gpu
    C->>MPI: submit_sync(_node_download_hf_model, model, revision)
    MPI-->>C: List[Path|None] (per-rank results)
    C->>C: select first non-None Path
  else single GPU
    C->>C: _node_download_hf_model(model, revision)
  end
  C-->>U: hf_model_dir Path
  end

  rect rgb(245,255,250)
  note over C: Per-node engine build orchestration
  U->>C: build(model_args,...)
  alt multi-GPU
    C->>MPI: submit_sync(_node_build_task, llm_args,...)
    MPI-->>C: List[build_steps_info]
  else single GPU
    C->>C: _node_build_task(llm_args,...)
  end
  C-->>U: build result(s)
  end
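To make the download fan-out in the first diagram concrete, here is a small, self-contained sketch. FakeMpiSession, node_download_hf_model, and the /tmp path are stand-ins for the real mpi_session, _node_download_hf_model, and download location; the actual submit_sync signature may differ.

from pathlib import Path
from typing import Callable, List, Optional


class FakeMpiSession:
    """Hypothetical stand-in for mpi_session: runs the same task once per rank, serially."""

    def __init__(self, n_ranks: int) -> None:
        self.n_ranks = n_ranks

    def submit_sync(self, task: Callable[..., Optional[Path]], *args) -> List[Optional[Path]]:
        # The real session dispatches the task over MPI; here we just call it once per rank.
        return [task(rank, *args) for rank in range(self.n_ranks)]


def node_download_hf_model(rank: int, model: str) -> Optional[Path]:
    # Stand-in for _node_download_hf_model: only one rank per node reports the local snapshot.
    return Path("/tmp/models") / model if rank == 0 else None


def load_hf_model(session: FakeMpiSession, model: str) -> Path:
    results = session.submit_sync(node_download_hf_model, model)
    # Mirror the "select first non-None Path" step from the diagram.
    return next(path for path in results if path is not None)


if __name__ == "__main__":
    print(load_hf_model(FakeMpiSession(n_ranks=4), "Qwen3-8B"))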

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

weekly_release_blocker

Suggested reviewers

  • syuoni
  • QiJune
  • litaotju
  • chzblych
  • yuxianq

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Pre-merge checks (1 passed, 2 warnings)

❌ Failed checks (2 warnings)
  • Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check | ⚠️ Warning | The pull request description remains populated by the commented template placeholders without any actual title, description text, or test coverage details, and it does not include the required filled-out sections or CodeRabbit AI summary marker, making it largely incomplete. | Please add a top-level PR title following the “[JIRA ticket/…][type] Summary” format (or use “[None] @coderabbitai title”), then provide a concise “## Description” explaining what and why, list relevant tests under “## Test Coverage,” and ensure the PR Checklist items are reviewed and appropriately marked.
✅ Passed checks (1 passed)
  • Title Check | ✅ Passed | The pull request title follows the prescribed [None][chore] template and succinctly conveys the main intent of integrating release/1.0 changes into the main branch.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1)

1-38: Explicitly set enable_block_reuse in all kv_cache_config blocks
Add enable_block_reuse: False under the generation_servers.kv_cache_config in
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml, and audit the other disagg test configs to ensure every kv_cache_config block includes an explicit enable_block_reuse key.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

170-179: Avoid double-counting draft tokens in KV capacity estimation.

spec_cfg.max_draft_len is added twice (once under overlap-scheduler branch and again unconditionally), which inflates KV estimates and may reduce usable capacity.

-        if not pytorch_backend_config.disable_overlap_scheduler:
-            num_extra_tokens_per_seq = num_extra_tokens_per_seq + 1
-            if spec_cfg is not None:
-                num_extra_tokens_per_seq += spec_cfg.max_draft_len
-
-        if spec_cfg is not None:
-            num_extra_tokens_per_seq += spec_cfg.max_draft_len
-            num_extra_tokens_per_seq += get_num_extra_kv_tokens(spec_cfg)
+        if not pytorch_backend_config.disable_overlap_scheduler:
+            num_extra_tokens_per_seq += 1
+
+        if spec_cfg is not None:
+            num_extra_tokens_per_seq += spec_cfg.max_draft_len
+            num_extra_tokens_per_seq += get_num_extra_kv_tokens(spec_cfg)
🧹 Nitpick comments (14)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

555-558: Validate fraction range.

Consider asserting lower bound too to catch 0 or negative values: 0.0 < free_mem_fraction < 1.0.

-        assert free_mem_fraction < 1.0, f"Invalid freeMemFraction, freeMemFraction {free_mem_fraction} must be smaller than 1.0"
+        assert 0.0 < free_mem_fraction < 1.0, (
+            f"Invalid freeMemFraction: {free_mem_fraction}. Must be in (0.0, 1.0)."
+        )
docs/source/reference/support-matrix.md (1)

162-162: SM designations are valid; small terminology tweak optional.

“Blackwell (SM100/SM120)” aligns with CUDA 12.8 notes (SM_100/SM_120). Consider using “NVFP4” instead of “FP4” for precision naming consistency with code. (docs.nvidia.com)

docs/source/overview.md (1)

27-29: Nit: spacing and terminology.

  • Remove the stray space before the comma after the B200 link.
  • Prefer “NVFP4” (brand) once, then “FP4” thereafter, to match code/constants.
- [NVIDIA B200 GPUs](https://www.nvidia.com/en-us/data-center/dgx-b200/) , when used with TensorRT-LLM, enable seamless loading of model weights in the new [FP4 format](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/#what_is_nvfp4), allowing you to automatically leverage optimized FP4 kernels for efficient and accurate low-precision inference.
+ [NVIDIA B200 GPUs](https://www.nvidia.com/en-us/data-center/dgx-b200/), when used with TensorRT-LLM, enable seamless loading of model weights in the new [NVFP4 format](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/#what_is_nvfp4), allowing you to automatically leverage optimized FP4 kernels for efficient and accurate low-precision inference.
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

846-851: Zero only the written slices to avoid unnecessary work.

Full-buffer fill_ can be sizable; zero the relevant regions instead and prefer zero_() for clarity.

-        self.kv_block_ids_per_seq.fill_(0)
-        self.kv_block_ids_per_seq[:self.num_seqs, :num_blocks].copy_(
+        self.kv_block_ids_per_seq[:self.num_seqs, :num_blocks].zero_()
+        self.kv_block_ids_per_seq[:self.num_seqs, :num_blocks].copy_(
             block_ids_per_seq, non_blocking=True)
-        self.block_ids_per_seq.fill_(0)
-        self.block_ids_per_seq[:self.num_generations, :num_blocks].copy_(
+        self.block_ids_per_seq[:self.num_generations, :num_blocks].zero_()
+        self.block_ids_per_seq[:self.num_generations, :num_blocks].copy_(
             block_ids_per_seq[self.num_contexts:], non_blocking=True)
tests/unittest/_torch/speculative/test_eagle3.py (1)

331-332: Optional: avoid running unittest main when skipping heavy tests.

Since pytest drives CI, dropping the __main__ block reduces accidental local execution. Keep if needed for local workflows.

tensorrt_llm/models/gemma/model.py (1)

141-146: Guard tuple-unpacking of hidden_states under reduce_fusion.

The FIXME hints at potential shape mismatches. Add a defensive check to avoid crashes when upstream doesn’t return a (hidden_states, residual) tuple for some ranks/configs.

-        if default_net(
-        ).plugin_config.reduce_fusion and self.local_layer_idx > 0:
-            hidden_states, residual = hidden_states  #FIXME:AN need to check if appropriate residual value is hidden state is pulled out.
+        if default_net().plugin_config.reduce_fusion and self.local_layer_idx > 0:
+            if isinstance(hidden_states, tuple):
+                hidden_states, residual = hidden_states
+            else:
+                # Fallback to non-fused contract
+                residual = hidden_states
+                hidden_states = self.input_layernorm(hidden_states)
         else:
             residual = hidden_states
             hidden_states = self.input_layernorm(hidden_states)
tests/unittest/llmapi/test_mpi_session.py (2)

60-66: Resolve path once; reuse for cwd to avoid recomputing.

Minor cleanup: reuse cur_dir for cwd instead of recomputing os.path.dirname(os.path.abspath(__file__)).

-    with Popen(command,
+    with Popen(command,
                env=os.environ,
                stdout=PIPE,
                stderr=PIPE,
                bufsize=1,
                start_new_session=True,
                universal_newlines=True,
-               cwd=os.path.dirname(os.path.abspath(__file__))) as process:
+               cwd=cur_dir) as process:

62-62: Silence Ruff S101 for test assertion (intended in tests).

Keep the fast-fail assert but add a local noqa to satisfy Ruff.

-    assert os.path.exists(test_file), f"Test file {test_file} does not exist"
+    assert os.path.exists(test_file), f"Test file {test_file} does not exist"  # noqa: S101
tensorrt_llm/_torch/pyexecutor/_util.py (1)

181-184: Lazy-initializing dummy requests here is good; also guard in configure_kv_cache_capacity.

Some callers might jump straight to configure_kv_cache_capacity(); add a safety init there to avoid None usage.

@@
-        py_executor.set_gather_responses(True)
+        # Ensure dummy requests exist even if try_prepare_estimation() was skipped.
+        if self._dummy_reqs is None:
+            self._dummy_reqs = self._create_dummy_context_requests(
+                max(1, self._max_seq_len - 1)
+            )
+        py_executor.set_gather_responses(True)
docs/source/llm-api/index.md (3)

68-68: Fix markdownlint MD026: remove trailing punctuation from heading

Drop the period at the end of this heading to satisfy MD026.

-### MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
+### MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1

78-79: Grammar/typo fixes in guidance

s/an contextmanager/a context manager/ and s/methed/method/.

-  2. Use LLM as an contextmanager, with the following code: `with LLM(...) as llm: ...`, the shutdown methed will be invoked automatically once it goes out of the `with`-statement block.
+  2. Use LLM as a context manager, with the following code: `with LLM(...) as llm: ...`. The shutdown method will be invoked automatically once it exits the `with` block.
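
A minimal usage sketch of the recommended pattern, assuming the public tensorrt_llm import path; the model identifier is a placeholder:

from tensorrt_llm import LLM

# Using LLM as a context manager ensures shutdown() runs when the with-block exits.
with LLM(model="<hf-model-id-or-local-path>") as llm:
    outputs = llm.generate(["Hello, my name is"])
    print(outputs)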

82-93: Clarify the docker networking workaround scope

The mpi4py workaround suggests avoiding --net=host; note explicitly that --ipc=host is the required flag, and keep --net default. This helps users not set both.

-The root cause may be related to `mpi4py`. There is a [workaround](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) suggesting a change from `--net=host` to `--ipc=host`, or setting the following environment variables:
+The root cause may be related to `mpi4py`. A documented [workaround](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) is to avoid `--net=host` and use `--ipc=host` instead, or set the following environment variables:
jenkins/L0_Test.groovy (1)

2105-2107: Avoid magic number 7 for post-merge splits

Make split count a named constant to keep Jenkins and test-db YAML in sync and ease future changes.

-    multiNodesSBSAConfigs += (1..7).collectEntries { i ->
-        ["GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-${i}".toString(), ["gb200-multi-node", "l0_gb200_multi_nodes", i, 7, 8, 2]]
-    }
+    // Use a single source of truth for split count.
+    final int POST_MERGE_SPLITS = 7
+    multiNodesSBSAConfigs += (1..POST_MERGE_SPLITS).collectEntries { i ->
+        ["GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-${i}".toString(),
+         ["gb200-multi-node", "l0_gb200_multi_nodes", i, POST_MERGE_SPLITS, 8, 2]]
+    }

If preferred, I can place POST_MERGE_SPLITS next to other @field constants.

tensorrt_llm/llmapi/llm_utils.py (1)

769-771: Don’t delete attributes from llm_args

Deleting the mpi_session attribute mutates the schema and risks surprises for later users. Prefer setting it to None.

-                    if self.llm_args.mpi_session:
-                        del self.llm_args.mpi_session
+                    if getattr(self.llm_args, "mpi_session", None) is not None:
+                        self.llm_args.mpi_session = None
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c53d181 and 19d6afa.

📒 Files selected for processing (20)
  • docs/source/llm-api/index.md (2 hunks)
  • docs/source/overview.md (1 hunks)
  • docs/source/reference/support-matrix.md (1 hunks)
  • jenkins/L0_Test.groovy (1 hunks)
  • tensorrt_llm/_torch/attention_backend/trtllm.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
  • tensorrt_llm/llmapi/llm_utils.py (4 hunks)
  • tensorrt_llm/models/gemma/model.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml (2 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_genpp4.yaml (2 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2pp2_gentp2pp2.yaml (2 hunks)
  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml (0 hunks)
  • tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (0 hunks)
  • tests/unittest/_torch/speculative/test_eagle3.py (1 hunks)
  • tests/unittest/llmapi/test_mpi_session.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py
  • tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/_torch/speculative/test_eagle3.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tests/unittest/llmapi/test_mpi_session.py
  • tensorrt_llm/models/gemma/model.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/llmapi/llm_utils.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/_torch/speculative/test_eagle3.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tests/unittest/llmapi/test_mpi_session.py
  • tensorrt_llm/models/gemma/model.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/llmapi/llm_utils.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/_torch/speculative/test_eagle3.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tests/unittest/llmapi/test_mpi_session.py
  • tensorrt_llm/models/gemma/model.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/llmapi/llm_utils.py
🧠 Learnings (6)
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
PR: NVIDIA/TensorRT-LLM#7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/unittest/_torch/speculative/test_eagle3.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • docs/source/overview.md
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-13T11:07:11.772Z
Learnt from: Funatiq
PR: NVIDIA/TensorRT-LLM#6754
File: tests/integration/test_lists/test-db/l0_a30.yml:41-47
Timestamp: 2025-08-13T11:07:11.772Z
Learning: In TensorRT-LLM test configuration files like tests/integration/test_lists/test-db/l0_a30.yml, TIMEOUT values are specified in minutes, not seconds.

Applied to files:

  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml
🧬 Code graph analysis (5)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
tensorrt_llm/_torch/attention_backend/interface.py (1)
  • num_seqs (259-263)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (2)
tensorrt_llm/_torch/expert_statistic.py (1)
  • create (18-29)
tensorrt_llm/logger.py (1)
  • warning (132-133)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
  • max_seq_len (561-571)
  • max_seq_len (574-578)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
tensorrt_llm/llmapi/build_cache.py (1)
  • is_cached (256-271)
tensorrt_llm/llmapi/llm_args.py (1)
  • CudaGraphConfig (108-165)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1017-1033)
tests/integration/defs/conftest.py (1)
  • llm_models_root (77-82)
tensorrt_llm/llmapi/llm_utils.py (3)
tensorrt_llm/_utils.py (6)
  • global_mpi_rank (497-498)
  • local_mpi_rank (509-510)
  • mpi_barrier (526-528)
  • mpi_broadcast (536-537)
  • mpi_rank (493-494)
  • release_gc (678-685)
tensorrt_llm/llmapi/llm_args.py (2)
  • parallel_config (1410-1411)
  • is_multi_gpu (279-280)
tensorrt_llm/llmapi/utils.py (2)
  • print_traceback_on_error (31-42)
  • download_hf_model (173-182)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_mpi_session.py

62-62: Use of assert detected

(S101)

🪛 markdownlint-cli2 (0.17.2)
docs/source/llm-api/index.md

68-68: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)

🔇 Additional comments (11)
tests/unittest/_torch/speculative/test_eagle3.py (1)

19-19: Skip marker placement looks good; preserves CI stability.

The @pytest.mark.skip is correctly above parametrization and will skip the entire test as intended. Based on prior CI behavior, skip markers remain honored even with a __main__ block.
Using retrieved learning from #7370.

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1)

17-17: Block reuse explicitly disabled only on context side; confirm intent.

Generation side lacks enable_block_reuse. For parity and predictable behavior across roles, consider setting it explicitly there as well (as done in other configs), or document why it differs.

If parity is desired, add under generation_servers.kv_cache_config:

   kv_cache_config:
     free_gpu_memory_fraction: 0.2
     enable_partial_reuse: False
+    enable_block_reuse: False
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml (1)

17-17: Consistency across disagg configs looks good.

Disabling block reuse on both context and generation matches the stated pattern.

Also applies to: 33-33

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_genpp4.yaml (1)

17-17: LGTM — explicit disable on both roles.

Matches the approach in the ctxpp2_genpp2 variant.

Also applies to: 33-33

tensorrt_llm/models/gemma/model.py (1)

160-166: Fused norm params wiring looks consistent; confirm plugin semantics.

Passing pre-FFN norm weight and the post-layernorm weight via norm_weight and norm_pre_residual_weight matches the MLP path pattern. Please confirm RESIDUAL_RMS_PREPOST_NORM expects these exact roles (pre-next-layer input vs. pre-residual), to avoid silent mis-normalization.

tensorrt_llm/_torch/pyexecutor/_util.py (2)

56-58: Good: defer dummy request creation; store max_seq_len for later sync.

This reduces constructor side effects and helps SWA workflows. LGTM.


473-479: SWA-aware re-sync of dummy requests looks correct.

Regenerating _dummy_reqs when max_seq_len shrinks avoids overlong dummy contexts post-SWA. LGTM.

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2pp2_gentp2pp2.yaml (1)

17-17: Disabling KV block reuse: confirm intent and perf impact

Setting enable_block_reuse: False on both context/generation likely reduces memory reuse and could affect throughput. If this is a WAR for kernel gaps, add a brief YAML comment and link to the tracking issue; otherwise consider leaving it enabled in perf-oriented tests.

Apply this minimal note to document the rationale:

   kv_cache_config:
     free_gpu_memory_fraction: 0.2
     enable_partial_reuse: False
+    # WAR: disable block reuse for this scenario (see <issue/bug id>)
     enable_block_reuse: False

Also applies to: 33-33

tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1)

36-39: Post-merge MOE additions look good; double-check CI timing

New entries and 180-minute timeouts align with prior convention (timeouts are minutes). Please verify total wall time against the Jenkins splits introduced in L0_Test.groovy (now 7 stages) so we don’t exceed the pipeline SLA.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

2356-2367: Param adds for multi_gpus_no_cache: path selection logic LGTM; ensure HF credentials

The is_cached switch and model path fallback are correct. For the non-cached 8-GPU case, ensure HF auth tokens are present on all workers to avoid per-rank auth failures.

Also applies to: 2369-2375

tensorrt_llm/llmapi/llm_utils.py (1)

831-841: Node-scoped HF download helper: nice

Limiting downloads to local_mpi_rank()==0 reduces redundant I/O across ranks cleanly.
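
A rough sketch of the node-scoped gating described here, assuming the helpers listed in the code graph (local_mpi_rank, mpi_barrier, download_hf_model); the exact _node_download_hf_model signature and the revision keyword are assumptions, not the actual implementation:

from pathlib import Path
from typing import Optional

from tensorrt_llm._utils import local_mpi_rank, mpi_barrier
from tensorrt_llm.llmapi.utils import download_hf_model


def node_download(model: str, revision: Optional[str] = None) -> Optional[Path]:
    """Download the HF snapshot once per node; other local ranks wait and return None."""
    hf_dir: Optional[Path] = None
    if local_mpi_rank() == 0:
        # Only the first rank on each node hits the network / populates the local cache.
        hf_dir = download_hf_model(model, revision=revision)
    # No rank proceeds to the build step before the snapshot is in place on its node.
    mpi_barrier()
    return hf_dir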

Collaborator

@nv-guomingz nv-guomingz left a comment


LGTM

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18201 [ run ] triggered by Bot

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18306 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18306 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13726 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

yizhang-nv and others added 9 commits September 19, 2025 07:49
…Qwen3 MoE multi node (NVIDIA#7724)

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…VIDIA#7753)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…ywords for multimodal model (NVIDIA#7670)

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…#7751)

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
NVIDIA#7762)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
Co-authored-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…on pre-Blackwell arch (NVIDIA#7768)

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
@dominicshanshan
Collaborator Author

/bot kill

…VIDIA#7663)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
@tensorrt-cicd
Collaborator

PR_Github #19365 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19365 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 1fcfeeb

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #19367 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19367 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14543 (Blue Ocean) completed with status: ABORTED

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #19403 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19403 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14577 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #19419 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19419 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14588 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #19436 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19436 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14605 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot skip --comment "Bypass faulty oom test."

@tensorrt-cicd
Collaborator

PR_Github #19505 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19505 [ skip ] completed with state SUCCESS
Skipping testing for commit 1fcfeeb

@chzblych chzblych merged commit ab26d21 into NVIDIA:main Sep 22, 2025
5 checks passed