
Conversation

@amitz-nv
Collaborator

@amitz-nv amitz-nv commented Oct 13, 2025

Description

Cherrypick of:

See description & test coverage there.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

  • New Features

    • Simplified LoRA setup: automatic mapping/model-config handling and checkpoint loading without runtime_mapping.
    • Added support for fused QKV attention mapping (“attn_qkv” → “qkv_proj”) improving adapter compatibility.
    • Python-side model config conversion from C++ bindings for smoother mixed-backend workflows.
  • Refactor

    • Unified LoRA initialization across PyTorch/TRT paths via a single accessor and consistent constructor, reducing setup friction and errors.
  • Tests

    • Added Phi-3 LoRA fused-modules test ensuring TP2 outputs match TP1.
    • Introduced RPC TP2 and streaming tests (currently skipped) to expand coverage.

@amitz-nv amitz-nv self-assigned this Oct 13, 2025
@amitz-nv amitz-nv requested a review from a team as a code owner October 13, 2025 08:35
@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21182 [ run ] triggered by Bot

@coderabbitai
Contributor

coderabbitai bot commented Oct 13, 2025

📝 Walkthrough

Walkthrough

The change refactors LoRA initialization and configuration across Python and C++ runners by introducing Python/CPP ModelConfig types and a conversion method, updating LoraManager’s constructor and load APIs to rely on stored mapping/model_config, and exposing a PeftCacheManager.get_lora_manager accessor. Utilities and tests were updated accordingly, including an added LayerType-to-string helper and new TP consistency tests.
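
As a rough illustration of the refactor (a minimal sketch only; the exact signatures are simplified and any names not mentioned in this PR are assumptions), the two flows now look roughly like this:

from tensorrt_llm.lora_manager import LoraManager

def build_trt_lora_manager(runtime_mapping, model_config, lora_dirs):
    # TRT-Python flow: LoraManager is constructed with keyword-only mapping
    # and model_config instead of being created empty and configured later.
    lora_manager = LoraManager(mapping=runtime_mapping, model_config=model_config)
    # load_from_ckpt no longer takes a runtime_mapping argument; the manager
    # relies on the mapping/model_config stored at construction time.
    lora_manager.load_from_ckpt(lora_dirs, model_config=model_config)
    return lora_manager

def get_pytorch_lora_manager(peft_cache_manager):
    # PyTorch flow: the PEFT cache manager owns the LoraManager and exposes
    # it through the new get_lora_manager() accessor.
    return peft_cache_manager.get_lora_manager()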

Changes

Cohort / File(s) — Summary of Changes

  • LoRA manager constructor and loaders — tensorrt_llm/lora_manager.py: LoraManager now requires keyword-only mapping and model_config; removed runtime_mapping from load methods; internal reliance on stored mapping/model_config.
  • Resource/PEFT managers API shift — tensorrt_llm/_torch/pyexecutor/resource_manager.py: Introduced ModelConfigPython (Python alias) and ModelConfigCpp; updated signatures to use ModelConfigCpp; added get_lora_manager; PEFT init passes Python-side model_config via ModelConfigPython.from_model_config_cpp(...).
  • Runtime runners LoRA wiring — tensorrt_llm/runtime/model_runner.py, tensorrt_llm/runtime/model_runner_cpp.py, tensorrt_llm/runtime/enc_dec_model_runner.py, tensorrt_llm/executor/base_worker.py: Replaced LoraManager() calls with LoraManager(mapping=..., model_config=...); removed runtime_mapping in load calls; in some paths the manager is obtained via peft_cache_manager.get_lora_manager(); the C++ runner maps the CPP model config to Python via the new converter.
  • ModelConfig conversion — tensorrt_llm/runtime/generation.py: Added ModelConfig.from_model_config_cpp to construct a Python ModelConfig from a C++ binding, using dtype/layer-type converters.
  • Utils — tensorrt_llm/_utils.py: Added binding_layer_type_to_str(LayerType) -> str; imported LayerType.
  • LoRA helper mapping — tensorrt_llm/lora_helper.py: Extended default module mapping: added "attn_qkv" → "qkv_proj".
  • Tests: LoRA TP consistency and RPC — tests/unittest/llmapi/lora_test_utils.py, tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py: Added Phi-3 fused-modules LoRA TP1 vs TP2 test utilities and test; added RPC TP2 tests (skipped); minor typing change to List[int]; introduced LoraConfig usage.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Runner as ModelRunner/Worker
  participant RM as ResourceManager/PEFT
  participant LM as LoraManager
  participant CPP as ModelConfigCpp
  participant PyMC as ModelConfigPython

  rect rgb(245,245,255)
    note over Runner,RM: Initialization
    Runner->>RM: init_peft(model_config: CPP)
    RM->>PyMC: from_model_config_cpp(CPP)
    RM->>LM: construct(mapping, model_config: PyMC, cpp_peft_cache_manager?)
    RM-->>Runner: get_lora_manager()
  end

  rect rgb(245,255,245)
    note over Runner,LM: Load adapters
    Runner->>LM: load_from_ckpt(model_dirs_or_files, model_config, uids?, ckpt_source)
    LM->>LM: use internal mapping/model_config (no runtime_mapping)
    LM-->>Runner: adapters loaded
  end
sequenceDiagram
  autonumber
  participant Enc as EncRunner
  participant Dec as DecRunner
  participant LM_E as LoraManager(Enc)
  participant LM_D as LoraManager(Dec)

  note over Enc,LM_E: Encoder path
  Enc->>LM_E: construct(mapping_enc, model_config_enc)
  Enc->>LM_E: load_from_hf(model_dirs_enc, model_config_enc, uids?, component?)
  LM_E-->>Enc: ready

  note over Dec,LM_D: Decoder path
  Dec->>LM_D: construct(mapping_dec, model_config_dec)
  Dec->>LM_D: load_from_hf(model_dirs_dec, model_config_dec, uids?, component?)
  LM_D-->>Dec: ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Superjomn
  • shaharmor98
  • schetlur-nv

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check — ⚠️ Warning: The pull request description includes the required section headings but not the necessary content: the Description section merely references a cherry-pick without summarizing the issue or the proposed fix, and the Test Coverage section is blank, so no relevant tests are identified. Resolution: expand the Description with a concise explanation of the problem and the solution applied, populate Test Coverage with the specific tests that validate the changes, and complete all sections of the template.

✅ Passed checks (1 passed)

  • Title Check — ✅ Passed: The pull request title succinctly summarizes the main change, indicating a fix for the PyTorch and TRT-Python flows related to fused LoRA adapter module weight splitting when tensor parallelism is greater than one.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/lora_manager.py (1)

860-868: Guard against empty adapter weights in NeMo path

max() on an empty list raises ValueError if no supported weights were found for a UID. Add a safe early-exit.

-            max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
+            if len(self._cpp_lora_weights[uid]) == 0:
+                warnings.warn(f"No supported NeMo LoRA weights found in {model_file}; skipping uid {uid}")
+                del self._lora_uid_to_low_ranks[uid]
+                del self._cpp_lora_weights[uid]
+                del self._cpp_lora_config[uid]
+                return
+            max_weight_size = max(w.size(0) for w in self._cpp_lora_weights[uid])
♻️ Duplicate comments (1)
tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (1)

90-104: Same missing imports as the previous test.

This async test has the same undefined names as test_llm_rpc_tp2. The imports suggested in the previous comment will fix this test as well.

🧹 Nitpick comments (4)
tests/unittest/llmapi/lora_test_utils.py (1)

29-29: Remove unused noqa directives.

The # noqa: D205 directives on lines 29 and 56 are flagged as unused by the linter. These can be safely removed, as the D205 rule (which requires a blank line between the docstring summary line and the description) is not enabled.

Apply this diff:

-    """  # noqa: D205
+    """

Also applies to: 56-56

tensorrt_llm/executor/base_worker.py (1)

221-223: Harden against missing PEFT cache manager

If LoRA is enabled but PEFT cache manager wasn't provisioned, get_lora_manager() will raise obscure errors. Add a clear guard.

-            peft_cache_manager = self.engine.resource_manager.resource_managers.get(
-                ResourceManagerType.PEFT_CACHE_MANAGER)
-            self._lora_manager = peft_cache_manager.get_lora_manager()
+            peft_cache_manager = self.engine.resource_manager.resource_managers.get(
+                ResourceManagerType.PEFT_CACHE_MANAGER)
+            if peft_cache_manager is None:
+                raise RuntimeError(
+                    "LoRA is enabled but PEFT cache manager is not available in the resource manager."
+                )
+            self._lora_manager = peft_cache_manager.get_lora_manager()
tensorrt_llm/lora_manager.py (2)

976-1008: Fused-weight interleaver: solid logic; replace asserts and drop unused noqa

The interleave logic and validation are correct for attn_qkv and TP>1. Replace bare asserts with explicit exceptions and remove the unused noqa.

Apply:

-            """  # noqa: D205
+            """
-            assert weight.shape[rank_dim] == sum(part_sizes)
+            if weight.shape[rank_dim] != sum(part_sizes):
+                raise ValueError(
+                    f"Unexpected fused LoRA weight shape {tuple(weight.shape)} along dim {rank_dim}; "
+                    f"expected sum(part_sizes)={sum(part_sizes)}"
+                )
@@
-            for i in range(len(part_sizes)):
-                assert weight_parts[i].shape[rank_dim] % tp_size == 0
+            for i in range(len(part_sizes)):
+                if weight_parts[i].shape[rank_dim] % tp_size != 0:
+                    raise ValueError(
+                        f"Fused part index {i} size {weight_parts[i].shape[rank_dim]} "
+                        f"is not divisible by tp_size={tp_size}"
+                    )

Based on learnings
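
For readers unfamiliar with the fused-weight handling, here is a conceptual sketch of what interleaving a fused LoRA weight for TP amounts to. This is not the code under review; parameter names follow the diff above, and the implementation details are illustrative only.

import torch

def interleave_fused_weight_for_tp(weight: torch.Tensor, part_sizes, tp_size: int,
                                   rank_dim: int = 0) -> torch.Tensor:
    # Validate that the fused dimension matches the declared part sizes and that
    # every part splits evenly across TP ranks (mirrors the checks in the diff).
    if weight.shape[rank_dim] != sum(part_sizes):
        raise ValueError("fused weight size does not match sum(part_sizes)")
    parts = torch.split(weight, list(part_sizes), dim=rank_dim)
    if any(p.shape[rank_dim] % tp_size != 0 for p in parts):
        raise ValueError("each fused part must be divisible by tp_size")
    # Shard each part (e.g. Q, K, V) across ranks, then re-concatenate so every
    # rank's slices of all parts are adjacent: [Q0, K0, V0, Q1, K1, V1, ...].
    shards = [torch.chunk(p, tp_size, dim=rank_dim) for p in parts]
    return torch.cat([s[r] for r in range(tp_size) for s in shards], dim=rank_dim)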


1-1: Add NVIDIA Apache-2.0 header

This source file is missing the standard header. Please prepend it per coding guidelines.

+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

As per coding guidelines

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad0e91a and b5bdb86.

📒 Files selected for processing (11)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (8 hunks)
  • tensorrt_llm/_utils.py (2 hunks)
  • tensorrt_llm/executor/base_worker.py (2 hunks)
  • tensorrt_llm/lora_helper.py (1 hunks)
  • tensorrt_llm/lora_manager.py (5 hunks)
  • tensorrt_llm/runtime/enc_dec_model_runner.py (2 hunks)
  • tensorrt_llm/runtime/generation.py (2 hunks)
  • tensorrt_llm/runtime/model_runner.py (2 hunks)
  • tensorrt_llm/runtime/model_runner_cpp.py (2 hunks)
  • tests/unittest/llmapi/lora_test_utils.py (2 hunks)
  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/lora_helper.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/runtime/model_runner_cpp.py
  • tensorrt_llm/runtime/model_runner.py
  • tensorrt_llm/executor/base_worker.py
  • tensorrt_llm/runtime/generation.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/runtime/enc_dec_model_runner.py
  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
  • tests/unittest/llmapi/lora_test_utils.py
  • tensorrt_llm/lora_manager.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/lora_helper.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/runtime/model_runner_cpp.py
  • tensorrt_llm/runtime/model_runner.py
  • tensorrt_llm/executor/base_worker.py
  • tensorrt_llm/runtime/generation.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/runtime/enc_dec_model_runner.py
  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
  • tests/unittest/llmapi/lora_test_utils.py
  • tensorrt_llm/lora_manager.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/lora_helper.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/runtime/model_runner_cpp.py
  • tensorrt_llm/runtime/model_runner.py
  • tensorrt_llm/executor/base_worker.py
  • tensorrt_llm/runtime/generation.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/runtime/enc_dec_model_runner.py
  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
  • tests/unittest/llmapi/lora_test_utils.py
  • tensorrt_llm/lora_manager.py
🧠 Learnings (5)
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tensorrt_llm/lora_manager.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

  • tensorrt_llm/lora_manager.py
🧬 Code graph analysis (9)
tensorrt_llm/_utils.py (1)
cpp/include/tensorrt_llm/runtime/modelConfig.h (1)
  • LayerType (60-183)
tensorrt_llm/runtime/model_runner_cpp.py (2)
tensorrt_llm/runtime/generation.py (2)
  • ModelConfig (609-690)
  • from_model_config_cpp (658-690)
tensorrt_llm/lora_manager.py (1)
  • LoraManager (640-1269)
tensorrt_llm/executor/base_worker.py (2)
tensorrt_llm/lora_manager.py (1)
  • LoraManager (640-1269)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • get_lora_manager (1156-1157)
tensorrt_llm/runtime/generation.py (1)
tensorrt_llm/_utils.py (4)
  • binding_layer_type_to_str (200-201)
  • binding_to_str_dtype (204-207)
  • dtype (955-956)
  • dtype (963-973)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
tensorrt_llm/runtime/generation.py (2)
  • ModelConfig (609-690)
  • from_model_config_cpp (658-690)
tensorrt_llm/mapping.py (3)
  • Mapping (32-519)
  • rank (328-329)
  • rank (332-339)
tensorrt_llm/lora_manager.py (1)
  • LoraManager (640-1269)
tensorrt_llm/runtime/enc_dec_model_runner.py (2)
tensorrt_llm/lora_manager.py (1)
  • LoraManager (640-1269)
tensorrt_llm/runtime/model_runner.py (1)
  • mapping (821-822)
tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (3)
tests/unittest/llmapi/lora_test_utils.py (1)
  • check_phi3_lora_fused_modules_output_tp2_identical_to_tp1 (51-65)
tensorrt_llm/llmapi/llm.py (2)
  • LLM (1052-1068)
  • generate (241-319)
tensorrt_llm/llmapi/llm_args.py (1)
  • KvCacheConfig (976-1110)
tests/unittest/llmapi/lora_test_utils.py (2)
tensorrt_llm/lora_helper.py (1)
  • LoraConfig (84-103)
tensorrt_llm/executor/request.py (1)
  • LoRARequest (24-53)
tensorrt_llm/lora_manager.py (3)
tensorrt_llm/mapping.py (1)
  • Mapping (32-519)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • PeftCacheManager (1103-1221)
tensorrt_llm/runtime/generation.py (2)
  • head_size (1237-1238)
  • num_heads (1187-1188)
🪛 Ruff (0.13.3)
tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py

72-72: Undefined name skip_ray

(F821)


75-75: Undefined name llama_model_path

(F821)


79-79: Undefined name GenerationExecutorRpcProxy

(F821)


82-82: Undefined name SamplingParams

(F821)


91-91: Undefined name skip_ray

(F821)


95-95: Undefined name llama_model_path

(F821)


99-99: Undefined name GenerationExecutorRpcProxy

(F821)


102-102: Undefined name SamplingParams

(F821)

tests/unittest/llmapi/lora_test_utils.py

29-29: Unused noqa directive (non-enabled: D205)

Remove unused noqa directive

(RUF100)


56-56: Unused noqa directive (non-enabled: D205)

Remove unused noqa directive

(RUF100)

tensorrt_llm/lora_manager.py

983-983: Unused noqa directive (non-enabled: D205)

Remove unused noqa directive

(RUF100)


1016-1016: Unused noqa directive (non-enabled: D205)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (22)
tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py (2)

8-8: LGTM!

The import of the new test utility function is correctly added and will be used by the new test on line 64.


62-68: LGTM!

The test function correctly validates that LoRA fused module outputs are identical between TP=1 and TP=2, which is exactly what this PR aims to fix. The CUDA graph is appropriately disabled with a TODO comment noting the need for a proper fix.

tests/unittest/llmapi/lora_test_utils.py (6)

5-5: LGTM!

The List import is correctly added to support the type annotation changes in this file.


14-14: LGTM!

The LoraConfig import is correctly added and used in the new helper function _generate_phi3_response_lora_fused_modules.


16-20: LGTM!

The Russian language prompts are appropriate test data for the Phi-3 Russian LoRA adapter being tested.


23-48: LGTM!

The helper function correctly:

  • Sets up the Phi-3 model and Russian LoRA adapter paths
  • Configures LoRA with appropriate parameters (rank=16, max_loras=2)
  • Creates LoRA requests for all prompts
  • Generates outputs and returns the text

The implementation properly tests fused attention QKV and fused MLP gate up proj modules, which is the core functionality being fixed in this PR.


51-65: LGTM!

The test utility correctly validates that outputs are identical between TP=1 and TP=2, which verifies the fix for fused LoRA adapter modules weight split with tensor parallelism. The implementation:

  • Runs generation with TP=1
  • Runs generation with TP=2
  • Asserts outputs are equal

This directly addresses the PR objective.


69-69: Type annotation consistency improved.

Changing from list[int] to List[int] maintains consistency with the imported List type from typing (line 5) and ensures Python 3.8+ compatibility as required by the coding guidelines.

Based on coding guidelines: Python code must target Python 3.8+, and the generic list[int] syntax was introduced in Python 3.9.
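
As a quick illustration of this compatibility point (standard typing behavior, not code from this PR):

from typing import List

def first_token(ids: List[int]) -> int:  # parses and runs on Python 3.8+
    return ids[0]

# def first_token(ids: list[int]) -> int:  # built-in generic syntax needs Python 3.9+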

tensorrt_llm/lora_helper.py (1)

49-49: LGTM!

The addition of the "attn_qkv": "qkv_proj" mapping entry is correct and consistent with existing patterns in the mapping dictionary. This enables support for fused QKV LoRA modules as intended by the PR.
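
For context, the mapping in question looks roughly like the excerpt below; only the "attn_qkv" entry is confirmed by this PR, while the dictionary name and the other entries are hypothetical placeholders.

# Hypothetical excerpt of the LoRA module-name mapping in lora_helper.py.
DEFAULT_LORA_MODULE_MAPPING = {
    "attn_q": "q_proj",      # assumed pre-existing entry
    "attn_k": "k_proj",      # assumed pre-existing entry
    "attn_v": "v_proj",      # assumed pre-existing entry
    "attn_qkv": "qkv_proj",  # entry added by this PR for fused QKV adapters
}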

tensorrt_llm/runtime/model_runner.py (2)

614-619: LGTM!

The LoraManager instantiation and load_from_ckpt call have been correctly updated to align with the new API:

  • LoraManager now receives explicit mapping and model_config parameters
  • load_from_ckpt no longer requires runtime_mapping parameter

These changes are consistent with the broader refactoring of LoRA configuration management.


723-728: LGTM!

The LoraManager instantiation and load_from_ckpt call follow the same pattern as in the from_engine method, maintaining consistency across both initialization paths.

tensorrt_llm/_utils.py (1)

44-44: LGTM!

The LayerType import is correctly added to support the new helper function.

tensorrt_llm/runtime/enc_dec_model_runner.py (2)

177-186: LGTM!

The encoder LoraManager initialization correctly uses the updated API with explicit mapping and model_config parameters. The subsequent load_from_hf call properly omits the runtime_mapping parameter, aligning with the refactored API signature.


202-211: LGTM!

The decoder LoraManager initialization mirrors the encoder pattern, maintaining consistency. Both use the appropriate runtime mapping and model config for their respective components.

tensorrt_llm/runtime/generation.py (2)

43-45: Imports look good

New utilities are correctly imported for CPP-to-Python conversion.


657-691: Fix broken line continuation for max_prompt_embedding_table_size
Replace the split lines with

max_prompt_embedding_table_size=model_config_cpp.max_prompt_embedding_table_size,

Also verify that using model_config_cpp.num_layers() vs. a property and model_config_cpp.num_kv_heads(0) fallback semantics align with the C++ bindings.
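
For orientation, a hedged sketch of gathering these fields from the bindings object, in the spirit of a from_model_config_cpp-style converter; the attributes marked "assumed" are guesses that may differ in the actual bindings, and the real converter handles more fields than shown here.

from tensorrt_llm._utils import binding_layer_type_to_str, binding_to_str_dtype

def convert_cpp_model_config_fields(model_config_cpp) -> dict:
    # Sketch only: collect the fields discussed in this review as plain Python values.
    return {
        "num_layers": model_config_cpp.num_layers(),        # callable, per the review note
        "num_kv_heads": model_config_cpp.num_kv_heads(0),   # first-layer fallback, per the review note
        "dtype": binding_to_str_dtype(model_config_cpp.data_type),       # attribute name assumed
        "layer_types": [binding_layer_type_to_str(lt)
                        for lt in model_config_cpp.layer_types],         # attribute name assumed
        "max_prompt_embedding_table_size":
            model_config_cpp.max_prompt_embedding_table_size,
    }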

tensorrt_llm/executor/base_worker.py (1)

208-211: TRT path: LoraManager construction aligns with new API

Instantiating LoraManager with mapping and runtime model_config is correct for TRT-Python flow. Given TP>1 is central to this fix, please confirm engine_config.pretrained_config.mapping reflects the runtime TP/PP/CP as expected.

tensorrt_llm/lora_manager.py (2)

663-669: Constructor/API refactor looks good

Storing mapping/model_config on the instance and switching to keyword-only params aligns the class with new call sites.

Also applies to: 714-716


739-768: load_from_ckpt signature change is consistent

Dropping runtime_mapping and routing via self for TP info is correct and aligns with updated call sites.

tensorrt_llm/runtime/model_runner_cpp.py (1)

281-285: LoRA manager construction aligned with new API

Passing Mapping and a Python ModelConfig converted from C++ bindings is correct.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)

15-15: ModelConfig aliasing is appropriate

Using ModelConfigPython and ModelConfigCpp aliases clarifies intent across Python/CPP boundaries.

Also applies to: 35-36


1108-1157: PEFT: construct and expose LoraManager correctly

Mapping from WorldConfig and passing a Python-side ModelConfig to LoraManager is correct; get_lora_manager accessor is useful for higher-level code.

@amitz-nv
Collaborator Author

/bot run

@amitz-nv amitz-nv requested a review from byshiue October 13, 2025 09:42
@tensorrt-cicd
Collaborator

PR_Github #21192 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21182 [ run ] completed with state ABORTED
LLM/release-1.1/L0_MergeRequest_PR #112 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #21192 [ run ] completed with state SUCCESS
/LLM/release-1.1/L0_MergeRequest_PR pipeline #116 completed with status: 'FAILURE'

@amitz-nv amitz-nv enabled auto-merge (squash) October 13, 2025 13:34
@amitz-nv amitz-nv force-pushed the release-1-1-fix-lora-fused-modules-weights branch from fb883fc to 44072cd Compare October 13, 2025 13:37
@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21223 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21223 [ run ] completed with state SUCCESS
/LLM/release-1.1/L0_MergeRequest_PR pipeline #121 completed with status: 'FAILURE'

@amitz-nv
Collaborator Author

/bot run

…RA adapter modules weight split with TP>1 (NVIDIA#8063)

Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
…pick

Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
@amitz-nv amitz-nv force-pushed the release-1-1-fix-lora-fused-modules-weights branch from 44072cd to ef3b7c8 Compare October 15, 2025 07:06
@tensorrt-cicd
Collaborator

PR_Github #21442 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21442 [ run ] completed with state SUCCESS
/LLM/release-1.1/L0_MergeRequest_PR pipeline #145 completed with status: 'FAILURE'

@amitz-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21460 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21460 [ run ] completed with state SUCCESS
/LLM/release-1.1/L0_MergeRequest_PR pipeline #148 completed with status: 'SUCCESS'

@amitz-nv amitz-nv merged commit 27c6c84 into NVIDIA:release/1.1 Oct 15, 2025
5 checks passed
mikeiovine pushed a commit to mikeiovine/TensorRT-LLM that referenced this pull request Oct 20, 2025
…RA adapter modules weight split with TP>1 (NVIDIA#8313)

Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
mikeiovine pushed a commit to mikeiovine/TensorRT-LLM that referenced this pull request Oct 20, 2025
…RA adapter modules weight split with TP>1 (NVIDIA#8313)

Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
