[TRTLLM-5971][feat] Integrate helix parallelism #9342

brb-nv · 2025-11-20T20:41:18Z

Description

This MR integrates helix parallelism, an experimental feature, in TRTLLM.

Background:

Helix parallelism is a decode-only context parallelism method, so it is used in a disaggregated setting where only the generation (gen) servers run with helix.
It shards the request's sequence length across multiple CP (context parallel) ranks.
For a given query token in the decode phase, each CP rank computes “local attention” with respect to the previous tokens it holds.
Subsequent communication among the CP ranks “corrects” the local attention results so that the overall attention computation is exact.
Because KV parallelism applies only to the attention layer, the CP GPUs are "repurposed" as TP GPUs for the FFN layer.
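
The "local attention plus correction" idea can be illustrated with a small pure-Python sketch. This is not the kernel code from this MR; it assumes the correction is the standard log-sum-exp merge of per-shard softmax results, and the helper names `local_attention` and `combine` are made up for illustration.

```python
import math

def local_attention(q, keys, values):
    """Softmax attention of one query over one shard of keys/values.

    Returns the locally normalized output and the log-sum-exp of the scores,
    which is the statistic needed to merge shards exactly.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)

def combine(partials):
    """Merge per-rank (output, lse) pairs into the exact global attention output."""
    global_max = max(lse for _, lse in partials)
    rank_weights = [math.exp(lse - global_max) for _, lse in partials]
    total = sum(rank_weights)
    dim = len(partials[0][0])
    return [sum(w * out[d] for w, (out, _) in zip(rank_weights, partials)) / total
            for d in range(dim)]

# Toy data: 8 cached tokens split across two hypothetical CP ranks.
q = [0.3, -0.7]
keys = [[0.1 * i, -0.05 * i] for i in range(8)]
values = [[float(i), float(i * i)] for i in range(8)]

full, _ = local_attention(q, keys, values)        # single-rank reference
rank0 = local_attention(q, keys[:4], values[:4])  # local attention on shard 0
rank1 = local_attention(q, keys[4:], values[4:])  # local attention on shard 1
merged = combine([rank0, rank1])                  # "correction" step
```

Under these assumptions, `merged` matches `full` up to floating-point error, which is the sense in which the sharded computation is exact.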

Changes in this MR:

At a high level, this MR enables helix parallelism with DeepseekV3 and adds a disaggregated-serving integration test (a smoke test for now).
Example to explain the core changes:
- Suppose we are at the first decode step for a request with ISL (input sequence length) 7, and the gen server has two-way context parallelism, i.e. cpSize=2.
- Let's say the first 4 tokens reside on cpRank0 and the next 3 tokens reside on cpRank1.
- We have an incoming query token, q7 (corresponding to the first generated token). While local attention for q7 is computed on both CP ranks, its KV cache entry is written to only one CP rank (rank 1 in the example), and kv7 is included in local attention only on that rank. We call this rank the "active helix rank".
Known limitation: currently only the last CP rank is treated as the active rank. This will be lifted in a follow-up MR.

Most changes in this MR enforce this:

The KV cache entry for the query token is added only on the active rank, in resource_manager.py.
The actual KV cache write happens in the MLA RoPE kernels, which are changed to skip the write on inactive ranks.
The number of tokens considered in local attention is determined by seq_len_kv in trtllm.py, which is adjusted accordingly.
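
The bookkeeping in the ISL 7, cpSize=2 example can be sketched as follows. The helpers `split_prompt` and `decode_step_kv_lens` are hypothetical and only mirror the example above; in particular, the contiguous 4/3 split is an assumption taken from the example, and the real sharding may be block-based.

```python
def split_prompt(isl, cp_size):
    """Contiguous split of prompt tokens across CP ranks.

    Assumption: earlier ranks receive the extra tokens, which reproduces
    the 4/3 split used in the example.
    """
    base, extra = divmod(isl, cp_size)
    return [base + (1 if rank < extra else 0) for rank in range(cp_size)]

def decode_step_kv_lens(prompt_lens, num_generated):
    """Per-rank KV length seen by local attention after num_generated decode tokens.

    Only the last CP rank (the active helix rank) stores KV for generated tokens.
    """
    active_rank = len(prompt_lens) - 1
    return [n + (num_generated if rank == active_rank else 0)
            for rank, n in enumerate(prompt_lens)]

cp_size = 2
prompt_lens = split_prompt(7, cp_size)            # tokens held per rank after prefill
kv_lens = decode_step_kv_lens(prompt_lens, 1)     # first decode step (q7 / kv7)
is_inactive = [rank != cp_size - 1 for rank in range(cp_size)]

print(prompt_lens)  # [4, 3]
print(kv_lens)      # [4, 4]
print(is_inactive)  # [True, False]
```

On rank 0 the KV length stays at 4 because kv7 is never written there, while rank 1 grows from 3 to 4.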

"Repurposing" attn CP ranks to FFN TP ranks can make things quite messy. To keep this readable,

We pass the mapping with CP only to the attention layers in modeling_deepseekv3.py and pass a mapping without CP to the rest.
We use a similar trick in communicator.py to obtain the right TP groups.
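
A minimal sketch of the "repurposing" idea, assuming the mapping without CP is obtained by folding the CP dimension into TP. `ToyMapping` and `repurpose_cp_to_tp` are hypothetical stand-ins for illustration, not the actual `Mapping` class in tensorrt_llm/mapping.py.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ToyMapping:
    """Hypothetical stand-in holding only the parallel sizes."""
    tp_size: int
    cp_size: int
    pp_size: int = 1

    @property
    def world_size(self):
        return self.tp_size * self.cp_size * self.pp_size

def repurpose_cp_to_tp(mapping_with_cp):
    """Fold the CP dimension into TP so non-attention layers see a plain TP group."""
    return replace(
        mapping_with_cp,
        tp_size=mapping_with_cp.tp_size * mapping_with_cp.cp_size,
        cp_size=1,
    )

# Sizes chosen to match the gen side of the new test config (tp=1, cp=2).
mapping_with_cp = ToyMapping(tp_size=1, cp_size=2)         # for attention layers
mapping_without_cp = repurpose_cp_to_tp(mapping_with_cp)   # for FFN and the rest
```

Keeping the two mappings as separate immutable objects avoids mutating shared state, and the world size is the same under both views.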

Test Coverage

$ pytest tests/unittest/_torch/modules/test_mla_helix.py -s -v
$ TRTLLM_USE_UCX_KVCACHE=1 TLLM_LOG_LEVEL=INFO pytest tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix -s -v

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

Summary by CodeRabbit

Release Notes

New Features
- Added context parallelism support with Helix-based distributed inference capabilities
- DeepSeekV3 model now supports context parallelism for enhanced performance on multi-GPU setups
- New --cp_size command-line argument for configuring context parallel size (default: 1)
- Enhanced disaggregated serving configuration for context-tensor parallel distribution
Tests
- Added new test configuration for disaggregated DeepSeekV3 inference with context parallelism


tensorrt_llm/_torch/pyexecutor/executor_request_queue.py

coderabbitai · 2025-11-21T18:05:07Z

Walkthrough

This pull request implements context parallelism support with Helix configuration across the TensorRT-LLM inference stack. It adds per-rank inactivity tracking (helix_is_inactive_rank) to CUDA kernels and Python layers, introduces CP size configuration parameters, implements mapping repurposing logic for CP/TP distribution, and extends model initialization and executor logic to handle inactive Helix ranks during generation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CUDA Kernel Signatures `cpp/tensorrt_llm/kernels/mlaKernels.cu`, `cpp/tensorrt_llm/kernels/mlaKernels.h` | Added `helix_is_inactive_rank` boolean pointer parameter to MLA rope generation kernel signatures; threaded through kernel invocations to gate token processing and K/V updates based on rank inactivity status. |
| Tensor Operations & Rope Generation `cpp/tensorrt_llm/thop/attentionOp.cpp`, `cpp/tensorrt_llm/thop/dsv3RopeOp.cpp` | Extended MLA tensor parameter handling to expect and forward two tensors (helix_position_offsets, helix_is_inactive_rank); added new field to MlaRopeGenArgs struct and propagated inactive rank mask through rope generation pipelines. |
| Torch Attention Backend `tensorrt_llm/_torch/attention_backend/trtllm.py` | Added `helix_position_offsets` and `helix_is_inactive_rank` to plan/forward/mla_rope_generation APIs; extended TrtllmAttentionMetadata with inactive rank tracking; adjusted KV length planning to exclude inactive rank contributions. |
| Distributed Communication `tensorrt_llm/_torch/distributed/communicator.py` | Implemented early CP communicator creation and mapping repurposing logic: when cp_size > 1, creates a copy with Helix mapping, scales TP by CP size, and restores original mapping after TP/PP communicator initialization. |
| Model Architecture `tensorrt_llm/_torch/models/modeling_deepseekv3.py` | Extended DeepseekV3 layer constructors with optional `mapping_with_cp` parameter; added CP rank/size extraction and weight-split logic for KV projection; implemented mapping repurposing during model initialization for cp_size > 1. |
| Attention Modules `tensorrt_llm/_torch/modules/attention.py` | Added `mapping_with_cp` parameter to MLA and Attention constructors; enforced num_heads equality and Helix CP type validation; updated forward paths to propagate helix parameters and support position_ids threading. |
| Executor & Resource Management `tensorrt_llm/_torch/pyexecutor/executor_request_queue.py`, `tensorrt_llm/_torch/pyexecutor/llm_request.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py`, `tensorrt_llm/_torch/pyexecutor/resource_manager.py` | Added `py_helix_is_inactive_rank` flag to LlmRequest; implemented helix inactive rank tracking in model engine with conditional position/token calculations; gated KV cache allocation for inactive ranks in resource manager; extended AttentionMetadata with inactive rank exposure. |
| CLI & Configuration `examples/llm-api/quickstart_advanced.py`, `tensorrt_llm/commands/serve.py` | Added `--cp_size` and `cp_config` command-line arguments; propagated context_parallel_size through LLM initialization; implemented cp_type string-to-enum conversion with validation. |
| Infrastructure & Mapping `tensorrt_llm/llmapi/disagg_utils.py`, `tensorrt_llm/mapping.py` | Updated instance rank calculation to include context_parallel_size; added hardcoded Helix CP type fallback when cp_size > 1 to override externally provided cp_config. |
| Test Infrastructure `tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml`, `tests/integration/defs/disaggregated/test_disaggregated.py` | Added new disaggregated test configuration file for context TP and generation Helix setup; introduced test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix test case with model symlink setup. |

Sequence Diagram(s)

sequenceDiagram
    participant Request
    participant ResourceMgr as Resource<br/>Manager
    participant ModelEngine
    participant AttentionBE as Attention<br/>Backend
    participant MLAKernel as MLA<br/>Kernel

    Request->>ResourceMgr: prepare_resources()
    activate ResourceMgr
    alt cp_size > 1 and not last rank
        ResourceMgr->>ResourceMgr: mark py_helix_is_inactive_rank=true
        ResourceMgr->>ResourceMgr: skip KV cache allocation
    else active rank
        ResourceMgr->>ResourceMgr: allocate KV cache normally
    end
    deactivate ResourceMgr

    Request->>ModelEngine: forward pass (generation)
    activate ModelEngine
    alt helix_is_inactive_rank[batch]==true
        ModelEngine->>ModelEngine: fix past_seen_token_num<br/>(no increment)
        ModelEngine->>ModelEngine: skip token processing
    else active
        ModelEngine->>ModelEngine: increment past_seen_token_num
        ModelEngine->>AttentionBE: plan() with helix params
    end
    deactivate ModelEngine

    AttentionBE->>AttentionBE: adjust kv_lens planning<br/>(exclude inactive ranks)
    AttentionBE->>MLAKernel: mla_rope_generation<br/>(helix_is_inactive_rank)
    activate MLAKernel
    alt helix_is_inactive_rank[batch]==true
        MLAKernel->>MLAKernel: skip token processing
        MLAKernel->>MLAKernel: skip K/V updates
    else active
        MLAKernel->>MLAKernel: apply rope & assign QKV
        MLAKernel->>MLAKernel: update K/V cache
    end
    deactivate MLAKernel

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

Mapping repurposing logic (communicator.py, modeling_deepseekv3.py, mapping.py): Core logic for switching between CP and TP distributions; mutations and restorations must be correctly sequenced and scoped to avoid state leaks.
KV length planning adjustments (trtllm.py, model_engine.py): Changes to how KV cache lengths are calculated when inactive ranks are present; verify accounting is correct for all rank states.
Warmup control flow (model_engine.py): Conditional position_id and past_seen_token_num calculations based on warmup state and inactivity; ensure all branches are consistent.
Cross-layer parameter threading (executor_request_queue.py, model_engine.py, resource_manager.py): helix_is_inactive_rank flows through multiple abstraction layers; verify end-to-end propagation and type conversions (bool → tensor → pointer).
Model initialization side effects (modeling_deepseekv3.py): Temporary mapping mutations during model construction; verify original mapping is reliably restored even on error paths.

Suggested reviewers

schetlur-nv
nvchenghaoz
Shixiaowei02
Superjomn
Tabrizian
Funatiq
QiJune

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 34.04% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description provides a comprehensive explanation of helix parallelism, background context, specific implementation details with an example, test coverage commands, and confirmation of the PR checklist. |
| Title check | ✅ Passed | The PR title '[TRTLLM-5971][feat] Integrate helix parallelism' clearly and specifically describes the main change: integration of helix parallelism into TensorRT-LLM. |

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

tensorrt_llm/mapping.py (1)
455-491: Don’t silently override cp_config to HELIX for any cp_size > 1

This block in Mapping.__init__:
#################################################################
# TODO: Remove this hardcoding and obtain cp_config from llm_args.
if cp_size > 1:
    cp_config = {"cp_type": CpType.HELIX}
#################################################################
has broad side effects:

Any caller that provides a non-Helix cp_config (e.g. STAR or ULYSSES) with cp_size > 1 now gets that configuration silently discarded and treated as HELIX.

Code that branches on cp_config["cp_type"] (e.g. _merge_requests in executor_request_queue.py, STAR attention paths, etc.) will never see CpType.STAR/ULYSSES once cp_size > 1, effectively breaking those CP modes.

Additional cp_config fields (like STAR’s block_size / cp_anchor_size, or future Helix parameters) are lost.

If the intent is “for now we only support Helix when cp_size > 1”, it’s safer to:

Only inject a default when cp_config is missing; and

Fail fast on conflicting configs instead of overriding them:
# Temporary default until cp_config is fully plumbed from llm_args.
if cp_size > 1:
    if cp_config is None:
        cp_config = {"cp_type": CpType.HELIX}
    elif cp_config.get("cp_type") != CpType.HELIX:
        raise ValueError(
            f"Only CpType.HELIX is currently supported when cp_size > 1; got {cp_config.get('cp_type')!r}"
        )
That keeps Helix as the only supported multi-CP mode in this PR, but avoids surprising behavior for existing STAR/ULYSSES configs and makes future extension to other CP types straightforward.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1568-1623: Tighten helix_is_inactive_rank initialization guard; verify warmup dummy request semantics for Helix

The new Helix logic is mostly sound, but there is one definite initialization bug and one edge case to confirm:

helix_is_inactive_rank initialization guard is incorrect

The current initialization at line 1568:
helix_is_inactive_rank = [] if self.mapping.cp_size > 1 else None
initializes to an empty list for all CP configurations with cp_size > 1, but has_cp_helix() returns True only when both cp_size > 1 and cp_type == CpType.HELIX. For non-Helix CP types (e.g., regular CP or other variants), this creates an empty list that never gets populated, diverging from the None state that downstream consumers expect when Helix is disabled.

Fix: Change line 1568 to:
helix_is_inactive_rank = [] if self.mapping.has_cp_helix() else None

Warmup + Helix: past_seen_token_num override semantics need verification

During warmup, you correctly skip the position_id computation (line 1605), but past_seen_token_num is unconditionally overridden based on request.orig_prompt_len (lines 1608–1612) whenever Helix is active. This is fed into num_cached_tokens_per_seq, which becomes part of KVCacheParams. For dummy warmup requests, ensure:

orig_prompt_len is consistently initialized for all dummy request types created during warmup, and

the resulting KV cache index values remain within valid bounds on inactive Helix ranks.

Per-request inactivity flag wiring looks correct

The per-beam append pattern (lines 1572–1619) produces a helix_is_inactive_rank list with length equal to the total batch size (sum of beam widths), which matches the attention backend's [batch_size] expectation.
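
The guard fix and the per-beam append pattern discussed above can be sketched together in isolation. `ToyRequest` and `collect_inactive_flags` are hypothetical names for illustration only; the real logic lives inside the model engine's input preparation and is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class ToyRequest:
    """Hypothetical stand-in for a generation request."""
    beam_width: int
    py_helix_is_inactive_rank: bool

def collect_inactive_flags(requests, has_cp_helix):
    """Return one flag per beam, or None when Helix is disabled.

    The guard uses the Helix-specific predicate rather than cp_size > 1,
    so non-Helix CP modes keep the None state.
    """
    if not has_cp_helix:
        return None
    flags = []
    for req in requests:
        # One entry per beam so the list length equals the total batch size.
        flags.extend([req.py_helix_is_inactive_rank] * req.beam_width)
    return flags

requests = [ToyRequest(beam_width=1, py_helix_is_inactive_rank=True),
            ToyRequest(beam_width=2, py_helix_is_inactive_rank=True)]
flags_helix = collect_inactive_flags(requests, has_cp_helix=True)
flags_other = collect_inactive_flags(requests, has_cp_helix=False)
```

With these inputs `flags_helix` has three entries (the sum of the beam widths) and `flags_other` is `None`.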
2526-2537: Behavioral inconsistency confirmed: ULYSSES passes warmup checks but fails at runtime

The review concern is valid. I found that the change introduces a systemic breaking behavior across three PyExecutor methods:

model_engine.py._prepare_inputs (line 2536): Raises for non-STAR/HELIX

executor_request_queue.py._merge_requests (line 725): Raises for non-STAR/HELIX

py_executor.py._update_request_states (line 2072): Raises for non-STAR/HELIX

The critical inconsistency:

Warmup check (model_engine.py line 564) accepts ULYSSES and returns early

Runtime execution (line 2536) raises NotImplementedError if ULYSSES reaches _prepare_inputs

This means if someone configures PyExecutor with cp_type=ULYSSES, it will pass initialization but crash during inference

ULYSSES is defined in the CpType enum and explicitly referenced at line 564, indicating it was intended to be handled. However, no fallback path exists in the three runtime dispatch methods, and no test coverage was found for ULYSSES with PyExecutor. The previous behavior would have silently fallen through to the default _prepare_tp_inputs path.

While there's no evidence that existing code uses ULYSSES with PyExecutor, the enum inclusion and warmup-time acceptance create an expectation of support that the runtime contradicts.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
1446-1451: Fix TP sharding after restoring the original mapping

During DeepseekV3ForCausalLM.__init__ we repurpose CP ranks into TP by installing a temporary Mapping (tp_size = tp * cp). All decoder/MTP modules capture that object via self.mapping. Later we restore model_config.mapping back to the original CP-aware mapping. Here in DeepseekV3MTP.forward, the chunking uses self.model_config.mapping.tp_size/tp_rank, which now point to the restored mapping and no longer match the row-parallel tensors created with the repurposed mapping. On Helix runs (cp_size > 1) this leaves each rank feeding the wrong slice (or no slice) into eh_proj, breaking generation.

Use the same mapping object that the layer captured during init. A minimal fix:
-        tp_size = self.model_config.mapping.tp_size
-        tp_rank = self.model_config.mapping.tp_rank
+        tp_size = self.mapping.tp_size
+        tp_rank = self.mapping.tp_rank
That keeps the MTP sharding consistent with the repurposed TP groups.
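
A toy illustration of the mismatch described above, using plain lists instead of tensors. The sizes, the rank, and the `chunk` helper are made up for the example; the only point is that slicing with the restored mapping's `tp_size` yields a slice whose width differs from what a layer built under the repurposed mapping expects.

```python
def chunk(hidden, tp_size, tp_rank):
    """Return the tp_rank-th of tp_size equal contiguous slices."""
    assert len(hidden) % tp_size == 0, "hidden size must divide evenly"
    width = len(hidden) // tp_size
    return hidden[tp_rank * width:(tp_rank + 1) * width]

hidden = list(range(8))

# Assumed sizes: tp=2 and cp=2, so the repurposed mapping has tp_size = tp * cp = 4,
# while the restored CP-aware mapping reports tp_size = 2.
repurposed_tp_size = 4
restored_tp_size = 2

# A row-parallel layer built under the repurposed mapping expects this input width.
expected_in_features = len(hidden) // repurposed_tp_size

right = chunk(hidden, repurposed_tp_size, tp_rank=1)  # consistent with layer init
wrong = chunk(hidden, restored_tp_size, tp_rank=1)    # uses the restored mapping

print(len(right), len(wrong), expected_in_features)  # 2 4 2
```

Reading both values from the mapping object the layer captured at construction time keeps the slice width and the weight shape in agreement.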

♻️ Duplicate comments (1)

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)
645-705: Helix merge: avoid hardcoded tokens_per_block and ensure total_input_len_cp is available on children

Two points in the Helix path:

Hardcoded tokens_per_block=32
elif cp_type == CpType.HELIX:
    return self._merge_helix_requests(
        new_requests,
        tokens_per_block=32)
        # tokens_per_block=cp_config['tokens_per_block'])
This ignores any configured Helix block size (e.g. via cp_config['tokens_per_block'] or KV cache config) and makes the behavior fragile if someone changes the configured block size away from 32.

It also repeats a TODO you already noted to remove this hardcoding.

Suggestion:

Prefer pulling from config with a safe default + assert, e.g.:
tokens_per_block = cp_config.get('tokens_per_block', 32)
assert tokens_per_block > 0
return self._merge_helix_requests(new_requests, tokens_per_block=tokens_per_block)
or at minimum assert that a configured value, if present, matches 32 so misconfigurations fail loudly instead of silently diverging.

total_input_len_cp not propagated to child requests
req = executor_request_to_llm_request(...)
req.total_input_len_cp = input_len
req_with_children.append(req)
if req.child_requests:
    req_with_children.extend(req.child_requests)
executor_request_to_llm_request creates child requests via LlmRequest.create_child_request, which only copies attributes whose names start with py_.

As a result, total_input_len_cp exists only on the parent; any downstream code that expects this attribute on every LlmRequest (including children when num_return_sequences > 1) will not find it.

Possible fix:

Either rename to follow the py_ convention so it’s auto-copied:
req.py_total_input_len_cp = input_len
for child in req.child_requests:
    child.py_total_input_len_cp = input_len
Or, if you deliberately want a non-py_ attribute, explicitly set it on children in this loop.

This will keep Helix metadata consistent across parent and child requests and future-proof the code against differing tokens_per_block configs.
#!/bin/bash
# Check how Helix-related fields are used so they stay consistent.
rg -n "total_input_len_cp" -C3
rg -n "tokens_per_block" tensorrt_llm/_torch/pyexecutor -C3
Also applies to: 710-723
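
The attribute-propagation concern above can be reproduced with a toy class. This is a sketch that assumes, as the comment states, that child creation copies only attributes whose names start with `py_`; `ToyRequest` is hypothetical and is not the real `LlmRequest`.

```python
class ToyRequest:
    """Hypothetical request whose children inherit only py_-prefixed attributes."""

    def __init__(self):
        self.child_requests = []

    def create_child_request(self):
        child = ToyRequest()
        for name, value in vars(self).items():
            if name.startswith("py_"):
                setattr(child, name, value)
        self.child_requests.append(child)
        return child

parent = ToyRequest()
parent.total_input_len_cp = 7      # plain attribute: not copied to children
parent.py_total_input_len_cp = 7   # py_ attribute: copied at child creation
child = parent.create_child_request()

print(hasattr(child, "total_input_len_cp"))  # False
print(child.py_total_input_len_cp)           # 7

# Attributes assigned after the children already exist are not copied under
# either naming scheme, so they have to be set on each child explicitly.
parent.py_late_value = 1
for c in parent.child_requests:
    c.py_late_value = parent.py_late_value
```

The last loop mirrors the explicit per-child assignment suggested in the comment, which is needed whenever the value is attached after the children have been created.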

🧹 Nitpick comments (5)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

440-449: Decode-time KV allocation correctly gated on active Helix rank

Marking req.py_helix_is_inactive_rank on non-last CP ranks when has_cp_helix() and skipping add_token there ensures only the active Helix rank allocates decode-time KV cache, which matches the design.

You might consider using mapping.is_last_cp_rank() (and/or setting this flag once at request construction) for slightly clearer intent, but the current logic is functionally sound.

examples/llm-api/quickstart_advanced.py (1)

71-76: cp_size flag and context_parallel_size wiring are consistent

The new --cp_size argument and its use as context_parallel_size=args.cp_size in the LLM constructor align with the new CP plumbing. The change is self-contained and doesn’t affect existing callers.

Optionally, you might extend the help string for --cp_size to mention that multi-CP currently implies Helix in this flow, so users know what they’re opting into.

Also applies to: 261-264

tests/integration/defs/disaggregated/test_disaggregated.py (1)

154-274: New DeepSeek V3 Lite bf16 Helix disaggregated test wiring looks consistent

The new config entry and test_disaggregated_deepseek_v3_lite_bf16_tllm_gen_helix follow the same symlink + run_disaggregated_test pattern as the existing DeepSeek tests, and the test_desc string matches the key added to config_map, so the wiring looks correct.

If you want to silence Ruff’s ARG001 warning, you could rename disaggregated_test_root to _disaggregated_test_root in the new test (or add a # noqa: ARG001), but that’s cosmetic and consistent with the rest of this file.

Also applies to: 1915-1933

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
12-13: Remove duplicate LlmRequest import

LlmRequest is imported twice in this file (here and again at line 62). You can safely drop the earlier import and keep the one that also brings in get_draft_token_length:
-from .llm_request import LlmRequest
-
 import torch
This keeps imports minimal without changing behavior.

tensorrt_llm/commands/serve.py (1)
5-5: Drop the duplicate gc import
Line 2 already imports gc, so this second import triggers Ruff F811 (Redefinition of unused gc). Please drop the extra line to keep lint happy.
-import gc

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b2abb8 and 50436a1.

📒 Files selected for processing (18)

cpp/tensorrt_llm/kernels/mlaKernels.cu (4 hunks)
cpp/tensorrt_llm/kernels/mlaKernels.h (1 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (2 hunks)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (6 hunks)
examples/llm-api/quickstart_advanced.py (2 hunks)
tensorrt_llm/_torch/attention_backend/trtllm.py (9 hunks)
tensorrt_llm/_torch/distributed/communicator.py (2 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (12 hunks)
tensorrt_llm/_torch/modules/attention.py (10 hunks)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3 hunks)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
tensorrt_llm/commands/serve.py (7 hunks)
tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
tensorrt_llm/mapping.py (1 hunks)
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1 hunks)
tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)

🧰 Additional context used

🧠 Learnings (27)

📓 Common learnings

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

📚 Learning: 2025-08-14T15:43:23.107Z

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py
tensorrt_llm/_torch/modules/attention.py

📚 Learning: 2025-08-14T15:38:01.771Z

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-09-29T15:14:28.503Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py

📚 Learning: 2025-09-29T15:14:28.503Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py

📚 Learning: 2025-08-14T21:04:50.248Z

Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-08-15T06:46:53.813Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-08-09T20:57:04.084Z

Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp

📚 Learning: 2025-08-14T06:36:40.701Z

Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py

📚 Learning: 2025-08-26T06:07:02.166Z

Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

cpp/tensorrt_llm/thop/attentionOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/model_engine.py

📚 Learning: 2025-09-23T14:58:05.372Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

tensorrt_llm/llmapi/disagg_utils.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/resource_manager.py
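
The ceil-like partitioning described in this learning can be sketched in plain Python. This only illustrates the arithmetic: the helper name `token_range_for_rank` and its return shape are assumptions for this example, not the kernel's actual interface. Every rank gets the same `tokens_per_rank`, and clamping plays the role of the runtime bounds check that absorbs the overshoot on later ranks.

```python
def token_range_for_rank(num_tokens: int, world_size: int, rank: int):
    """Return (tokens_per_rank, start, end) for one rank.

    Hypothetical helper: every rank uses the same ceil-divided
    tokens_per_rank, so all ranks would launch the same amount of work.
    """
    # Ceil division without floating point.
    tokens_per_rank = -(-num_tokens // world_size)
    # Clamp to num_tokens; this mirrors the runtime bounds check that
    # discards work assigned to non-existent tokens on later ranks.
    start = min(rank * tokens_per_rank, num_tokens)
    end = min(start + tokens_per_rank, num_tokens)
    return tokens_per_rank, start, end


# 5 tokens over 4 ranks: the last rank ends up with an empty range.
ranges = [token_range_for_rank(5, 4, r) for r in range(4)]
```

With 5 tokens on 4 ranks, each rank is sized for 2 tokens, so the third rank holds a single token and the fourth holds none.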

📚 Learning: 2025-08-19T12:45:11.997Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/attention_backend/trtllm.py

📚 Learning: 2025-09-02T13:42:44.885Z

Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

tensorrt_llm/_torch/distributed/communicator.py
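
One way to keep tag namespaces disjoint, as this learning requires, is to give each communication type its own base offset. The sketch below is a hedged illustration only: `TagBase`, `TAGS_PER_NAMESPACE`, `make_tag`, and the specific offsets are hypothetical names and values, not the ones used in py_executor.py.

```python
from enum import IntEnum

# Hypothetical width of each namespace; the tag within a namespace
# (e.g. a micro-batch id) must stay below this bound.
TAGS_PER_NAMESPACE = 10_000


class TagBase(IntEnum):
    """Assumed base offsets, one per communication type."""
    TOKENS = 0
    LOGITS = 1 * TAGS_PER_NAMESPACE
    TERMINATION = 2 * TAGS_PER_NAMESPACE


def make_tag(base: TagBase, local_id: int) -> int:
    """Map (communication type, local id) to a globally unique tag."""
    if not 0 <= local_id < TAGS_PER_NAMESPACE:
        raise ValueError(f"local_id {local_id} is outside the namespace")
    # The largest tag here is 29_999, below the 32_767 minimum that the
    # MPI standard guarantees for MPI_TAG_UB.
    return int(base) + local_id


token_tag = make_tag(TagBase.TOKENS, 3)
logit_tag = make_tag(TagBase.LOGITS, 3)
```

Because the ranges never overlap, a token message and a logits message with the same local id and the same source/destination pair still carry different tags.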

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/integration/defs/disaggregated/test_disaggregated.py
tensorrt_llm/_torch/pyexecutor/model_engine.py

📚 Learning: 2025-09-09T09:40:45.658Z

Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

tests/integration/defs/disaggregated/test_disaggregated.py

📚 Learning: 2025-08-01T15:14:45.673Z

Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

tests/integration/defs/disaggregated/test_disaggregated.py

📚 Learning: 2025-08-08T22:03:40.707Z

Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu

📚 Learning: 2025-09-23T15:01:00.070Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu

📚 Learning: 2025-08-21T02:39:12.009Z

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu

📚 Learning: 2025-08-15T06:46:54.897Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

cpp/tensorrt_llm/kernels/mlaKernels.cu
tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-09-23T15:12:38.312Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.

Applied to files:

tensorrt_llm/mapping.py

📚 Learning: 2025-08-21T09:41:49.347Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-08-20T06:48:45.368Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

tensorrt_llm/_torch/pyexecutor/resource_manager.py

📚 Learning: 2025-08-06T13:58:07.506Z

Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

tensorrt_llm/_torch/pyexecutor/model_engine.py

📚 Learning: 2025-08-26T09:37:10.463Z

Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

tensorrt_llm/commands/serve.py

📚 Learning: 2025-08-26T09:37:10.463Z

Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

tensorrt_llm/commands/serve.py

📚 Learning: 2025-08-27T14:23:55.566Z

Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

tensorrt_llm/_torch/modules/attention.py

🧬 Code graph analysis (12)

tensorrt_llm/_torch/distributed/communicator.py (2)

tensorrt_llm/mapping.py (3)

Mapping (351-515)

rank (187-188)

rank (191-198)

tensorrt_llm/llmapi/llm_args.py (2)

world_size (459-460)

world_size (469-473)

tests/integration/defs/disaggregated/test_disaggregated.py (2)

tests/integration/defs/conftest.py (4)

disaggregated_test_root (2618-2623)

disaggregated_example_root (285-290)

llm_venv (702-719)

deepseek_v3_model_root (616-631)

tests/integration/defs/local_venv.py (1)

get_working_directory (43-49)

cpp/tensorrt_llm/kernels/mlaKernels.cu (1)

cpp/tensorrt_llm/kernels/mlaKernels.h (2)

helix_position_offsets (109-134)

helix_is_inactive_rank (112-113)

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)

tensorrt_llm/_torch/distributed/communicator.py (5)

tp_size (64-65)

has_pp (52-53)

cp_size (56-57)

rank (40-41)

rank (457-458)

tensorrt_llm/mapping.py (3)

has_pp (258-259)

rank (187-188)

rank (191-198)

tensorrt_llm/mapping.py (1)

tensorrt_llm/_torch/distributed/communicator.py (2)

cp_size (56-57)

cp_config (108-109)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)

tensorrt_llm/runtime/model_runner.py (1)

mapping (824-825)

tensorrt_llm/_torch/distributed/communicator.py (3)

has_cp_helix (104-105)

cp_rank (68-69)

cp_size (56-57)

tensorrt_llm/mapping.py (2)

has_cp_helix (233-235)

cp_rank (534-535)

tensorrt_llm/_torch/device_mesh.py (1)

cp_rank (84-86)

tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)

tensorrt_llm/_torch/distributed/communicator.py (8)

cp_size (56-57)

cp_rank (68-69)

tp_size (64-65)

world_size (44-45)

rank (40-41)

rank (457-458)

cp_config (108-109)

pp_size (60-61)

tensorrt_llm/mapping.py (4)

cp_rank (534-535)

Mapping (351-515)

rank (187-188)

rank (191-198)

tensorrt_llm/_torch/model_config.py (1)

ModelConfig (75-616)

tensorrt_llm/_torch/pyexecutor/model_engine.py (4)

tensorrt_llm/_torch/pyexecutor/llm_request.py (5)

LlmRequest (437-663)

append (101-127)

append (195-212)

cached_tokens (569-570)

cached_tokens (573-576)

tensorrt_llm/mapping.py (3)

CpType (24-32)

has_cp_helix (233-235)

cp_rank (534-535)

tensorrt_llm/_torch/distributed/communicator.py (3)

cp_size (56-57)

has_cp_helix (104-105)

cp_rank (68-69)

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

is_warmup (344-345)

is_warmup (348-353)

tensorrt_llm/commands/serve.py (3)

tensorrt_llm/runtime/model_runner.py (1)

mapping (824-825)

tensorrt_llm/mapping.py (1)

CpType (24-32)

tensorrt_llm/_torch/distributed/communicator.py (4)

cp_config (108-109)

tp_size (64-65)

pp_size (60-61)

cp_size (56-57)

examples/llm-api/quickstart_advanced.py (1)

tensorrt_llm/_torch/distributed/communicator.py (1)

cp_size (56-57)

tensorrt_llm/_torch/attention_backend/trtllm.py (3)

cpp/tensorrt_llm/kernels/mlaKernels.h (2)

helix_position_offsets (109-134)

helix_is_inactive_rank (112-113)

tensorrt_llm/_torch/attention_backend/flashinfer.py (1)

cached_token_lens (116-118)

tensorrt_llm/_torch/attention_backend/interface.py (2)

seq_lens_kv (226-227)

seq_lens_kv (230-237)

tensorrt_llm/_torch/modules/attention.py (3)

cpp/tensorrt_llm/kernels/helixKernels.h (1)

tensorrt_llm (26-46)

tensorrt_llm/mapping.py (1)

CpType (24-32)

cpp/tensorrt_llm/kernels/mlaKernels.h (1)

helix_position_offsets (109-134)

🪛 Ruff (0.14.5)

tensorrt_llm/_torch/distributed/communicator.py

349-349: f-string without any placeholders

Remove extraneous f prefix

(F541)

367-367: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/integration/defs/disaggregated/test_disaggregated.py

1919-1919: Unused function argument: disaggregated_test_root

(ARG001)

tensorrt_llm/_torch/models/modeling_deepseekv3.py

1561-1561: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)

1588-1588: f-string without any placeholders

Remove extraneous f prefix

(F541)

1660-1660: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt_llm/commands/serve.py

5-5: Redefinition of unused gc from line 2

Remove definition: gc

(F811)

130-131: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

130-131: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (8)

tensorrt_llm/llmapi/disagg_utils.py (1)

199-201: Including PP and CP in instance_num_ranks is consistent with mapping/world-size logic

Using tensor_parallel_size * pipeline_parallel_size * context_parallel_size makes the per-instance rank count match the full mapping used later in split_world_comm, while preserving backward compatibility when PP/CP are omitted (default 1). Looks good.
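
A minimal sketch of the rank-count rule described in this comment, assuming a plain dict per server instance. The standalone helper below is illustrative and is not the actual code in disagg_utils.py; the example values simply mirror the ctxtp2 / gentp1cp2 naming of the test config.

```python
def instance_num_ranks(instance_cfg: dict) -> int:
    """Ranks needed by one server instance: TP * PP * CP.

    Omitted sizes default to 1, so configs that never mention PP or CP
    keep their previous rank count.
    """
    tp = instance_cfg.get("tensor_parallel_size", 1)
    pp = instance_cfg.get("pipeline_parallel_size", 1)
    cp = instance_cfg.get("context_parallel_size", 1)
    return tp * pp * cp


# Context server: TP=2 only. Generation server: TP=1 with Helix CP=2.
ctx_ranks = instance_num_ranks({"tensor_parallel_size": 2})
gen_ranks = instance_num_ranks(
    {"tensor_parallel_size": 1, "context_parallel_size": 2}
)
```

Both instances need two ranks here, although only the generation side uses context parallelism.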

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_gentp1cp2_deepseek_v3_lite_bf16_tllm_gen.yaml (1)

1-32: Helix disaggregated gen config is internally consistent

Context and generation sections use TP/CP sizes in a way that matches the updated disaggregation logic (context_parallel_size only on the generation side for Helix decode-only). No issues spotted.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

441-513: py_helix_is_inactive_rank flag wiring is consistent

Initializing self.py_helix_is_inactive_rank = False alongside other py_* fields and relying on create_child_request’s py_ copying is exactly what the resource manager needs for per-request Helix inactivity tracking. No changes requested.
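
For illustration, the per-request flag could be derived as sketched below. This assumes the limitation stated in the PR description (only the last CP rank is active); `ToyRequest` and `mark_helix_activity` are hypothetical stand-ins for this example, not the real `LlmRequest` or resource manager code.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Stand-in for a request object; only the Helix flag is modelled."""
    request_id: int
    py_helix_is_inactive_rank: bool = False


def mark_helix_activity(request: ToyRequest, cp_rank: int, cp_size: int) -> None:
    """Set the flag so that only the last CP rank counts as active.

    Assumption taken from the PR description: the active Helix rank is
    always cp_size - 1. Without CP (cp_size <= 1) nothing is inactive.
    """
    request.py_helix_is_inactive_rank = cp_size > 1 and cp_rank != cp_size - 1


# cpSize=2: rank 0 is inactive, rank 1 writes the new token's KV cache.
req_rank0 = ToyRequest(request_id=0)
req_rank1 = ToyRequest(request_id=0)
mark_helix_activity(req_rank0, cp_rank=0, cp_size=2)
mark_helix_activity(req_rank1, cp_rank=1, cp_size=2)
```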

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

316-320: CP-aware attachment of Python-only request metadata looks good

Extending the condition to (tp_size > 1 or has_pp or cp_size > 1) ensures CP-only topologies also receive Python-side attributes on non-root ranks, without affecting single-rank setups. This is the right direction.
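
The extended condition can be read as a small predicate. The sketch below only restates the boolean expression quoted above; the function name `should_attach_py_metadata` is an assumption for this example.

```python
def should_attach_py_metadata(tp_size: int, has_pp: bool, cp_size: int) -> bool:
    """True when more than one rank exists along any parallel dimension,
    so non-root ranks also need the Python-only request attributes."""
    return tp_size > 1 or has_pp or cp_size > 1


# A CP-only topology (TP=1, no PP, CP=2) now qualifies.
cp_only = should_attach_py_metadata(tp_size=1, has_pp=False, cp_size=2)
single_rank = should_attach_py_metadata(tp_size=1, has_pp=False, cp_size=1)
```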

cpp/tensorrt_llm/thop/attentionOp.cpp (1)

181-238: No changes needed; concerns are not supported by the codebase

This review comment identifies issues that do not exist in the actual code:

Backward compatibility: All current Python call sites already pass exactly 2 tensors in mla_tensor_params (lines 481 and 1771 in tensorrt_llm/_torch/attention_backend/trtllm.py). The size() == 2 check enforces the existing pattern consistently, not a breaking change.

Uninitialized helix fields: The MlaParams struct explicitly initializes both helix_position_offsets{nullptr} and helix_is_inactive_rank{nullptr} as member initializers, ensuring they are always zero-initialized regardless of constructor form. The suggested MlaParams<T> mla_params{} is redundant.

API design: The required parameter without defaults aligns with TensorRT-LLM's intentional design for consistency in tensor parameter collections (documented in learnings from PR 6904).

Likely an incorrect or invalid review comment.

cpp/tensorrt_llm/kernels/mlaKernels.h (1)

108-112: Helix inactivity flag plumbed cleanly into MLA params

The new helix_is_inactive_rank pointer mirrors the existing helix_position_offsets pattern and keeps the struct’s nullable‑by‑default contract; no issues spotted here.

cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1)

45-70: Helix inactivity flag is correctly threaded through MLA rope generation

The new helix_is_inactive_rank pointer is plumbed cleanly from mla_tensor_params into MlaRopeGenArgs and then into MlaParams, with a strict but reasonable contract (mla_tensor_params.size() == 2). Optional handling via has_value() ? data_ptr : nullptr matches the existing pattern for helix_position_offsets. No functional issues spotted.

Also applies to: 88-111, 139-168, 277-283

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

545-566: Warmup gating by cp_type now correctly excludes only ULYSSES/STAR

The updated warmup logic:
cp_type = self.mapping.cp_config.get('cp_type', None)
if cp_type is not None:
    if cp_type in [CpType.ULYSSES, CpType.STAR]:
        return
means HELIX (and other non‑ULYSSES/STAR cp types) still run warmup, which is what you want for Helix CUDA graph / torch.compile specialization. This looks consistent with the new Helix integration and doesn’t affect non‑CP runs.

chuangz0

looks good to me for disagg part

tensorrt-cicd · 2025-11-27T23:53:57Z

PR_Github #26073 [ run ] triggered by Bot. Commit: 83d6416

tensorrt-cicd · 2025-11-28T07:23:13Z

PR_Github #26073 [ run ] completed with state SUCCESS. Commit: 83d6416
/LLM/main/L0_MergeRequest_PR pipeline #19797 completed with status: 'FAILURE'

MatthiasKohl

LGTM

brb-nv · 2025-11-28T15:28:02Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-11-28T15:33:53Z

PR_Github #26214 [ run ] triggered by Bot. Commit: 83d6416

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv · 2025-11-28T20:49:45Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-11-28T20:55:14Z

PR_Github #26228 [ run ] triggered by Bot. Commit: 98051c3

tensorrt-cicd · 2025-11-28T20:55:15Z

PR_Github #26214 [ run ] completed with state ABORTED. Commit: 83d6416
LLM/main/L0_MergeRequest_PR #19915 (Blue Ocean) completed with status: ABORTED

brb-nv · 2025-11-29T00:07:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-11-29T00:13:08Z

PR_Github #26230 [ run ] triggered by Bot. Commit: 98051c3

tensorrt-cicd · 2025-11-29T00:13:10Z

PR_Github #26228 [ run ] completed with state ABORTED. Commit: 98051c3
LLM/main/L0_MergeRequest_PR #19926 (Blue Ocean) completed with status: ABORTED

tensorrt_llm/bench/benchmark/low_latency.py

tensorrt-cicd · 2025-11-29T07:40:12Z

PR_Github #26230 [ run ] completed with state SUCCESS. Commit: 98051c3
/LLM/main/L0_MergeRequest_PR pipeline #19928 completed with status: 'FAILURE'

brb-nv · 2025-11-29T18:52:52Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-11-29T18:58:47Z

PR_Github #26269 [ run ] triggered by Bot. Commit: 98051c3

tensorrt-cicd · 2025-11-29T21:05:44Z

PR_Github #26269 [ run ] completed with state SUCCESS. Commit: 98051c3
/LLM/main/L0_MergeRequest_PR pipeline #19963 completed with status: 'FAILURE'

brb-nv · 2025-11-29T21:21:45Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-11-29T21:27:30Z

PR_Github #26272 [ run ] triggered by Bot. Commit: 98051c3

tensorrt-cicd · 2025-11-29T23:14:47Z

PR_Github #26272 [ run ] completed with state SUCCESS. Commit: 98051c3
/LLM/main/L0_MergeRequest_PR pipeline #19966 completed with status: 'SUCCESS'

Hi Frank, thank you for catching an issue in the incoming change.

Given the suggested change is small and won't break any existing functionality, I'll take care of it in a follow-up MR. Main reason behind this is CI instability.

I've been trying to get a passing CI for the last 3-4 days, so I'm trying to make the best use of the current passing CI instead of kicking off another one from scratch. Thank you for your understanding.

brb-nv · 2025-11-30T04:34:40Z

Follow-up MR to address @FrankD412's comment:
#9552

I'd like to do some more testing before we expose cp configs through eval and bench. Currently tested mostly with serve.

…VIDIA#8779) The performance results of some kernels could be easily affected by the warm/cold L2 cache status. To achieve more precise profiling results, the L2 cache is cleared for every execution by the circular buffer method for better benchmarking during autotuning. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
[None][infra] Waive failed cases for main branch on 11/25 (NVIDIA#9429) Signed-off-by: qqiao <qqiao@nvidia.com>
[NVIDIA#8391][chore] test_perf.py to lock clocks read from gpu_configs.yml instead of max freq (NVIDIA#9409) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
[None][ci] Move more test stages to use OCI machines (NVIDIA#9395) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Matt Lefebvre <matthewelefebvre@gmail.com>
[None][feat] Improve TRTLLM MoE in small hidden size throughput cases (NVIDIA#9377) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
[https://nvbugs/5537996][fix] Let KV cache manager block initialization be aware whether it is doing a dry run or not (NVIDIA#9093) Before this commit, the kv cache manager does the same regardless, which causes a mis-calculation in free memory available to allocate for the KV cache manager, hence causing a crash. This commit fixes this by letting KV cache manager initialization be aware whether it is doing the dry run or not. If it is a dry run, use the max_tokens setting that is already pre-calculated and filled into kv_cache_config.max_tokens. Signed-off-by: eopXD <yuehtingc@nvidia.com>
[https://nvbugs/5667922][fix] Update long context evaluation config (NVIDIA#9426) Signed-off-by: mni <125171826+baize97@users.noreply.github.com>
[None][fix] Mitigate test timeout issues (NVIDIA#9445) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
[None][chore] Fix trtllm-eval for PyTorchLLM (NVIDIA#9427) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
[None][feat] Add a parser to layer-wise benchmarks (NVIDIA#9440) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
[None][feat] Support custom chat template for tool calling (NVIDIA#9297) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
[TRTLLM-8160][feat] Add draft token tree runtime on CDL (NVIDIA#8586) Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
[None][ci] waive a test (NVIDIA#9458) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
[https://nvbugs/5680905][fix] Relax the MMLU accuracy requirement for DS-v3.2 (NVIDIA#9439) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
[TRTLLM-8376][feat] top-p optimization (removes redundant softmax) (NVIDIA#9411) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
[TRTLLM-9490][feat] use FlashInfer's top_k_sampling_from_probs (NVIDIA#9457) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
[https://nvbugs/5647400] [fix] Enlarged the AllReduce workspace size to 64MB. Added AllReduce strategy to AD config. (NVIDIA#9145) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
[TRTLLM-909][feat] Overlap context chunks in pipeline parallel mode (NVIDIA#9308) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
[None][chore] AutoDeploy add multi stream moe pass to default.yaml (NVIDIA#9430) Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
[https://nvbugs/5685143][fix] avoid cudaFree overlap with cuda graph (NVIDIA#9438) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
[None][chore] Bump version to 1.2.0rc5 (NVIDIA#9455) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
[TRTLLM-8936][test] Add disagg and wideep multi-node multi-gpu test cases (NVIDIA#9356) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
[None][ci] move some slow test cases of DGX-B200 to post merge (NVIDIA#9467) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
[TRTLLM-9293][feat] Enable partial weight loading to support streaming update weights (NVIDIA#9224) Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>
[None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
[TRTLLM-9264][fix] Add accuracy/unit tests/doc for phi4mm (NVIDIA#9246) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
[https://nvbugs/5580099][fix] Cherry pick IMA issue fix from release/1.1 (NVIDIA#9032) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
[None][chore] Upgrade CuteDSL to 4.3.0 (NVIDIA#9444) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
[None][feat] Support MLA chunked prefill for DeepSeek V3.2 model (NVIDIA#9376) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
[None][feat] Add environment variable to force spec-dec number of accepted tokens (NVIDIA#9371) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
[None][infra] Update allowed list 2025.11.25 (NVIDIA#9468) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
[None][infra] Fail the pipeline when slurm ssh dropped (NVIDIA#9157) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
[None][feat] AutoDeploy: Remove redundant copies in mamba layers (NVIDIA#9461) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
[None][feat] AutoDeploy: Add A_log fusion for Mamba layers (NVIDIA#9422) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
[None][ci] Waive blackwell test on spec gate. (NVIDIA#9502) Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
[https://nvbugs/5608930][fix] Fix a typo (NVIDIA#9487) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
[NVIDIA#9463][feat] Add revision option to trtllm commands (NVIDIA#9498) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
[TRTLLM-9085][doc] fix math formula rendering issues (NVIDIA#9481) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
[None][chore] update comments in llm_args.py (NVIDIA#9472) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
[None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
[https://nvbugs/5680310][fix] Fix ctx only timed out test (NVIDIA#9410) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
[https://nvbugs/5547414][fix] enable case after using local cache model (NVIDIA#9473) Signed-off-by: Hui Gao <huig@nvidia.com>
[None][fix] Replace PYTORCH_CUDA_ALLOC_CONF with PYTORCH_ALLOC_CONF to fix deprecation warning (NVIDIA#9294) Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
[https://nvbugs/5698581][fix] Init draft tokens for CUDA graph dummy request (NVIDIA#9505) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
[None][infra] Waive failed case in pre-merge on 11/27 (NVIDIA#9507) Signed-off-by: qqiao <qqiao@nvidia.com>
[TRTLLM-9513][docs] Qwen3 deployment guide (NVIDIA#9488) Signed-off-by: Lanyu Liao <laliao@laliao-mlt.client.nvidia.com> Co-authored-by: Lanyu Liao <laliao@laliao-mlt.client.nvidia.com>
[None][chore] revert batch_size=1 to prevent timeout and lower accuracy reference by 0.12% as a WAR (NVIDIA#9447) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Co-authored-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
[TRTLLM-9279][infra] Use flexcache for gh200 nodes since they locate in Austin (NVIDIA#9405) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
[cherry-pick][https://nvbugs/5670793][fix] Solve trtllm-serve launch_disaggregated issue (NVIDIA#9346) Signed-off-by: xxi <xxi@nvidia.com>
[None][infra] Fix Slurm job script (NVIDIA#9508) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
[None][fix] change allreduce workspace dtype to torch.int64 to avoid overflow (NVIDIA#9479) Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com>
[None][feat] add qwen3-next CI test of accuracy on BF16 and NVFP4 (NVIDIA#9330) Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>
[None][fix] fix TP support for DeepSeek-V3.2 on hopper (NVIDIA#9484) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
[TRTLLM-9389][chore] Refactor AlltoallMethodType. (NVIDIA#9388) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
[https://nvbugs/5674665][chore] Add test coverage for https://nvbugspro.nvidia.com/bug/5674665 (NVIDIA#9518) Signed-off-by: eopXD <yuehtingc@nvidia.com>
[TRTLLM-7288][infra] Download merged waive list in slurm script (NVIDIA#8999) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
[https://nvbugs/5687820][fix] Remove self.abort() in DetokenizedGenerationResult (NVIDIA#9449) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
[NVIDIA#9150][feat] AutoDeploy Nemotron-Flash support (NVIDIA#9504) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
[None] [chore] Update to cutlass 4.3 (NVIDIA#8637) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
[https://nvbugs/5637037][chore] Update waive lists. (NVIDIA#9386) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
[None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com>
[TRTLLM-8970][infra] Fix generate report when has isolation test result (NVIDIA#8861) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com>
[https://nvbugs/5685015][fix] Update invalid max_token test (NVIDIA#9435) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
[None][fix] Fix on-disk cache and revise logger/statistics for AutoTuner.
(NVIDIA#9211) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> [https://nvbugs/5689658][test] Fix gpu lock issue running on cluster (NVIDIA#9441) Signed-off-by: yufeiwu <230315618+yufeiwu-nv@users.noreply.github.com> [None][chore] add spec_decoding configs in perf benchmark scripts and fix typos (NVIDIA#9533) Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com> Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com> [None][fix] Remove FP8 K/V buffer from TRTLLM sparse MLA attention kernel (NVIDIA#9529) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com> [None] [chore] Enhancements and clean up to slurm scripts (NVIDIA#9493) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> [None][chore] Revert "[None][fix] change allreduce workspace dtype to torch.int64 t… (NVIDIA#9538) Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com> [None][infra] Waive failed cases for main branch on 11/28 (NVIDIA#9539) Signed-off-by: qqiao <qqiao@nvidia.com> [None][fix] Pass checkpoint_format to create_input_processor (NVIDIA#9521) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> [TRTLLM-9541][infra] Use artifactory mirror for download.pytorch.org (NVIDIA#9477) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> [TRTLLM-9488][feat] add 'disable_flashinfer_sampling' config option (NVIDIA#9454) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com> [None][infra] Waive failed case in pre-merge on 11/28 (NVIDIA#9537) Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com> [None][perf] Helix: improve all-to-all perf for large CP size (NVIDIA#9494) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com> Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com> Co-authored-by: Zheyu Fu <zheyuf@nvidia.com> [None][feat] 
support for more accurate AR calculation (NVIDIA#9323) Signed-off-by: binghanc <176802681+binghanc@users.noreply.github.com> [TRTLLM-9488][fix] llmapi references (NVIDIA#9547) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com> [NVIDIA#8948][feat] Support custom sharding config (NVIDIA#9143) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [None][chore] Weekly mass integration of release/1.1 -- rebase (NVIDIA#9522) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com> Signed-off-by: qgai <qgai@nvidia.com> Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> Signed-off-by: Simeng Liu <simengl@nvidia.com> Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: Vincent Zhang <vinczhang@nvidia.com> Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com> Signed-off-by: Michal Guzek <mguzek@nvidia.com> Signed-off-by: Michal Guzek <moraxu@users.noreply.github.com> Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com> Signed-off-by: leslie-fang25 <leslief@nvidia.com> Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Co-authored-by: yunruis <205571022+yunruis@users.noreply.github.com> Co-authored-by: 
sunnyqgg <159101675+sunnyqgg@users.noreply.github.com> Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com> Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Co-authored-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com> Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com> Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Vincent Zhang <vcheungyi@163.com> Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com> Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com> Co-authored-by: Chang Liu <9713593+chang-l@users.noreply.github.com> Co-authored-by: Leslie Fang <leslief@nvidia.com> Co-authored-by: Shunkangz <182541032+Shunkangz@users.noreply.github.com> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com> [TRTLLM-5971][feat] Integrate helix parallelism (NVIDIA#9342) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [None][infra] - Request idle time exemption for OCI jobs (NVIDIA#9528) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> [None][infra] Wiave failed tests for main branch on 11/30 (NVIDIA#9555) Signed-off-by: qqiao <qqiao@nvidia.com> [None][fix] Fix port conflict in disagg tests (NVIDIA#9474) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> [None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage (NVIDIA#9558) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> [None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage (NVIDIA#9559) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> [TRTLLM-8958][feat] and 
[TRTLLM-8960]: create ConfigurableMoE and support TRTLLMGenFusedMoE as backend (NVIDIA#9486) [None] [feat] Optimize the algorithm part of RocketKV (NVIDIA#9333) Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com> [https://nvbugs/5690172][fix] Fix Qwen3-235B ATP accuracy issue with PDL (NVIDIA#9530) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> [TRTLLM-6222][feat] Extend cute_dsl_nvfp4_gemm to sm103. (NVIDIA#9543) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com> [None][fix] Correct virtual memory allocation alignment (NVIDIA#9491) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [https://nvbugs/5684703][fix] Unwaive disagg guided decoding test (NVIDIA#9466) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> [https://nvbugs/5503479][fix] Temporarily lower reference accuracy to stabilize CI (NVIDIA#9398) Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com> [None][chore] remove qwen3-next accuracy tests (NVIDIA#9534) Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com> [None][doc] fix mtp.py typo (NVIDIA#9307) Signed-off-by: liugaoji <757394026@qq.com> [None][feat] add chat template kwargs support to longbench-v2 (NVIDIA#9544) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> [NVIDIA#9496][fix] AutoDeploy: remove auto-tuner from nvfp4_gemm forward (NVIDIA#9497) Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> [None][fix] Replace hash method with unique_id for cutedsl MoE runners. 
(NVIDIA#9569) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> [None][chore] refactor disaggregated scripts to use named arguments (NVIDIA#9581) Signed-off-by: Zhenhuan Chen <zhenhuanc@nvidia.com> [TRTLLM-6222][feat] Several perf opt for cuteDSL nvf4 gemm (NVIDIA#9428) Signed-off-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com> [None][chore] reduce the layers of the `devel` docker image (NVIDIA#9077) Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com> [https://nvbugs/5651854][infra] Enable perf metrics during accuracy testing (NVIDIA#9140) [None][fix] Skip Allreduce init for Attention DP (NVIDIA#9542) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> [None][test] [None][test] Waive main branch test failures 12/1 (NVIDIA#9566) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> [None][ci] Minor change for Slurm scripts (NVIDIA#9561) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> [TRTLLM-6768][infra] Fix params for not updating github status (NVIDIA#6747) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> [None][infra] Update the pytest options after MI (NVIDIA#9579) Signed-off-by: qqiao <qqiao@nvidia.com> [TRTLLM-6756][feat] Add Beam Search to TorchSampler (NVIDIA#8509) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com> [None][chore] Defer exposing context parallel configs (NVIDIA#9552) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> [TRTC-1943][feat] Env vars override support in LLM API (NVIDIA#9104) Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com> [None][feat] AutoDeploy: Use the router gemm op for nemotron MOE (NVIDIA#9500) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> [NVIDIA#9198][feat] Refactor dist ops in AutoDeploy (NVIDIA#9301) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com> [None][fix] Prevent YAML partial kv_cache_config from incorrectly 
overriding the complete kv_cache_config (NVIDIA#9262) Signed-off-by: Yuening Li <62227368+Yuening-wa@users.noreply.github.com> [TRTLLM-9085][doc] fix math formula rendering issues in github (NVIDIA#9605) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> [None][feat] Unify nvfp4 gemm backend (NVIDIA#8963) Signed-off-by: Shijie Wang <jaywan@nvidia.com> Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Shijie <jaywan@nvidia.com> Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com> [None][feat] Add support for KVCache reuse for DSv32 (NVIDIA#9383) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [None][chroe] Polish qwen3-next modeling code. (NVIDIA#8902) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> [https://nvbugs/5703953][fix] Use random port for disagg tests (NVIDIA#9582) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> [None][fix] Waive gb200 (NVIDIA#9580) Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com> [FMDL-1328][feat] Add support for nano-v3 and super-v3 with pytorch backend (NVIDIA#9261) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com> [https://nvbugs/5582091][test] increase warmup times in testing for multi-gpu cases (NVIDIA#9578) Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com> [None][chore] Add failed cases into waives.txt (NVIDIA#9588) Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> [https://nvbugs/5702793][fix] Fix uncontiguous tensor view (NVIDIA#9576) Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> [None][infra] Waive failed cases for main branch (NVIDIA#9615) Signed-off-by: qqiao <qqiao@nvidia.com> 
[TRTLLM-9488][feat] use FlashInfer.sampling by default (NVIDIA#9545) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com> [None][infra] Update allowlist 2025/12/01 (NVIDIA#9616) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com> [None][infra] Remove an invalid test name in waives.txt (NVIDIA#9620) Signed-off-by: qqiao <qqiao@nvidia.com> Lock the gpu clocks in L0 perf tests (NVIDIA#9585) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com> [TRTLLM-9466][test] Evaluate helix parallelism with DSV3 Lite (NVIDIA#9597) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> [None][fix] Extract GPU count from single-node stage names (NVIDIA#9599) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com> [https://nvbugs/5667774][fix] Refine Piecewise Cuda Graph Condition for DP (NVIDIA#9393) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> [TRTLLM-9144][fix] enhance RPC robustness (NVIDIA#8711) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com> [https://nvbugs/5627710][fix] Fix synchronization bugs in KvCacheTransferManager that can cause corrupted blocks (NVIDIA#9056) Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com> Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> Co-authored-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> [TRTLLM-8980][test] Clean up spec dec tests in test_llm_api_pytorch (NVIDIA#8889) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [NVIDIA#9150][feat] Add code for nano v3 to custom implementation in 
AD (NVIDIA#9465) * Why? We would like to show an alternative to monkey-patching in AutoDeploy. * What? This commit builds on the existing custom model implementation for NemotronH and adds the bits relevant for MoE layers. Part of NVIDIA#9150. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com> [NVIDIA#9150][feat] AutoDeploy: reviewer comments for NVIDIA#9150 (NVIDIA#9527) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> [https://nvbugs/5651854][fix] Fix dist-serving perf by clearing CPU affinity (NVIDIA#9549) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> [NVIDIA#9550][feat] AutoDeploy: Add NVFP4 Cutlass MoE kernels (NVIDIA#9551) Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com> [https://nvbugs/5688388][fix] fix: Reducing num request in disagg test to speed up (NVIDIA#9598) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> [TRTLLM-8946][feat] Improved heuristics to detect shardable regions (NVIDIA#9200) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com> Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> [NVIDIA#9632][feat] Support EXTRA_WHEEL_BUILD_ARGS during wheel build (NVIDIA#9633) Signed-off-by: Yu Chi Li <yuchil@nvidia.com> [None][chore] Waive test failing on pre-merge (NVIDIA#9638) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> [None][chore] Remove traceback dump for multimodal input processor (NVIDIA#9634) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com> [None][chore] Fix trtllm-eval and move GroupedGemmInputsHelper (NVIDIA#9612) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> [https://nvbugs/5698434][fix] Use separate weight mapper for draft (NVIDIA#9607) Signed-off-by: Anurag Mukkara 
<134339030+amukkara@users.noreply.github.com> [TRTLLM-7101][infra] Reuse passed tests (NVIDIA#6894) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> [None][test] Remove duplicate test cases (NVIDIA#9623) Signed-off-by: yufeiwu <230315618+yufeiwu-nv@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [None][feat] Add RocketKV usage doc and e2e accuracy test on LongBenchV2 (NVIDIA#9572) Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com> [TRTLLM-9242][doc] Add examples showcasing openai compatible APIs (NVIDIA#9520) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> [None][chore] AutoDeploy update cuda stream manager for multi-device (NVIDIA#9575) Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> [TRTLLM-9391][chore] Automatically estimate required workspace. (NVIDIA#9535) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> [https://nvbugs/5708475][fix] Fix e2e eval accuracy for helix parallelism (NVIDIA#9647) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> [https://nvbugs/5561153][test] Fix log error for perf test (NVIDIA#9622) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com> [TRTLLM-8241][feat] Aliasing to comply to LlmArgs (NVIDIA#9586) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com> [None][chore] Add failed cases into waives.txt (NVIDIA#9593) Signed-off-by: Jie Li <lijie@nvidia.com> Co-authored-by: Jie Li <lijie@nvidia.com> [TRTLLM-6842][feat] Support Response API for general purpose (NVIDIA#9392) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> [None][test] Update Qwen3-next accuracy testing by setting the cuda … (NVIDIA#9613) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> [None][feat] 
update trtllm-gen nvfp4 kernels with better performance (NVIDIA#9510) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> [None][doc] Replace the tensorrt icon with torch icon on overview.md (NVIDIA#9644) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> [https://nvbugs/5705197][chore] Unwaive timeout disagg tests (NVIDIA#9637) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> [https://nvbugs/5552132][fix] Enable LoRa for GPT OSS Torch (NVIDIA#8253) Signed-off-by: Michal Guzek <mguzek@nvidia.com> [None][fix] Fix wide ep MoE error (NVIDIA#9642) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> [https://nvbugs/5702795][fix] Remove the warning message for aten.log. (NVIDIA#9665) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> [https://nvbugs/5693853][fix] Fix error handling when querying machin… (NVIDIA#9483) Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com> [OMNIML-2932] [feat] nvfp4 awq support (NVIDIA#8698) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> [NVIDIA#9643][fix] AutoDeploy: fix nano sharding config (NVIDIA#9668) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> [NVIDIA#9147][feat] AutoDeploy: Draft Target Speculative Decoding (NVIDIA#9275) Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> [None][feat] Update Qwen3CodeToolParser to align tool-calling parameters (NVIDIA#9540) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com> [TRTLLM-7181][infra] Generate test results when pytest timeout happens (NVIDIA#9396) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [TRTLLM-9522][fix] restore `trtllm-serve mm_embedding_serve` (NVIDIA#9669) [TRTLLM-5093][infra] 
Write env variables to a file in the interactive debug session (NVIDIA#6792) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> [None][fix] fix error when processing batches containing both text and mm data (NVIDIA#8381) Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn> [TRTLLM-7073][feat] Support torch compile for PP for Llama and DeepSeekV3 (NVIDIA#7838) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> [None][feat] Add weights initialization and context phase parser to layer-wise benchmarks (NVIDIA#9667) Signed-off-by: Tailing Yuan <yuantailing@gmail.com> [TRTLLM-8274][feat] Check if executor is shutdown in /health entrypoint (NVIDIA#9057) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> [NVIDIA#8733][feat] Add Llama4 MoE handling to AutoDeploy (NVIDIA#9556) Signed-off-by: Tal Cherckez <127761168+tcherckez-nvidia@users.noreply.github.com> Signed-off-by: tcherckez-nvidia <127761168+tcherckez-nvidia@users.noreply.github.com> Co-authored-by: Neta Zmora <nzmora@nvidia.com> [None][ci] unwaive tests (NVIDIA#9651) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> [None][feat] Add NIXL-LIBFABRIC support (NVIDIA#9225) Signed-off-by: Yoray Zack <62789610+zackyoray@users.noreply.github.com> Signed-off-by: zackyoray <yorayz@nvidia.com> [None][test] rename wide ep and disagg metric name in perf test (NVIDIA#9704) Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com> [https://nvbugs/5467531][fix] Unwaive fused_moe all to all test with … (NVIDIA#9617) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> [None][fix] Recover TRTLLM MoE Perf for DEP (NVIDIA#9562) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com> [None][chore] Add failed cases into waives.txt (NVIDIA#9662) Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com> Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> 
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> [None][fix] Fix TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS for MTP/EAGLE (NVIDIA#9608) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> [None][infra] Add container notices and documentation (NVIDIA#9185) Signed-off-by: Parker Drake <pdrake@nvidia.com> [TRTLLM-5312][infra] Add triton trigger rules (NVIDIA#6440) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> [None][doc] Add feature docs for helix parallelism (NVIDIA#9684) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> [TRTLLM-9579][infra] Set mergeWaiveList stage UNSTABLE when there is any issue (NVIDIA#9692) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> [None][doc] Added line about partial reuse (NVIDIA#7846) Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com> [TRTLLM-8920][feat] decouple disagg service from fastapi (NVIDIA#8714) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> [https://nvbugs/5633340][fix] start disagg workers and servers on free ports (NVIDIA#9694) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> [TRTLLM-9562] [doc] Add Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell (NVIDIA#9711) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> [NVIDIA#9602][feat] AutoDeploy: Support TRTLLM Sampler (NVIDIA#9641) Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [None] [tests] Unwaive EPLB tests (NVIDIA#9625) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> [https://nvbugs/5518713][test] Refactor core test lists by merging with llm_perf_cluster.yml (NVIDIA#9714) Signed-off-by: yufeiwu <230315618+yufeiwu-nv@users.noreply.github.com> [TRTLLM-7136][feat] 
Update load_weights method to include mapping parameter in checkpoint loaders (NVIDIA#9583) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> [None][refactor] Improve request processing function in sampler (NVIDIA#9671) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> [https://nvbugs/5670672][fix] Fix flaky KV connector tests (NVIDIA#9676) Signed-off-by: jthomson04 <jwillthomson19@gmail.com> [None][infra] Update allowed list 20251204 (NVIDIA#9718) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com> [None][feat] AutoDeploy: Perf optimization for Attention and rmsnorm (NVIDIA#9719) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> [None][chore] Waive flakey disagg tests (NVIDIA#9749) Signed-off-by: Mike Iovine <miovine@nvidia.com> [https://nvbugs/5601682][fix] Fix cacheTransceiver hang (NVIDIA#9311) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [TRTLLM-9199][docs] KV Connector Docs (NVIDIA#9325) Signed-off-by: jthomson04 <jwillthomson19@gmail.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [TRTLLM-9160][doc] add doc to llm_runtime.py (NVIDIA#9482) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [None][doc] VDR 1.0 trtllm-serve doc enhancement (NVIDIA#9443) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [TRTLLM-9086][doc] Clean up TODOs in documentation (NVIDIA#9292) 
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [TRTLLM-9157][doc] Guided decoding doc improvement (NVIDIA#9359) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [None][infra] Updated Linux installation guide (NVIDIA#9485) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [TRTLLM-9075][doc] refine the slurm examples (NVIDIA#9548) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [TRTLLM-9093][doc] update hyper links in overview (NVIDIA#9568) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [TRTLLM-9092][doc] link to modelopt checkpoints in quick start guide (NVIDIA#9571) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [None][fix] Fix triton moe load_weight (NVIDIA#9649) Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> [None][fix] fix a bug: deepseek_fp8_block_scales in TRTLLMGEN-MoE use 2D x_sf instead of 1D (NVIDIA#9658) Signed-off-by: xxi <xxi@nvidia.com> [TRTLLM-9372][feat] Enable CuteDSL MoE with Large EP (NVIDIA#9592) Signed-off-by: Enwei Zhu 
<21126786+syuoni@users.noreply.github.com> [TRTLLM-9522][chore] implement default `attach_multimodal_embeddings` (NVIDIA#9664) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com> [TRTLLM-9660][feat] Convert cuteDSL GEMM to opt-in feature (NVIDIA#9682) Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> [None][fix] enable hmac in RPC (NVIDIA#9745) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [https://nvbugs/5703953][fix] Preserving ip:port for trtllm-serve before initializing llm (NVIDIA#9646) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> [None][infra] Waive failed cases for main branch on 12/07 (NVIDIA#9769) Signed-off-by: qqiao <qqiao@nvidia.com> [None][fix] Several minor fixes to CI setting (NVIDIA#9765) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> [OMNIML-3036][doc] Re-branding TensorRT-Model-Optimizer as Nvidia Model-Optimizer (NVIDIA#9679) Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> [None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce (NVIDIA#9314) Signed-off-by: Ludwig Schneider <lschneider@nvidia.com> [TRTLLM-9000][feat] Add multi-node Perf Tests into CI (NVIDIA#8800) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> [None][test] add ntp tolerance in time metrics verification (NVIDIA#9741) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com> [TRTLLM-9603][feat] Enable ConfigurableMoE test in the CI (NVIDIA#9645) [https://nvbugs/5422621][test] Add GB 200 WIDEEP test case for RCCA 5422621 (NVIDIA#9506) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com> [None][fix] Fix two tuning cache miss issues. 
(NVIDIA#9743) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [TRTLLM-9706] [doc] Update wide EP documents (NVIDIA#9724) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> [https://nvbugs/5666804][test] only adding sampler config for limited models (NVIDIA#9512) Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com> Co-authored-by: Larry Xu <197874197+LarryXFly@users.noreply.github.com> [None][infra] Waive failed cases for main on 12/08 (NVIDIA#9773) Signed-off-by: qqiao <qqiao@nvidia.com> [None][chore] Move the rocketkv e2e test to post-merge (NVIDIA#9768) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> [None][chore] Enable tvm_ffi for cute dsl nvfp4_gemm to reduce host overhead. 
(NVIDIA#9690) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com> [TRTLLM-9431][perf] Enable multistream for Linear Attention in Qwen3-… (NVIDIA#9696) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> [None][chore] Remove closed bugs (NVIDIA#9770) Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> [None][infra] update mooncake in docker images (NVIDIA#9584) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com> Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com> [None][test] Add Kimi k2 WIDEEP perf and accuracy cases (NVIDIA#9686) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com> Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> [https://nvbugs/5527655][test] Add test case for RCCA 5527655 (NVIDIA#9511) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com> [http://nvbugs/5649010][fix] fix test_auto_scaling.py::test_worker_restart timeout (NVIDIA#9775) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> [None][fix] Switch AutoDeploy's default allreduce strategy to NCCL (NVIDIA#9666) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com> [TRTLLM-9506][fix] Fix AR for DeepSeek-R1 2 model path (NVIDIA#9661) Signed-off-by: qgai <qgai@nvidia.com> ray + updatew works trtllm works in async env trtllm works in sync and async env ray + updatew works rebase to the updated verl server mode still cherry pick still cherry pick still cherry pick integrated http interface hang at RyExecutor create workers ray.remote clean code use tensorrt_llm.rlhf_utils Signed-off-by: Liwei Ma <liweim@nvidia.com> placement, asyncllm, and basic tests Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> connect sleep and wakeup; Add support to pass None to update_weights Signed-off-by: Erin Ho 
<14718778+hchings@users.noreply.github.com>
Batching ctx for IFB scheduler Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
accuracy WAR for TP>1: always use AllReduceStrategy.NCCL, refactored Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
fix e2e integration Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
update asyncllm, other nits Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
fix init setup Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Fix TRTLLMSampler logprobs perf Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
fix and cleanup Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
fix server Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Revert "Batching ctx for IFB scheduler" This reverts commit b51aac0 Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
update & address comments Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv changed the title from "User/brb/integrate helix on main redo mr" to "[None][feat] Integrate helix parallelism" on Nov 20, 2025

brb-nv mentioned this pull request Nov 20, 2025

[None][feat] Integrate helix on main #8894

Closed

1 task

brb-nv commented Nov 21, 2025

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (outdated, resolved)

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch 2 times, most recently from 812dfb9 to 50436a1 on November 21, 2025 17:43

brb-nv marked this pull request as ready for review November 21, 2025 17:51

brb-nv requested review from a team as code owners November 21, 2025 17:51

brb-nv requested review from MatthiasKohl, Shixiaowei02, hlu1, laikhtewari and syuoni November 21, 2025 17:51

coderabbitai bot reviewed Nov 21, 2025

tensorrt_llm/_torch/attention_backend/trtllm.py (resolved)

tensorrt_llm/_torch/distributed/communicator.py (outdated, resolved)

tensorrt_llm/_torch/distributed/communicator.py (outdated, resolved)

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch from 6f7ffc7 to ec20a04 on November 22, 2025 04:04

brb-nv requested a review from a team as a code owner November 23, 2025 01:17

brb-nv requested a review from chuangz0 November 23, 2025 01:17

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch 4 times, most recently from ec9faa5 to 7eabb38 on November 23, 2025 03:50

syuoni reviewed Nov 25, 2025

chuangz0 approved these changes Nov 25, 2025

MatthiasKohl approved these changes Nov 28, 2025

[TRTLLM-5971][feat] Integrate Helix Parallelism

98051c3

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv force-pushed the user/brb/integrate-helix-on-main-redo-mr branch from 83d6416 to 98051c3 on November 28, 2025 20:49

FrankD412 approved these changes Nov 29, 2025

FrankD412 previously requested changes Nov 29, 2025

tensorrt_llm/bench/benchmark/low_latency.py (resolved)

brb-nv requested a review from FrankD412 November 29, 2025 21:52

brb-nv merged commit b77f4ff into NVIDIA:main Nov 29, 2025
5 checks passed

brb-nv mentioned this pull request Nov 30, 2025

[None][chore] Defer exposing context parallel configs #9552

Merged

1 task

brb-nv mentioned this pull request Dec 4, 2025

[None][doc] Add feature docs for helix parallelism #9684

Merged

1 task

codego7250 pushed a commit to codego7250/TensorRT-LLM that referenced this pull request Dec 11, 2025

[TRTLLM-5971][feat] Integrate helix parallelism (NVIDIA#9342)

57b8608

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv commented Nov 20, 2025 (edited)

Comment sections: Description, Test Coverage, PR Checklist, GitHub Bot Help, kill, skip, reuse-pipeline, Summary by CodeRabbit, Release Notes

coderabbitai bot commented Nov 21, 2025 (edited)

Comment sections: Walkthrough, Changes, Sequence Diagram(s), Estimated code review effort, Suggested reviewers, Pre-merge checks and finishing touches

coderabbitai bot left a comment

chuangz0 left a comment

tensorrt-cicd commented Nov 27, 2025

tensorrt-cicd commented Nov 28, 2025

MatthiasKohl left a comment

brb-nv commented Nov 28, 2025

tensorrt-cicd commented Nov 28, 2025

brb-nv commented Nov 28, 2025

tensorrt-cicd commented Nov 28, 2025

tensorrt-cicd commented Nov 28, 2025

brb-nv commented Nov 29, 2025

tensorrt-cicd commented Nov 29, 2025

tensorrt-cicd commented Nov 29, 2025

tensorrt-cicd commented Nov 29, 2025

brb-nv commented Nov 29, 2025

tensorrt-cicd commented Nov 29, 2025

tensorrt-cicd commented Nov 29, 2025

brb-nv commented Nov 29, 2025

tensorrt-cicd commented Nov 29, 2025

tensorrt-cicd commented Nov 29, 2025

brb-nv commented Nov 30, 2025
