
Conversation

@ChristinaZ
Collaborator

@ChristinaZ ChristinaZ commented Sep 16, 2025

Summary by CodeRabbit

  • New Features
    • Expanded MoE routing to support top-k up to 10 and up to 512 experts.
    • Configurable per-kernel maximum experts with improved group-aware top-k selection.
    • Enhanced routing variants for DeepSeek V3 and Llama 4 scenarios.
  • Performance
    • Reworked kernels and reductions for faster, more scalable top-k routing and normalization.
    • Conditionally fused gating in DeepSeek V3 to optimize routing based on runtime shapes.
  • Refactor
    • Updated routing API to use bias instead of prior auxiliary inputs; streamlined launch paths.
  • Tests
    • Added extensive unit tests covering larger expert counts, top-k=10, and new routing paths.

Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KIMI K2 and Qwen-next (MoE TRTLLM backend)

We fused the PyTorch ops for the routing part of DeepSeek V3, and now we optimize its performance in the same way as we previously did for the MoE TRTLLM backend.

This PR also adds support for KIMI K2 and Qwen-next (MoE TRTLLM backend).

Test Coverage

pytest -s tests/unittest/_torch/thop/parallel/test_noaux_tc.py
pytest -v -s tests/unittest/_torch/thop/parallel/test_moe.py

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from 8813980 to d9061c9 on October 1, 2025 14:09
@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from d51729b to 0614625 on October 13, 2025 14:27
@ChristinaZ ChristinaZ self-assigned this Oct 13, 2025
@ChristinaZ ChristinaZ marked this pull request as ready for review October 13, 2025 14:31
@ChristinaZ ChristinaZ requested review from a team as code owners October 13, 2025 14:31
@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21230 [ run ] triggered by Bot

@coderabbitai
Contributor

coderabbitai bot commented Oct 13, 2025

📝 Walkthrough

Walkthrough

Refactors and generalizes MoE routing/top‑K kernels to support larger expert counts and top‑k up to 10, adds generic warp top‑K reductions, and parameterizes launches by per‑kernel MaxNumExperts. Public API changes include noAuxTc switching from group_scores/scores_with_bias to bias. Tests and PyTorch integration updated accordingly.
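As a rough illustration of the warp-level pattern behind these top-K reductions, here is a minimal sketch with one candidate per lane and K rounds of butterfly shuffles. It is not the actual reduceTopK signature (the real overloads handle multiple candidates per thread and chunking for N>4); all names are illustrative.

#include <cfloat>
#include <cuda_runtime.h>

// Minimal warp top-K sketch: 32 lanes each contribute one (value, index)
// candidate; each round, a butterfly max-reduction extracts the next-largest.
// Larger value wins; on ties, the smaller index wins.
__device__ void warpTopKSketch(float value, int index, int k, float* outVals, int* outIdxs)
{
    unsigned const fullMask = 0xffffffffu;
    for (int kk = 0; kk < k; ++kk)
    {
        float bestVal = value;
        int bestIdx = index;
        for (int offset = 16; offset > 0; offset >>= 1)
        {
            float const otherVal = __shfl_xor_sync(fullMask, bestVal, offset);
            int const otherIdx = __shfl_xor_sync(fullMask, bestIdx, offset);
            if (otherVal > bestVal || (otherVal == bestVal && otherIdx < bestIdx))
            {
                bestVal = otherVal;
                bestIdx = otherIdx;
            }
        }
        // After the butterfly, every lane holds the warp-wide winner.
        outVals[kk] = bestVal;
        outIdxs[kk] = bestIdx;
        if (index == bestIdx) // the winning lane retires its candidate
        {
            value = -FLT_MAX;
        }
    }
}

The production kernels instead fold value and index into a single packed comparison key (see the comparison-key and kMaxIdx tie-break discussion later in this review), which halves the shuffle traffic per round.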

Changes

Cohort / File(s) Summary
MoE Top‑K utilities
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
Adds generic warp-level reduceTopK overloads (including N>4 chunking), renames legacy reduceTopK→reduceTopKFunc, tweaks comparison key construction, and exposes MaxNumExpertsUnit/MaxNumTopK constants.
NoAuxTc pathway (kernel, header, THOP, tests)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu, cpp/tensorrt_llm/kernels/noAuxTcKernels.h, cpp/tensorrt_llm/thop/noAuxTcOp.cpp, tests/unittest/_torch/thop/parallel/test_noaux_tc.py
Reworks kernel to use new reduceTopK utilities and group-based scoring; public API changes to accept bias instead of group_scores/scores_with_bias; updates Torch op signature, invocation, and tests to new parameters.
Routing kernels parameterization (DeepSeek/Llama4/Renormalize + core params)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, .../RoutingLlama4.cu, .../RoutingRenormalize.cu, .../RoutingKernel.cuh, .../RoutingKernel.h, .../blockScaleMoe/DevKernel.h
Generalizes kernels to per‑kernel MaxNumExperts; introduces getMaxNumExperts, new launch macros (LAUNCH_ROUTING_*), dynamic launch_bounds, shared memory sizing, and optional softmax-after-topK flag in params. Adjusts bounds (experts/groups) and flow.
PyTorch MoE frontends
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp, tests/unittest/_torch/thop/parallel/test_moe.py
Expands num_experts limit (to 512 in fp4 path) and raises top_k cap to 10; augments test matrices and assertions to reflect new limits and Renormalize routing variants.
Python DeepSeekV3 model integration
tensorrt_llm/_torch/models/modeling_deepseekv3.py
Gates fused routing by shape/resource thresholds; updates fused call to new noAuxTc API (logits + bias). Defers score computation when fused path is used.
C++ unit tests (routing)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp, cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
Adds/updates numerous scenarios to cover larger expert counts (up to 512), new topK up to 10, varied parallelization modes, and histogram-based paths.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant PT as PyTorch Op (noaux_tc_op)
  participant K as invokeNoAuxTc
  participant G as deepseek_v3_topk_kernel
  participant R as reduceTopK(…)
  participant HW as GPU Warps

  PT->>K: scores, bias, n_group, topk_group, topk, routed_scaling_factor
  K->>G: launch with params (groups, topk, bias)
  Note over G: Compute per-expert scores (+bias)<br/>Group mapping and partial reductions
  loop per warp/group
    G->>R: reduceTopK (values, indices, K)
    R->>HW: warp-level top‑K selection
    HW-->>R: top‑K per warp
  end
  G-->>K: topk_values, topk_indices
  K-->>PT: return tensors
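For reference, here is a plain C++ sketch of the routing math the diagram above describes, assuming the standard DeepSeek-V3 no-aux-loss semantics: sigmoid scores, bias added only for selection, each group scored by the sum of its top-2 biased scores, and final weights renormalized with the 1e-20 epsilon and routed scaling factor discussed later in this review. All names are illustrative; the fused kernel implements this on-device.

#include <algorithm>
#include <cmath>
#include <functional>
#include <utility>
#include <vector>

// Returns the selected expert indices; normalized routing weights go to outWeights.
std::vector<int> noAuxTcReference(std::vector<float> const& logits, std::vector<float> const& bias,
    int nGroup, int topkGroup, int topk, float routedScalingFactor, std::vector<float>& outWeights)
{
    int const numExperts = static_cast<int>(logits.size());
    int const groupSize = numExperts / nGroup;

    std::vector<float> scores(numExperts), biased(numExperts);
    for (int e = 0; e < numExperts; ++e)
    {
        scores[e] = 1.0f / (1.0f + std::exp(-logits[e])); // sigmoid
        biased[e] = scores[e] + bias[e];
    }

    // Score each group by the sum of its top-2 biased scores; keep topkGroup groups.
    std::vector<std::pair<float, int>> groupScores;
    for (int g = 0; g < nGroup; ++g)
    {
        std::vector<float> grp(biased.begin() + g * groupSize, biased.begin() + (g + 1) * groupSize);
        std::partial_sort(grp.begin(), grp.begin() + 2, grp.end(), std::greater<float>());
        groupScores.emplace_back(grp[0] + grp[1], g);
    }
    std::partial_sort(groupScores.begin(), groupScores.begin() + topkGroup, groupScores.end(), std::greater<>());

    // Top-k experts among the surviving groups, ranked by biased score.
    std::vector<std::pair<float, int>> candidates;
    for (int i = 0; i < topkGroup; ++i)
    {
        int const g = groupScores[i].second;
        for (int e = g * groupSize; e < (g + 1) * groupSize; ++e)
        {
            candidates.emplace_back(biased[e], e);
        }
    }
    std::partial_sort(candidates.begin(), candidates.begin() + topk, candidates.end(), std::greater<>());

    // Weights use the unbiased scores, renormalized and scaled (note the 1e-20 epsilon).
    std::vector<int> indices(topk);
    outWeights.assign(topk, 0.0f);
    float norm = 0.0f;
    for (int k = 0; k < topk; ++k)
    {
        indices[k] = candidates[k].second;
        norm += scores[indices[k]];
    }
    for (int k = 0; k < topk; ++k)
    {
        outWeights[k] = scores[indices[k]] * routedScalingFactor / (norm + 1e-20f);
    }
    return indices;
}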
sequenceDiagram
  autonumber
  participant Run as Routing::run()
  participant Sel as getMaxNumExperts()
  participant LM as LAUNCH_ROUTING_* macro
  participant KR as Kernel (Histogram/Offsets/Block/Cluster)
  Note over Run: RoutingDeepSeek/Llama4/Renormalize
  Run->>Sel: numExperts
  Sel-->>Run: MaxNumExperts (e.g., 128/256/384/512)
  Run->>LM: choose launch variant (cooperative/grouped/etc.)
  LM->>KR: launch __launch_bounds__(KernelParams::MaxNumExperts)
  Note over KR: Uses dynamic MaxNumExperts for indexing,<br/>shared mem, and reductions
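Illustratively, the dispatch the second diagram describes could look like the following host-side sketch. RoutingParams, routingKernel, and the exact bucket set are assumptions for illustration, not the library's actual definitions.

#include <cstdint>
#include <cuda_runtime.h>

struct RoutingParams
{
    int32_t numExperts;
    int32_t numBlocks;
    // ... routing inputs/outputs elided ...
};

// One compile-time specialization per MaxNumExperts bucket; __launch_bounds__
// ties the register budget to the block size used at launch.
template <int MaxNumExperts>
__global__ void __launch_bounds__(MaxNumExperts) routingKernel(RoutingParams params)
{
    // One thread per (padded) expert slot; the real kernels do the
    // histogram/offset/top-k work here using MaxNumExperts for indexing.
}

inline int32_t getMaxNumExpertsBucket(int32_t numExperts)
{
    if (numExperts <= 128) { return 128; }
    if (numExperts <= 256) { return 256; }
    if (numExperts <= 384) { return 384; }
    return 512;
}

void launchRouting(RoutingParams const& params, cudaStream_t stream)
{
    // blockDim.x must match the MaxNumExperts instantiation (see the
    // block-stride invariant discussed in the review comments below).
    switch (getMaxNumExpertsBucket(params.numExperts))
    {
    case 128: routingKernel<128><<<params.numBlocks, 128, 0, stream>>>(params); break;
    case 256: routingKernel<256><<<params.numBlocks, 256, 0, stream>>>(params); break;
    case 384: routingKernel<384><<<params.numBlocks, 384, 0, stream>>>(params); break;
    default: routingKernel<512><<<params.numBlocks, 512, 0, stream>>>(params); break;
    }
}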

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 15.87%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check ⚠️ Warning: The pull request description does not follow the repository's template: it omits the required @coderabbitai summary tag and lacks a dedicated "## Description" section, instead starting with a custom header and leaving template instructions in place. Revise the description to include the @coderabbitai summary directive, add a "## Description" section that explains the problem and solution, and remove leftover template comments.

✅ Passed checks (1 passed)

  • Title Check ✅ Passed: The title clearly summarizes the two main changes in the pull request (optimizing the DeepseekV3 routing kernel and adding support for 384 experts), using the correct ticket format and concise, descriptive phrasing that reflects the PR objectives.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

575-581: Fix undefined variable in debug path (scores_with_bias).

scores_with_bias is used before being defined. Compute it from logits/bias in debug mode.

Apply this diff:

-        if enable_llm_debug():
-            has_nan = torch.isnan(scores_with_bias).any()
-            if has_nan:
-                warnings.warn(
-                    "Detected NAN in the tensor scores_with_bias. Please check if it matches the expectation."
-                )
+        if enable_llm_debug():
+            debug_scores_with_bias = torch.sigmoid(logits) + e_score_correction_bias
+            if torch.isnan(debug_scores_with_bias).any():
+                warnings.warn(
+                    "Detected NaN in scores_with_bias (debug check)."
+                )

1-27: Add NVIDIA Apache‑2.0 header (retain attribution).

File lacks the required NVIDIA Apache‑2.0 header. Keep the DeepSeek attribution block below it.

Apply this diff (update year as needed):

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # --------------------------------------------------
 # Portions of this code were derived from DeepSeek‑V3:
 #   https://github.com/deepseek-ai/DeepSeek-V3

As per coding guidelines

cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)

81-85: Use TORCH_CHECK instead of throwing exceptions

Avoid throwing across the extension boundary. Prefer TORCH_CHECK for invalid dtype.

-        throw std::invalid_argument("Invalid dtype, only supports float16, float32, and bfloat16");
+        TORCH_CHECK(false, "Invalid dtype, only supports float16, float32, and bfloat16");

As per coding guidelines

cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (2)

65-69: Fix potential aliasing UB in unpack()

Reinterpreting a 64‑bit value as 32‑bit UnsignedBits by reference is unsafe. Use a width‑matched temporary.

-        auto compactTmp = cmp >> kMoveBits;
-        auto valueBits
-            = cub::Traits<T>::TwiddleOut(reinterpret_cast<typename cub::Traits<T>::UnsignedBits&>(compactTmp));
-        value = reinterpret_cast<T&>(valueBits);
+        auto compactTmp = cmp >> kMoveBits;
+        using UBits = typename cub::Traits<T>::UnsignedBits;
+        UBits ubits = static_cast<UBits>(compactTmp);
+        UBits twiddled = cub::Traits<T>::TwiddleOut(ubits);
+        value = reinterpret_cast<T&>(twiddled);

As per coding guidelines


184-192: Misleading static_assert message

N is constrained to N < 5, but message claims “<= 128”. Fix message to avoid confusion.

-    static_assert(N < 5, "Only support candidates number less than or equal to 128");
+    static_assert(N < 5, "Only support up to 4 candidates per thread (N <= 4)");
🧹 Nitpick comments (21)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)

118-130: Consider extracting magic numbers to named constants.

The validation checks use several magic number literals (8, 10, 4) for top_k and topk_group bounds. Per coding guidelines, prefer named constants over magic literals for comparisons and logic.

Consider refactoring to use named constants:

namespace {
constexpr int32_t kMAX_TOP_K_GROUPED = 8;
constexpr int32_t kMAX_TOP_K_NO_GROUP = 10;
constexpr int32_t kMAX_TOPK_GROUP = 4;
} // namespace

Then update the checks:

-        TORCH_CHECK(top_k <= 8 && top_k > 0, "Current routing kernel (with groups) only supports top_k<=8 && top_k>0.");
+        TORCH_CHECK(top_k <= kMAX_TOP_K_GROUPED && top_k > 0, "Current routing kernel (with groups) only supports top_k<=8 && top_k>0.");
-        TORCH_CHECK(topk_group.value() <= 4 && topk_group.value() > 0,
-            "Current routing kernel only (with groups) supports topk_group<=4 && topk_group > 0.");
+        TORCH_CHECK(topk_group.value() <= kMAX_TOPK_GROUP && topk_group.value() > 0,
+            "Current routing kernel only (with groups) supports topk_group<=4 && topk_group > 0.");
     }
     else if (static_cast<RoutingMethodType>(routing_method_type) == RoutingMethodType::Renormalize
         || static_cast<RoutingMethodType>(routing_method_type) == RoutingMethodType::RenormalizeNaive)
     {
-        TORCH_CHECK(top_k <= 10 && top_k > 0,
+        TORCH_CHECK(top_k <= kMAX_TOP_K_NO_GROUP && top_k > 0,
             "Current routing kernel (no groups, renormalize) only supports top_k<=10 && top_k>0.");

As per coding guidelines.

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)

2-2: Update copyright year to 2025.

The copyright header should include the current year (2025) per coding guidelines.

As per coding guidelines.

Apply this diff:

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

568-570: Prefer torch.sigmoid over F.sigmoid.

F.sigmoid is deprecated; use torch.sigmoid.

Apply this diff:

-        scores = F.sigmoid(logits)
+        scores = torch.sigmoid(logits)
tests/unittest/_torch/thop/parallel/test_noaux_tc.py (1)

77-80: Remove commented-out debug code.

The commented-out print statements should be removed entirely rather than left in the codebase as they serve no purpose and reduce code cleanliness.

-    # print(sorted_selected_values)
-    # print(ref_sorted_selected_values)
-    # print(selected_indices)
-    # print(ref_selected_indices)
     # compare
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh (5)

37-38: Constant naming/style per coding guidelines

Rename constants to k-prefixed UPPER_SNAKE_CASE and consider placing in a single header. Example: kMAX_NUM_EXPERTS_UNIT, kMAX_NUM_TOPK.

As per coding guidelines


189-190: Incorrect static_assert message for N constraint

Message says “<= 128” but the code requires N < 5 (per-thread candidates ≤ 4). Fix the message for clarity.

Apply this diff:

-    static_assert(N < 5, "Only support candidates number less than or equal to 128");
+    static_assert(N < 5, "Only support per-thread candidate count N <= 4");

Based on learnings


223-224: Inconsistent static_assert message for N limit

You assert N <= 16 but message references 16*32=512. Clarify that N is per-thread.

Apply this diff:

-    static_assert(N <= 16, "Only support candidates number less than or equal to 16*32=512");
+    static_assert(N <= 16, "Only support per-thread candidates N <= 16");

Based on learnings


240-244: Fragile/unnecessary idx initialization pattern

Initializing topKBufferIdx with ii*WarpSize - 1 risks negative tie-breakers affecting compaction; use -1 directly for clarity.

Apply this diff:

-        for (int ii = 0; ii < numResults; ++ii)
-        {
-            topKBufferValue[ii] = minValue;
-            topKBufferIdx[ii] = ii * WarpSize - 1; //@todo: check if this is correct
-        }
+        for (int ii = 0; ii < numResults; ++ii)
+        {
+            topKBufferValue[ii] = minValue;
+            topKBufferIdx[ii] = -1;
+        }

163-180: actualK handling

Remove the TODO and clamp actualK to [0, K] to avoid surprises.

Apply this diff:

-    for (int kk = 0; kk < actualK; ++kk) //@todo: check if actualK is correct
+    actualK = max(0, min(actualK, K));
+    for (int kk = 0; kk < actualK; ++kk)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (1)

269-277: numTokens comment mismatch

Comment says 10, value is 1024. Align comment or value to avoid confusion.

cpp/tensorrt_llm/kernels/noAuxTcKernels.h (2)

1-16: Update copyright year

File headers should include current year (2025).

Apply this diff:

- * Copyright (c) 2019-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2019-2025, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines


18-33: Prefer include guards over pragma once

Replace #pragma once with include guards TRTLLM_NOAUXTCKERNELS_H for consistency.

As per coding guidelines

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (1)

539-549: BlockScan template parameterization

BlockScan<int32_t, KernelParams::MaxNumExperts> assumes MaxNumExperts is within CUB limits and matches blockDim.x. If future instantiations use 384+, this becomes expensive. Consider warp scans + CTA reduction if needed.
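A minimal sketch of the CUB block-scan pattern in question, assuming the block is launched with exactly MaxNumExperts threads (the invariant noted above); KernelParams here is an illustrative stand-in.

#include <cstdint>
#include <cub/cub.cuh>

// Exclusive prefix sum of per-expert token counts across the whole block.
// All threads of the block must call this together (collective operation).
template <typename KernelParams>
__device__ int32_t expertOffsetsExclusiveSum(int32_t myExpertCount)
{
    using BlockScan = cub::BlockScan<int32_t, KernelParams::MaxNumExperts>;
    __shared__ typename BlockScan::TempStorage tempStorage;
    int32_t myOffset = 0;
    BlockScan(tempStorage).ExclusiveSum(myExpertCount, myOffset);
    return myOffset;
}

At 384 or 512 threads this still fits CUB's limits, but the suggested warp-scan-plus-CTA-combine would trade a wider collective for a two-level reduction.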

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (2)

41-66: Dead typedef and comment cleanup

DataTypeVec alias is unused; remove or use for vectorized loads later.


473-485: getMaxNumExperts() only supports <=128

This hard-caps Llama4 path; that’s fine if by design. Add a brief comment referencing the 128‑expert limit to avoid confusion with 384‑expert support elsewhere.

cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)

55-55: Stream from scores is fine; consider const-correct data pointers

Not a blocker, but these inputs are read-only. If feasible, change kernel signatures to accept const T* and use data_ptr() here.

cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1)

165-182: Single‑value reduceTopK: tie‑break relies on kMaxIdx; document or assert idx range

If idx can exceed 65535 in any caller, tie‑break fails. Consider static_assert or comment the constraint.
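To make the constraint concrete, here is an illustrative packed comparison key of the kind the tie-break implies: order-preserving value bits in the high word and (kMaxIdx - idx) in the low 16 bits, so one max-comparison prefers larger values and, on ties, smaller indices. Names and layout are assumptions, not the header's actual implementation; an idx above 65535 would wrap and corrupt the ordering.

#include <cstdint>
#include <cstring>

constexpr uint32_t kMaxIdx = 65535u;

inline uint64_t packKey(float value, uint32_t idx)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    // Order-preserving float twiddle: negatives flip all bits, positives flip the sign bit.
    bits = (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
    return (static_cast<uint64_t>(bits) << 16) | static_cast<uint64_t>(kMaxIdx - idx);
}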

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2)

172-176: Intermediate top‑K staging looks sound; minor clarity nits

Logic matches the generic reduce path. Consider naming NumInterTopKPerThread -> kNumInterTopKPerThread for const and add a brief comment on layout.

Also applies to: 184-226


631-636: Dynamic threads selection

numThreadsMain chooses 256 vs 384. Consider using getMaxNumExperts(data.mNumExperts) for consistency with histogram threads.

cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)

239-253: Avoid magic epsilon; name it

Replace 1e-20 with a named constant to document intent and avoid magic numbers.

-        auto finalScore = static_cast<OutputT>(scoreNorm * routedScalingFactor / (redNorm + 1e-20));
+        constexpr float kEPS = 1e-20f;
+        auto finalScore = static_cast<OutputT>(scoreNorm * routedScalingFactor / (redNorm + kEPS));

As per coding guidelines

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (1)

407-422: Macro name typo: LAUNCH_ROUTING_RENORNALIZE

Consider renaming to LAUNCH_ROUTING_RENORMALIZE for consistency and grep‑ability across files.

📜 Review details

Configuration used: Path: .coderabbit.yaml


📥 Commits

Reviewing files that changed from the base of the PR and between 9ff9fa6 and 0614625.

📒 Files selected for processing (18)
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (4 hunks)
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (2 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (15 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (11 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (6 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh (3 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (10 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (9 hunks)
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp (2 hunks)
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (5 hunks)
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
  • tests/unittest/_torch/thop/parallel/test_moe.py (5 hunks)
  • tests/unittest/_torch/thop/parallel/test_noaux_tc.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • tests/unittest/_torch/thop/parallel/test_moe.py
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • tests/unittest/_torch/thop/parallel/test_noaux_tc.py
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • tests/unittest/_torch/thop/parallel/test_moe.py
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • tests/unittest/_torch/thop/parallel/test_noaux_tc.py
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/_torch/thop/parallel/test_moe.py
  • tests/unittest/_torch/thop/parallel/test_noaux_tc.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
🧠 Learnings (2)
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
PR: NVIDIA/TensorRT-LLM#7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
🧬 Code graph analysis (11)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • top_k (233-233)
tests/unittest/_torch/thop/parallel/test_moe.py (5)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (8)
  • param (47-152)
  • param (47-47)
  • param (154-165)
  • param (154-154)
  • param (167-176)
  • param (167-167)
  • param (203-209)
  • param (203-204)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (6)
  • param (39-144)
  • param (39-39)
  • param (146-152)
  • param (146-146)
  • param (184-190)
  • param (184-185)
cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (2)
  • param (39-91)
  • param (39-39)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • RoutingMethodType (39-124)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1)
  • RoutingMethodType (143-155)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (8)
  • param (47-152)
  • param (47-47)
  • param (154-165)
  • param (154-154)
  • param (167-176)
  • param (167-167)
  • param (203-209)
  • param (203-204)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (6)
  • routingTopKExperts (36-39)
  • void (83-240)
  • void (316-319)
  • void (325-384)
  • getMaxNumExperts (388-403)
  • getMaxNumExperts (388-388)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (5)
  • __launch_bounds__ (402-402)
  • getMaxNumExperts (474-485)
  • getMaxNumExperts (474-474)
  • routingIndicesClusterKernel (312-390)
  • routingIndicesClusterKernel (392-392)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (6)
  • __launch_bounds__ (83-83)
  • __launch_bounds__ (325-325)
  • getMaxNumExperts (388-403)
  • getMaxNumExperts (388-388)
  • routingIndicesClusterKernel (245-314)
  • routingIndicesClusterKernel (316-316)
cpp/tensorrt_llm/kernels/noAuxTcKernels.h (1)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (3)
  • void (44-258)
  • invokeNoAuxTc (261-306)
  • invokeNoAuxTc (261-263)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (2)
  • top_k (233-233)
  • num_experts (226-226)
cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (2)
  • invokeNoAuxTc (261-306)
  • invokeNoAuxTc (261-263)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (14)
  • _ (59-75)
  • _ (79-81)
  • _ (84-87)
  • _ (90-95)
  • _ (106-107)
  • _ (110-122)
  • _ (125-130)
  • _ (133-138)
  • _ (141-146)
  • _ (149-161)
  • _ (164-170)
  • _ (173-174)
  • _ (177-180)
  • _ (184-185)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (4)
  • num_experts (226-226)
  • n_group (234-234)
  • top_k (233-233)
  • topk_group (236-236)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (3)
  • void (35-277)
  • void (301-304)
  • void (500-503)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (5)
  • void (35-277)
  • void (301-304)
  • void (500-503)
  • getMaxNumExperts (506-525)
  • getMaxNumExperts (506-506)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (6)
  • void (70-307)
  • void (392-395)
  • void (402-471)
  • getMaxNumExperts (474-485)
  • getMaxNumExperts (474-474)
  • routingIndicesHistogramScoresKernel (402-402)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (24)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)

129-130: LGTM! Verify asymmetry is intentional.

The top_k limit increase from 8 to 10 for no-group Renormalize routing is correctly implemented, with both the condition and error message updated consistently.

Note that grouped routing (line 118) retains the top_k <= 8 limit while no-group routing now allows top_k <= 10. This asymmetry appears intentional per the PR objectives, but please confirm this is by design and that the underlying kernel implementations support these different limits.

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (2)

133-133: Define and use a named constant for max experts and verify kernel support
Introduce in fp4BlockScaleMoe.cpp:

namespace {
    constexpr int32_t kMAX_NUM_EXPERTS = 512;
} // namespace

Then replace the check:

- TORCH_CHECK(num_experts <= 512, "num_experts must be less than or equal to 512");
+ TORCH_CHECK(num_experts <= kMAX_NUM_EXPERTS,
+     "num_experts must be ≤ " + std::to_string(kMAX_NUM_EXPERTS));

No existing device-code limit for 512 was found—manually confirm that all KernelParams::MaxNumExperts instantiations and __launch_bounds__ accommodate 512 experts.


205-205: Replace magic literal 256 * 2 with a named constant and document its rationale
Define a kMinExpertCountHistogramSize and add a comment explaining why the fallback minimum remains 512 instead of scaling with num_experts * 2.
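A hedged sketch of the suggested cleanup; the constant name and the surrounding sizing logic are assumptions about the file's code, not verbatim from it.

#include <algorithm>
#include <cstdint>

// Floor for the expert-count histogram scratch size, replacing the 256 * 2
// literal; the rationale for keeping the fallback at 512 rather than scaling
// purely with num_experts * 2 should be documented here.
constexpr int32_t kMinExpertCountHistogramSize = 512;

inline int32_t expertHistogramSize(int32_t numExperts)
{
    return std::max(kMinExpertCountHistogramSize, numExperts * 2);
}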

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

632-634: Ignore the custom op binding mismatch warning: the stub and C++ registrations both accept (scores, scores_with_bias, n_group, topk_group, topk, routed_scaling_factor), matching the call's six arguments; no change needed.

Likely an incorrect or invalid review comment.


582-590: Avoid mutating self.is_fused; use a per-call fused_enabled flag

  • Mutating self.is_fused persists fusion state across calls—introduce a local fused_enabled in forward() and update it instead.
  • Confirm whether the fused kernels now support top_k > 8 (e.g., up to 10) and adjust the threshold accordingly.
tests/unittest/_torch/thop/parallel/test_moe.py (3)

878-878: LGTM! Configuration aligns with PR objectives.

Adding the (384, 1, 1, 8) configuration properly expands test coverage for the 384-expert support mentioned in the PR objectives for KIMI K2 and Qwen-next models.


1061-1072: LGTM! Test coverage for Qwen-next routing added.

The new test case with 512 experts and top_k=10 correctly validates the expanded routing capabilities for the Renormalize method used by Qwen-next models.


1145-1156: LGTM! Consistent test coverage in no-autotune path.

The duplicate test case in the no-autotune section ensures consistent validation of 512 experts with top_k=10 across both autotuned and non-autotuned code paths.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (3)

120-134: LGTM! Llama4-specific optimization with fixed expert count.

The hardcoded 128 value for Llama4 routing is a valid compile-time optimization. The comment clearly indicates this is intentional and specific to the Llama4 routing method.


175-200: LGTM! Simplified macro variant without forceFloatInput.

The new LAUNCH_ROUTING_WITH_NUM_EXPERTS macro provides a cleaner interface for cases where forceFloatInput is not needed, properly threading the numExperts parameter through all dispatch paths.


136-171: Verified numExperts consistency – all LAUNCH_ROUTING_WITH_NUM_EXPERTS_FORCE_FLOAT_INPUT calls pass the exact compile-time constants (topk::MaxNumExpertsUnit, NumDeepseekExperts, NumKimiK2Experts) returned by getMaxNumExperts and matching the KernelParams::MaxNumExperts template parameter.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (2)

108-113: LGTM! Per-kernel expert count parameterization introduced.

The addition of MaxNumExperts_ as a template parameter enables compile-time specialization based on expert count, which should improve performance for fixed-size configurations like 384 experts.


183-184: LGTM! Consistent template parameter propagation.

All three routing kernel param specializations (routingDeepSeek, routingLlama4, routingRenormalize) correctly inherit the new MaxNumExperts_ parameter from KernelParamsBase and propagate it through their template hierarchies.

Also applies to: 242-243, 284-285

tests/unittest/_torch/thop/parallel/test_noaux_tc.py (1)

11-11: LGTM! Test coverage for 384 experts added.

The addition of (384, 1, 1, 8) to the test parameterization properly validates the 384-expert support mentioned in the PR objectives.

cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1)

305-336: LGTM! Comprehensive test coverage for large expert counts added.

The three new test cases (BlockLevelParallelizationLargeN, ClusterLevelParallelizationLargeN, DeviceLevelParallelizationLargeN) provide thorough validation of the 512-expert, top_k=10 configuration across different parallelization strategies, ensuring correctness at multiple scaling levels.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (2)

438-469: Block stride assumes blockDim.x == KernelParams::MaxNumExperts

gridBlockOffset/gridStride use KernelParams::MaxNumExperts. Ensure all launches set blockDim.x to match the KernelParams instantiation; otherwise, indexing is off.

Consider asserting at runtime in launcher or documenting this invariant.


598-605: Last-warp election relies on exact warp count

Condition uses KernelParams::MaxNumExperts / WarpSize - 1. Ensure MaxNumExperts is a multiple of WarpSize (it appears intended). If not, guard with ceil division or an explicit assert.
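A minimal sketch of the suggested guard, assuming WarpSize == 32; ceil division keeps the last-warp election correct even if MaxNumExperts is not a multiple of WarpSize. Names are illustrative.

constexpr int kWarpSize = 32;

template <typename KernelParams>
__device__ bool isLastWarp()
{
    static_assert(KernelParams::MaxNumExperts > 0, "MaxNumExperts must be positive");
    // Ceil division: a partial final warp still counts as the last warp.
    int const numWarps = (KernelParams::MaxNumExperts + kWarpSize - 1) / kWarpSize;
    return static_cast<int>(threadIdx.x) / kWarpSize == numWarps - 1;
}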

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (1)

489-510: Explicit constraints checks look good

TopK<=1, experts<=128, experts%4==0, padding<8. LGTM.

Ensure LAUNCH_ROUTING_LLAMA4 instantiates KernelParams with MaxNumExperts==numThreadsHist (128) so launch_bounds and shared array sizes match the launch.

cpp/tensorrt_llm/kernels/noAuxTcKernels.h (1)

28-32: Public API change verified: all invokeNoAuxTc calls and explicit instantiations now include the new bias parameter and match the updated signature.

cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)

61-79: Casting pattern is OK; ensure contiguity

Ensure scores/bias/topk tensors are contiguous to match kernel assumptions.

Run at-call-site or add:

+    TORCH_CHECK(scores.is_contiguous() && bias.is_contiguous()
+        && topk_values.is_contiguous() && topk_indices.is_contiguous(),
+        "All tensors must be contiguous");

As per coding guidelines

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (1)

575-594: Top‑K bounds checks

Good guardrails for MaxNumTopExperts and warp constraints. Confirm topk::MaxNumTopK aligns with test matrices (K up to 8).

cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)

18-37: Static constants: scope and naming OK

WARP_SIZE/MaxNumTop* constants conform to guidelines. LGTM.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (2)

24-34: Top‑K increased to 10; ensure downstream buffers account for 10

Confirm all packed Top‑K paths allocate 10 (weights/ids). Looks consistent here.


488-507: Histogram/offset launches use getMaxNumExperts

Good use of per‑kernel MaxNumExperts. LGTM.

@tensorrt-cicd
Collaborator

PR_Github #21230 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16023 completed with status: 'FAILURE'

@kaiyux kaiyux changed the title [None][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for 384 experts (MoE TRTLLM backend) [TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for 384 experts (MoE TRTLLM backend) Oct 14, 2025
@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch 2 times, most recently from 325fa13 to 52c02ef on October 16, 2025 04:20
Collaborator

@MatthiasKohl MatthiasKohl left a comment


LGTM.
I just had a question about the structure, but this should not hold the PR back, and can always be addressed in a future PR (if it even needs to be addressed).

@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from 52c02ef to 4dc2944 on October 16, 2025 14:28
@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21587 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21587 [ run ] completed with state DISABLED
L0 testing is limited to prioritized users. User ChristinaZ is not in the prioritized list. L0 testing cannot be triggered.

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21649 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21649 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16312 completed with status: 'FAILURE'

@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from f155ee3 to 7d74701 on October 17, 2025 09:17
@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21711 [ run ] completed with state SUCCESS. Commit: fc17459
/LLM/main/L0_MergeRequest_PR pipeline #16360 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21739 [ run ] triggered by Bot. Commit: fc17459

@tensorrt-cicd
Collaborator

PR_Github #21739 [ run ] completed with state FAILURE. Commit: fc17459
/LLM/main/L0_MergeRequest_PR pipeline #16383 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21743 [ run ] triggered by Bot. Commit: fc17459

@tensorrt-cicd
Collaborator

PR_Github #21743 [ run ] completed with state SUCCESS. Commit: fc17459
/LLM/main/L0_MergeRequest_PR pipeline #16386 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21747 [ run ] triggered by Bot. Commit: fc17459

@tensorrt-cicd
Collaborator

PR_Github #21747 [ run ] completed with state SUCCESS. Commit: fc17459
/LLM/main/L0_MergeRequest_PR pipeline #16389 completed with status: 'FAILURE'

…TLASS backend

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from fc17459 to d2202e8 on October 18, 2025 14:29
@ChristinaZ
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21764 [ run ] triggered by Bot. Commit: d2202e8

@tensorrt-cicd
Collaborator

PR_Github #21764 [ run ] completed with state SUCCESS. Commit: d2202e8
/LLM/main/L0_MergeRequest_PR pipeline #16403 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21781 [ run ] triggered by Bot. Commit: d2202e8

@tensorrt-cicd
Collaborator

PR_Github #21781 [ run ] completed with state SUCCESS. Commit: d2202e8
/LLM/main/L0_MergeRequest_PR pipeline #16419 completed with status: 'FAILURE'

@kaiyux
Member

kaiyux commented Oct 19, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21797 [ run ] triggered by Bot. Commit: d2202e8

@tensorrt-cicd
Collaborator

PR_Github #21797 [ run ] completed with state SUCCESS. Commit: d2202e8
/LLM/main/L0_MergeRequest_PR pipeline #16431 completed with status: 'SUCCESS'

@ChristinaZ ChristinaZ merged commit c8b9998 into NVIDIA:main Oct 20, 2025
5 checks passed
govind-ramnarayan pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Oct 21, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Oct 24, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
