
Conversation

@ChristinaZ
Collaborator

@ChristinaZ ChristinaZ commented Sep 16, 2025

Summary by CodeRabbit

  • New Features
    • Expanded MoE routing to support top-k up to 10 and up to 512 experts.
    • Configurable per-kernel maximum experts with improved group-aware top-k selection.
    • Enhanced routing variants for DeepSeek V3 and Llama 4 scenarios.
  • Performance
    • Reworked kernels and reductions for faster, more scalable top-k routing and normalization.
    • Conditionally fused gating in DeepSeek V3 to optimize routing based on runtime shapes.
  • Refactor
    • Updated routing API to use bias instead of prior auxiliary inputs; streamlined launch paths.
  • Tests
    • Added extensive unit tests covering larger expert counts, top-k=10, and new routing paths.

Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KIMI K2 and Qwen-next (MoE TRTLLM backend)

We fused the PyTorch ops for the routing part of DeepSeek V3, and now we optimize its performance in the same way as we previously did for the MoE TRTLLM backend.

This PR also adds support for KIMI K2 and Qwen-next (MoE TRTLLM backend).

Test Coverage

pytest -s tests/unittest/_torch/thop/parallel/test_noaux_tc.py
pytest -v -s tests/unittest/_torch/thop/parallel/test_moe.py

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from 8813980 to d9061c9 on October 1, 2025 14:09
@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from d51729b to 0614625 on October 13, 2025 14:27
@ChristinaZ ChristinaZ self-assigned this Oct 13, 2025
@ChristinaZ ChristinaZ marked this pull request as ready for review October 13, 2025 14:31
@ChristinaZ ChristinaZ requested review from a team as code owners October 13, 2025 14:31
@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21230 [ run ] triggered by Bot

@coderabbitai
Contributor

coderabbitai bot commented Oct 13, 2025

📝 Walkthrough

Walkthrough

Refactors and generalizes MoE routing/top‑K kernels to support larger expert counts and top‑k up to 10, adds generic warp top‑K reductions, and parameterizes launches by per‑kernel MaxNumExperts. Public API changes include noAuxTc switching from group_scores/scores_with_bias to bias. Tests and PyTorch integration updated accordingly.
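As a rough illustration of the warp-level pattern behind these top-K reductions, here is a minimal sketch with one candidate per lane and K rounds of butterfly shuffles. It is not the actual reduceTopK signature (the real overloads handle multiple candidates per thread and chunking for N>4); all names are illustrative.

#include <cfloat>
#include <cuda_runtime.h>

// Minimal warp top-K sketch: 32 lanes each contribute one (value, index)
// candidate; each round, a butterfly max-reduction extracts the next-largest.
// Larger value wins; on ties, the smaller index wins.
__device__ void warpTopKSketch(float value, int index, int k, float* outVals, int* outIdxs)
{
    unsigned const fullMask = 0xffffffffu;
    for (int kk = 0; kk < k; ++kk)
    {
        float bestVal = value;
        int bestIdx = index;
        for (int offset = 16; offset > 0; offset >>= 1)
        {
            float const otherVal = __shfl_xor_sync(fullMask, bestVal, offset);
            int const otherIdx = __shfl_xor_sync(fullMask, bestIdx, offset);
            if (otherVal > bestVal || (otherVal == bestVal && otherIdx < bestIdx))
            {
                bestVal = otherVal;
                bestIdx = otherIdx;
            }
        }
        // After the butterfly, every lane holds the warp-wide winner.
        outVals[kk] = bestVal;
        outIdxs[kk] = bestIdx;
        if (index == bestIdx) // the winning lane retires its candidate
        {
            value = -FLT_MAX;
        }
    }
}

The production kernels instead fold value and index into a single packed comparison key (see the comparison-key and kMaxIdx tie-break discussion later in this review), which halves the shuffle traffic per round.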

Changes

Cohort / File(s) Summary
MoE Top‑K utilities
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
Adds generic warp-level reduceTopK overloads (including N>4 chunking), renames legacy reduceTopK→reduceTopKFunc, tweaks comparison key construction, and exposes MaxNumExpertsUnit/MaxNumTopK constants.
NoAuxTc pathway (kernel, header, THOP, tests)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu, cpp/tensorrt_llm/kernels/noAuxTcKernels.h, cpp/tensorrt_llm/thop/noAuxTcOp.cpp, tests/unittest/_torch/thop/parallel/test_noaux_tc.py
Reworks kernel to use new reduceTopK utilities and group-based scoring; public API changes to accept bias instead of group_scores/scores_with_bias; updates Torch op signature, invocation, and tests to new parameters.
Routing kernels parameterization (DeepSeek/Llama4/Renormalize + core params)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, .../RoutingLlama4.cu, .../RoutingRenormalize.cu, .../RoutingKernel.cuh, .../RoutingKernel.h, .../blockScaleMoe/DevKernel.h
Generalizes kernels to per‑kernel MaxNumExperts; introduces getMaxNumExperts, new launch macros (LAUNCH_ROUTING_*), dynamic launch_bounds, shared memory sizing, and optional softmax-after-topK flag in params. Adjusts bounds (experts/groups) and flow.
PyTorch MoE frontends
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp, tests/unittest/_torch/thop/parallel/test_moe.py
Expands num_experts limit (to 512 in fp4 path) and raises top_k cap to 10; augments test matrices and assertions to reflect new limits and Renormalize routing variants.
Python DeepSeekV3 model integration
tensorrt_llm/_torch/models/modeling_deepseekv3.py
Gates fused routing by shape/resource thresholds; updates fused call to new noAuxTc API (logits + bias). Defers score computation when fused path is used.
C++ unit tests (routing)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp, cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
Adds/updates numerous scenarios to cover larger expert counts (up to 512), new topK up to 10, varied parallelization modes, and histogram-based paths.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant PT as PyTorch Op (noaux_tc_op)
  participant K as invokeNoAuxTc
  participant G as deepseek_v3_topk_kernel
  participant R as reduceTopK(…)
  participant HW as GPU Warps

  PT->>K: scores, bias, n_group, topk_group, topk, routed_scaling_factor
  K->>G: launch with params (groups, topk, bias)
  Note over G: Compute per-expert scores (+bias)<br/>Group mapping and partial reductions
  loop per warp/group
    G->>R: reduceTopK (values, indices, K)
    R->>HW: warp-level top‑K selection
    HW-->>R: top‑K per warp
  end
  G-->>K: topk_values, topk_indices
  K-->>PT: return tensors
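For reference, here is a plain C++ sketch of the routing math the diagram above describes, assuming the standard DeepSeek-V3 no-aux-loss semantics: sigmoid scores, bias added only for selection, each group scored by the sum of its top-2 biased scores, and final weights renormalized with the 1e-20 epsilon and routed scaling factor discussed later in this review. All names are illustrative; the fused kernel implements this on-device.

#include <algorithm>
#include <cmath>
#include <functional>
#include <utility>
#include <vector>

// Returns the selected expert indices; normalized routing weights go to outWeights.
std::vector<int> noAuxTcReference(std::vector<float> const& logits, std::vector<float> const& bias,
    int nGroup, int topkGroup, int topk, float routedScalingFactor, std::vector<float>& outWeights)
{
    int const numExperts = static_cast<int>(logits.size());
    int const groupSize = numExperts / nGroup;

    std::vector<float> scores(numExperts), biased(numExperts);
    for (int e = 0; e < numExperts; ++e)
    {
        scores[e] = 1.0f / (1.0f + std::exp(-logits[e])); // sigmoid
        biased[e] = scores[e] + bias[e];
    }

    // Score each group by the sum of its top-2 biased scores; keep topkGroup groups.
    std::vector<std::pair<float, int>> groupScores;
    for (int g = 0; g < nGroup; ++g)
    {
        std::vector<float> grp(biased.begin() + g * groupSize, biased.begin() + (g + 1) * groupSize);
        std::partial_sort(grp.begin(), grp.begin() + 2, grp.end(), std::greater<float>());
        groupScores.emplace_back(grp[0] + grp[1], g);
    }
    std::partial_sort(groupScores.begin(), groupScores.begin() + topkGroup, groupScores.end(), std::greater<>());

    // Top-k experts among the surviving groups, ranked by biased score.
    std::vector<std::pair<float, int>> candidates;
    for (int i = 0; i < topkGroup; ++i)
    {
        int const g = groupScores[i].second;
        for (int e = g * groupSize; e < (g + 1) * groupSize; ++e)
        {
            candidates.emplace_back(biased[e], e);
        }
    }
    std::partial_sort(candidates.begin(), candidates.begin() + topk, candidates.end(), std::greater<>());

    // Weights use the unbiased scores, renormalized and scaled (note the 1e-20 epsilon).
    std::vector<int> indices(topk);
    outWeights.assign(topk, 0.0f);
    float norm = 0.0f;
    for (int k = 0; k < topk; ++k)
    {
        indices[k] = candidates[k].second;
        norm += scores[indices[k]];
    }
    for (int k = 0; k < topk; ++k)
    {
        outWeights[k] = scores[indices[k]] * routedScalingFactor / (norm + 1e-20f);
    }
    return indices;
}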
sequenceDiagram
  autonumber
  participant Run as Routing::run()
  participant Sel as getMaxNumExperts()
  participant LM as LAUNCH_ROUTING_* macro
  participant KR as Kernel (Histogram/Offsets/Block/Cluster)
  Note over Run: RoutingDeepSeek/Llama4/Renormalize
  Run->>Sel: numExperts
  Sel-->>Run: MaxNumExperts (e.g., 128/256/384/512)
  Run->>LM: choose launch variant (cooperative/grouped/etc.)
  LM->>KR: launch __launch_bounds__(KernelParams::MaxNumExperts)
  Note over KR: Uses dynamic MaxNumExperts for indexing,<br/>shared mem, and reductions
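Illustratively, the dispatch the second diagram describes could look like the following host-side sketch. RoutingParams, routingKernel, and the exact bucket set are assumptions for illustration, not the library's actual definitions.

#include <cstdint>
#include <cuda_runtime.h>

struct RoutingParams
{
    int32_t numExperts;
    int32_t numBlocks;
    // ... routing inputs/outputs elided ...
};

// One compile-time specialization per MaxNumExperts bucket; __launch_bounds__
// ties the register budget to the block size used at launch.
template <int MaxNumExperts>
__global__ void __launch_bounds__(MaxNumExperts) routingKernel(RoutingParams params)
{
    // One thread per (padded) expert slot; the real kernels do the
    // histogram/offset/top-k work here using MaxNumExperts for indexing.
}

inline int32_t getMaxNumExpertsBucket(int32_t numExperts)
{
    if (numExperts <= 128) { return 128; }
    if (numExperts <= 256) { return 256; }
    if (numExperts <= 384) { return 384; }
    return 512;
}

void launchRouting(RoutingParams const& params, cudaStream_t stream)
{
    // blockDim.x must match the MaxNumExperts instantiation (see the
    // block-stride invariant discussed in the review comments below).
    switch (getMaxNumExpertsBucket(params.numExperts))
    {
    case 128: routingKernel<128><<<params.numBlocks, 128, 0, stream>>>(params); break;
    case 256: routingKernel<256><<<params.numBlocks, 256, 0, stream>>>(params); break;
    case 384: routingKernel<384><<<params.numBlocks, 384, 0, stream>>>(params); break;
    default: routingKernel<512><<<params.numBlocks, 512, 0, stream>>>(params); break;
    }
}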

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 15.87%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check ⚠️ Warning: The pull request description does not follow the repository's template: it omits the required @coderabbitai summary tag and lacks a dedicated "## Description" section, instead starting with a custom header and leaving template instructions in place. Revise the description to include the @coderabbitai summary directive, add a "## Description" section that explains the problem and solution, and remove leftover template comments.

✅ Passed checks (1 passed)

  • Title Check ✅ Passed: The title clearly summarizes the two main changes in the pull request (optimizing the DeepseekV3 routing kernel and adding support for 384 experts), using the correct ticket format and concise, descriptive phrasing that reflects the PR objectives.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

575-581: Fix undefined variable in debug path (scores_with_bias).

scores_with_bias is used before being defined. Compute it from logits/bias in debug mode.

Apply this diff:

-        if enable_llm_debug():
-            has_nan = torch.isnan(scores_with_bias).any()
-            if has_nan:
-                warnings.warn(
-                    "Detected NAN in the tensor scores_with_bias. Please check if it matches the expectation."
-                )
+        if enable_llm_debug():
+            debug_scores_with_bias = torch.sigmoid(logits) + e_score_correction_bias
+            if torch.isnan(debug_scores_with_bias).any():
+                warnings.warn(
+                    "Detected NaN in scores_with_bias (debug check)."
+                )

1-27: Add NVIDIA Apache‑2.0 header (retain attribution).

File lacks the required NVIDIA Apache‑2.0 header. Keep the DeepSeek attribution block below it.

Apply this diff (update year as needed):

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # --------------------------------------------------
 # Portions of this code were derived from DeepSeek‑V3:
 #   https://github.com/deepseek-ai/DeepSeek-V3

As per coding guidelines

cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)

81-85: Use TORCH_CHECK instead of throwing exceptions

Avoid throwing across the extension boundary. Prefer TORCH_CHECK for invalid dtype.

-        throw std::invalid_argument("Invalid dtype, only supports float16, float32, and bfloat16");
+        TORCH_CHECK(false, "Invalid dtype, only supports float16, float32, and bfloat16");

As per coding guidelines

cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (2)

65-69: Fix potential aliasing UB in unpack()

Reinterpreting a 64‑bit value as 32‑bit UnsignedBits by reference is unsafe. Use a width‑matched temporary.

-        auto compactTmp = cmp >> kMoveBits;
-        auto valueBits
-            = cub::Traits<T>::TwiddleOut(reinterpret_cast<typename cub::Traits<T>::UnsignedBits&>(compactTmp));
-        value = reinterpret_cast<T&>(valueBits);
+        auto compactTmp = cmp >> kMoveBits;
+        using UBits = typename cub::Traits<T>::UnsignedBits;
+        UBits ubits = static_cast<UBits>(compactTmp);
+        UBits twiddled = cub::Traits<T>::TwiddleOut(ubits);
+        value = reinterpret_cast<T&>(twiddled);

As per coding guidelines


184-192: Misleading static_assert message

N is constrained to N < 5, but message claims “<= 128”. Fix message to avoid confusion.

-    static_assert(N < 5, "Only support candidates number less than or equal to 128");
+    static_assert(N < 5, "Only support up to 4 candidates per thread (N <= 4)");
🧹 Nitpick comments (21)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)

118-130: Consider extracting magic numbers to named constants.

The validation checks use several magic number literals (8, 10, 4) for top_k and topk_group bounds. Per coding guidelines, prefer named constants over magic literals for comparisons and logic.

Consider refactoring to use named constants:

namespace {
constexpr int32_t kMAX_TOP_K_GROUPED = 8;
constexpr int32_t kMAX_TOP_K_NO_GROUP = 10;
constexpr int32_t kMAX_TOPK_GROUP = 4;
} // namespace

Then update the checks:

-        TORCH_CHECK(top_k <= 8 && top_k > 0, "Current routing kernel (with groups) only supports top_k<=8 && top_k>0.");
+        TORCH_CHECK(top_k <= kMAX_TOP_K_GROUPED && top_k > 0, "Current routing kernel (with groups) only supports top_k<=8 && top_k>0.");
-        TORCH_CHECK(topk_group.value() <= 4 && topk_group.value() > 0,
-            "Current routing kernel only (with groups) supports topk_group<=4 && topk_group > 0.");
+        TORCH_CHECK(topk_group.value() <= kMAX_TOPK_GROUP && topk_group.value() > 0,
+            "Current routing kernel only (with groups) supports topk_group<=4 && topk_group > 0.");
     }
     else if (static_cast<RoutingMethodType>(routing_method_type) == RoutingMethodType::Renormalize
         || static_cast<RoutingMethodType>(routing_method_type) == RoutingMethodType::RenormalizeNaive)
     {
-        TORCH_CHECK(top_k <= 10 && top_k > 0,
+        TORCH_CHECK(top_k <= kMAX_TOP_K_NO_GROUP && top_k > 0,
             "Current routing kernel (no groups, renormalize) only supports top_k<=10 && top_k>0.");

As per coding guidelines.

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)

2-2: Update copyright year to 2025.

The copyright header should include the current year (2025) per coding guidelines.

As per coding guidelines.

Apply this diff:

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

568-570: Prefer torch.sigmoid over F.sigmoid.

F.sigmoid is deprecated; use torch.sigmoid.

Apply this diff:

-        scores = F.sigmoid(logits)
+        scores = torch.sigmoid(logits)
tests/unittest/_torch/thop/parallel/test_noaux_tc.py (1)

77-80: Remove commented-out debug code.

The commented-out print statements should be removed entirely rather than left in the codebase as they serve no purpose and reduce code cleanliness.

-    # print(sorted_selected_values)
-    # print(ref_sorted_selected_values)
-    # print(selected_indices)
-    # print(ref_selected_indices)
     # compare
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh (5)

37-38: Constant naming/style per coding guidelines

Rename constants to k-prefixed UPPER_SNAKE_CASE and consider placing in a single header. Example: kMAX_NUM_EXPERTS_UNIT, kMAX_NUM_TOPK.

As per coding guidelines


189-190: Incorrect static_assert message for N constraint

Message says “<= 128” but the code requires N < 5 (per-thread candidates ≤ 4). Fix the message for clarity.

Apply this diff:

-    static_assert(N < 5, "Only support candidates number less than or equal to 128");
+    static_assert(N < 5, "Only support per-thread candidate count N <= 4");

Based on learnings


223-224: Inconsistent static_assert message for N limit

You assert N <= 16 but message references 16*32=512. Clarify that N is per-thread.

Apply this diff:

-    static_assert(N <= 16, "Only support candidates number less than or equal to 16*32=512");
+    static_assert(N <= 16, "Only support per-thread candidates N <= 16");

Based on learnings


240-244: Fragile/unnecessary idx initialization pattern

Initializing topKBufferIdx with ii*WarpSize - 1 risks negative tie-breakers affecting compaction; use -1 directly for clarity.

Apply this diff:

-        for (int ii = 0; ii < numResults; ++ii)
-        {
-            topKBufferValue[ii] = minValue;
-            topKBufferIdx[ii] = ii * WarpSize - 1; //@todo: check if this is correct
-        }
+        for (int ii = 0; ii < numResults; ++ii)
+        {
+            topKBufferValue[ii] = minValue;
+            topKBufferIdx[ii] = -1;
+        }

163-180: actualK handling

Remove the TODO and clamp actualK to [0, K] to avoid surprises.

Apply this diff:

-    for (int kk = 0; kk < actualK; ++kk) //@todo: check if actualK is correct
+    actualK = max(0, min(actualK, K));
+    for (int kk = 0; kk < actualK; ++kk)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (1)

269-277: numTokens comment mismatch

Comment says 10, value is 1024. Align comment or value to avoid confusion.

cpp/tensorrt_llm/kernels/noAuxTcKernels.h (2)

1-16: Update copyright year

File headers should include current year (2025).

Apply this diff:

- * Copyright (c) 2019-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2019-2025, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines


18-33: Prefer include guards over pragma once

Replace #pragma once with include guards TRTLLM_NOAUXTCKERNELS_H for consistency.

As per coding guidelines

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (1)

539-549: BlockScan template parameterization

BlockScan<int32_t, KernelParams::MaxNumExperts> assumes MaxNumExperts is within CUB limits and matches blockDim.x. If future instantiations use 384+, this becomes expensive. Consider warp scans + CTA reduction if needed.
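A minimal sketch of the CUB block-scan pattern in question, assuming the block is launched with exactly MaxNumExperts threads (the invariant noted above); KernelParams here is an illustrative stand-in.

#include <cstdint>
#include <cub/cub.cuh>

// Exclusive prefix sum of per-expert token counts across the whole block.
// All threads of the block must call this together (collective operation).
template <typename KernelParams>
__device__ int32_t expertOffsetsExclusiveSum(int32_t myExpertCount)
{
    using BlockScan = cub::BlockScan<int32_t, KernelParams::MaxNumExperts>;
    __shared__ typename BlockScan::TempStorage tempStorage;
    int32_t myOffset = 0;
    BlockScan(tempStorage).ExclusiveSum(myExpertCount, myOffset);
    return myOffset;
}

At 384 or 512 threads this still fits CUB's limits, but the suggested warp-scan-plus-CTA-combine would trade a wider collective for a two-level reduction.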

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (2)

41-66: Dead typedef and comment cleanup

DataTypeVec alias is unused; remove or use for vectorized loads later.


473-485: getMaxNumExperts() only supports <=128

This hard-caps Llama4 path; that’s fine if by design. Add a brief comment referencing the 128‑expert limit to avoid confusion with 384‑expert support elsewhere.

cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)

55-55: Stream from scores is fine; consider const-correct data pointers

Not a blocker, but these inputs are read-only. If feasible, change kernel signatures to accept const T* and use data_ptr() here.

cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1)

165-182: Single‑value reduceTopK: tie‑break relies on kMaxIdx; document or assert idx range

If idx can exceed 65535 in any caller, tie‑break fails. Consider static_assert or comment the constraint.
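To make the constraint concrete, here is an illustrative packed comparison key of the kind the tie-break implies: order-preserving value bits in the high word and (kMaxIdx - idx) in the low 16 bits, so one max-comparison prefers larger values and, on ties, smaller indices. Names and layout are assumptions, not the header's actual implementation; an idx above 65535 would wrap and corrupt the ordering.

#include <cstdint>
#include <cstring>

constexpr uint32_t kMaxIdx = 65535u;

inline uint64_t packKey(float value, uint32_t idx)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    // Order-preserving float twiddle: negatives flip all bits, positives flip the sign bit.
    bits = (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
    return (static_cast<uint64_t>(bits) << 16) | static_cast<uint64_t>(kMaxIdx - idx);
}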

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2)

172-176: Intermediate top‑K staging looks sound; minor clarity nits

Logic matches the generic reduce path. Consider naming NumInterTopKPerThread -> kNumInterTopKPerThread for const and add a brief comment on layout.

Also applies to: 184-226


631-636: Dynamic threads selection

numThreadsMain chooses 256 vs 384. Consider using getMaxNumExperts(data.mNumExperts) for consistency with histogram threads.

cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)

239-253: Avoid magic epsilon; name it

Replace 1e-20 with a named constant to document intent and avoid magic numbers.

-        auto finalScore = static_cast<OutputT>(scoreNorm * routedScalingFactor / (redNorm + 1e-20));
+        constexpr float kEPS = 1e-20f;
+        auto finalScore = static_cast<OutputT>(scoreNorm * routedScalingFactor / (redNorm + kEPS));

As per coding guidelines

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (1)

407-422: Macro name typo: LAUNCH_ROUTING_RENORNALIZE

Consider renaming to LAUNCH_ROUTING_RENORMALIZE for consistency and grep‑ability across files.

📜 Review details

Configuration used: Path: .coderabbit.yaml


📥 Commits

Reviewing files that changed from the base of the PR and between 9ff9fa6 and 0614625.

📒 Files selected for processing (18)
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (4 hunks)
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (2 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (15 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (11 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (6 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh (3 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (10 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (9 hunks)
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp (2 hunks)
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (5 hunks)
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
  • tests/unittest/_torch/thop/parallel/test_moe.py (5 hunks)
  • tests/unittest/_torch/thop/parallel/test_noaux_tc.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • tests/unittest/_torch/thop/parallel/test_moe.py
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • tests/unittest/_torch/thop/parallel/test_noaux_tc.py
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • tests/unittest/_torch/thop/parallel/test_moe.py
  • cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • tests/unittest/_torch/thop/parallel/test_noaux_tc.py
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/thop/noAuxTcOp.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/_torch/thop/parallel/test_moe.py
  • tests/unittest/_torch/thop/parallel/test_noaux_tc.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/noAuxTcKernels.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh
  • cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
🧠 Learnings (2)
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
PR: NVIDIA/TensorRT-LLM#7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
🧬 Code graph analysis (11)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • top_k (233-233)
tests/unittest/_torch/thop/parallel/test_moe.py (5)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (8)
  • param (47-152)
  • param (47-47)
  • param (154-165)
  • param (154-154)
  • param (167-176)
  • param (167-167)
  • param (203-209)
  • param (203-204)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (6)
  • param (39-144)
  • param (39-39)
  • param (146-152)
  • param (146-146)
  • param (184-190)
  • param (184-185)
cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (2)
  • param (39-91)
  • param (39-39)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • RoutingMethodType (39-124)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1)
  • RoutingMethodType (143-155)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (8)
  • param (47-152)
  • param (47-47)
  • param (154-165)
  • param (154-154)
  • param (167-176)
  • param (167-167)
  • param (203-209)
  • param (203-204)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (6)
  • routingTopKExperts (36-39)
  • void (83-240)
  • void (316-319)
  • void (325-384)
  • getMaxNumExperts (388-403)
  • getMaxNumExperts (388-388)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (5)
  • __launch_bounds__ (402-402)
  • getMaxNumExperts (474-485)
  • getMaxNumExperts (474-474)
  • routingIndicesClusterKernel (312-390)
  • routingIndicesClusterKernel (392-392)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (6)
  • __launch_bounds__ (83-83)
  • __launch_bounds__ (325-325)
  • getMaxNumExperts (388-403)
  • getMaxNumExperts (388-388)
  • routingIndicesClusterKernel (245-314)
  • routingIndicesClusterKernel (316-316)
cpp/tensorrt_llm/kernels/noAuxTcKernels.h (1)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (3)
  • void (44-258)
  • invokeNoAuxTc (261-306)
  • invokeNoAuxTc (261-263)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (2)
  • top_k (233-233)
  • num_experts (226-226)
cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (2)
  • invokeNoAuxTc (261-306)
  • invokeNoAuxTc (261-263)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (14)
  • _ (59-75)
  • _ (79-81)
  • _ (84-87)
  • _ (90-95)
  • _ (106-107)
  • _ (110-122)
  • _ (125-130)
  • _ (133-138)
  • _ (141-146)
  • _ (149-161)
  • _ (164-170)
  • _ (173-174)
  • _ (177-180)
  • _ (184-185)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (4)
  • num_experts (226-226)
  • n_group (234-234)
  • top_k (233-233)
  • topk_group (236-236)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (3)
  • void (35-277)
  • void (301-304)
  • void (500-503)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (5)
  • void (35-277)
  • void (301-304)
  • void (500-503)
  • getMaxNumExperts (506-525)
  • getMaxNumExperts (506-506)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (6)
  • void (70-307)
  • void (392-395)
  • void (402-471)
  • getMaxNumExperts (474-485)
  • getMaxNumExperts (474-474)
  • routingIndicesHistogramScoresKernel (402-402)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (24)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)

129-130: LGTM! Verify asymmetry is intentional.

The top_k limit increase from 8 to 10 for no-group Renormalize routing is correctly implemented, with both the condition and error message updated consistently.

Note that grouped routing (line 118) retains the top_k <= 8 limit while no-group routing now allows top_k <= 10. This asymmetry appears intentional per the PR objectives, but please confirm this is by design and that the underlying kernel implementations support these different limits.

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (2)

133-133: Define and use a named constant for max experts and verify kernel support
Introduce in fp4BlockScaleMoe.cpp:

namespace {
    constexpr int32_t kMAX_NUM_EXPERTS = 512;
} // namespace

Then replace the check:

- TORCH_CHECK(num_experts <= 512, "num_experts must be less than or equal to 512");
+ TORCH_CHECK(num_experts <= kMAX_NUM_EXPERTS,
+     "num_experts must be ≤ " + std::to_string(kMAX_NUM_EXPERTS));

No existing device-code limit for 512 was found—manually confirm that all KernelParams::MaxNumExperts instantiations and __launch_bounds__ accommodate 512 experts.


205-205: Replace magic literal 256 * 2 with a named constant and document its rationale
Define a kMinExpertCountHistogramSize and add a comment explaining why the fallback minimum remains 512 instead of scaling with num_experts * 2.
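A hedged sketch of the suggested cleanup; the constant name and the surrounding sizing logic are assumptions about the file's code, not verbatim from it.

#include <algorithm>
#include <cstdint>

// Floor for the expert-count histogram scratch size, replacing the 256 * 2
// literal; the rationale for keeping the fallback at 512 rather than scaling
// purely with num_experts * 2 should be documented here.
constexpr int32_t kMinExpertCountHistogramSize = 512;

inline int32_t expertHistogramSize(int32_t numExperts)
{
    return std::max(kMinExpertCountHistogramSize, numExperts * 2);
}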

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

632-634: Ignore the custom op binding mismatch warning: the stub and C++ registrations both accept (scores, scores_with_bias, n_group, topk_group, topk, routed_scaling_factor), matching the call's six arguments; no change needed.

Likely an incorrect or invalid review comment.


582-590: Avoid mutating self.is_fused; use a per-call fused_enabled flag

  • Mutating self.is_fused persists fusion state across calls—introduce a local fused_enabled in forward() and update it instead.
  • Confirm whether the fused kernels now support top_k > 8 (e.g., up to 10) and adjust the threshold accordingly.
tests/unittest/_torch/thop/parallel/test_moe.py (3)

878-878: LGTM! Configuration aligns with PR objectives.

Adding the (384, 1, 1, 8) configuration properly expands test coverage for the 384-expert support mentioned in the PR objectives for KIMI K2 and Qwen-next models.


1061-1072: LGTM! Test coverage for Qwen-next routing added.

The new test case with 512 experts and top_k=10 correctly validates the expanded routing capabilities for the Renormalize method used by Qwen-next models.


1145-1156: LGTM! Consistent test coverage in no-autotune path.

The duplicate test case in the no-autotune section ensures consistent validation of 512 experts with top_k=10 across both autotuned and non-autotuned code paths.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (3)

120-134: LGTM! Llama4-specific optimization with fixed expert count.

The hardcoded 128 value for Llama4 routing is a valid compile-time optimization. The comment clearly indicates this is intentional and specific to the Llama4 routing method.


175-200: LGTM! Simplified macro variant without forceFloatInput.

The new LAUNCH_ROUTING_WITH_NUM_EXPERTS macro provides a cleaner interface for cases where forceFloatInput is not needed, properly threading the numExperts parameter through all dispatch paths.


136-171: Verified numExperts consistency – all LAUNCH_ROUTING_WITH_NUM_EXPERTS_FORCE_FLOAT_INPUT calls pass the exact compile-time constants (topk::MaxNumExpertsUnit, NumDeepseekExperts, NumKimiK2Experts) returned by getMaxNumExperts and matching the KernelParams::MaxNumExperts template parameter.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (2)

108-113: LGTM! Per-kernel expert count parameterization introduced.

The addition of MaxNumExperts_ as a template parameter enables compile-time specialization based on expert count, which should improve performance for fixed-size configurations like 384 experts.


183-184: LGTM! Consistent template parameter propagation.

All three routing kernel param specializations (routingDeepSeek, routingLlama4, routingRenormalize) correctly inherit the new MaxNumExperts_ parameter from KernelParamsBase and propagate it through their template hierarchies.

Also applies to: 242-243, 284-285

tests/unittest/_torch/thop/parallel/test_noaux_tc.py (1)

11-11: LGTM! Test coverage for 384 experts added.

The addition of (384, 1, 1, 8) to the test parameterization properly validates the 384-expert support mentioned in the PR objectives.

cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1)

305-336: LGTM! Comprehensive test coverage for large expert counts added.

The three new test cases (BlockLevelParallelizationLargeN, ClusterLevelParallelizationLargeN, DeviceLevelParallelizationLargeN) provide thorough validation of the 512-expert, top_k=10 configuration across different parallelization strategies, ensuring correctness at multiple scaling levels.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (2)

438-469: Block stride assumes blockDim.x == KernelParams::MaxNumExperts

gridBlockOffset/gridStride use KernelParams::MaxNumExperts. Ensure all launches set blockDim.x to match the KernelParams instantiation; otherwise, indexing is off.

Consider asserting at runtime in launcher or documenting this invariant.


598-605: Last-warp election relies on exact warp count

Condition uses KernelParams::MaxNumExperts / WarpSize - 1. Ensure MaxNumExperts is a multiple of WarpSize (it appears intended). If not, guard with ceil division or an explicit assert.
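A minimal sketch of the suggested guard, assuming WarpSize == 32; ceil division keeps the last-warp election correct even if MaxNumExperts is not a multiple of WarpSize. Names are illustrative.

constexpr int kWarpSize = 32;

template <typename KernelParams>
__device__ bool isLastWarp()
{
    static_assert(KernelParams::MaxNumExperts > 0, "MaxNumExperts must be positive");
    // Ceil division: a partial final warp still counts as the last warp.
    int const numWarps = (KernelParams::MaxNumExperts + kWarpSize - 1) / kWarpSize;
    return static_cast<int>(threadIdx.x) / kWarpSize == numWarps - 1;
}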

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (1)

489-510: Explicit constraints checks look good

TopK<=1, experts<=128, experts%4==0, padding<8. LGTM.

Ensure LAUNCH_ROUTING_LLAMA4 instantiates KernelParams with MaxNumExperts==numThreadsHist (128) so launch_bounds and shared array sizes match the launch.

cpp/tensorrt_llm/kernels/noAuxTcKernels.h (1)

28-32: Public API change verified: all invokeNoAuxTc calls and explicit instantiations now include the new bias parameter and match the updated signature.

cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)

61-79: Casting pattern is OK; ensure contiguity

Ensure scores/bias/topk tensors are contiguous to match kernel assumptions.

Run at-call-site or add:

+    TORCH_CHECK(scores.is_contiguous() && bias.is_contiguous()
+        && topk_values.is_contiguous() && topk_indices.is_contiguous(),
+        "All tensors must be contiguous");

As per coding guidelines

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (1)

575-594: Top‑K bounds checks

Good guardrails for MaxNumTopExperts and warp constraints. Confirm topk::MaxNumTopK aligns with test matrices (K up to 8).

cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)

18-37: Static constants: scope and naming OK

WARP_SIZE/MaxNumTop* constants conform to guidelines. LGTM.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (2)

24-34: Top‑K increased to 10; ensure downstream buffers account for 10

Confirm all packed Top‑K paths allocate 10 (weights/ids). Looks consistent here.


488-507: Histogram/offset launches use getMaxNumExperts

Good use of per‑kernel MaxNumExperts. LGTM.

@tensorrt-cicd
Collaborator

PR_Github #21230 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16023 completed with status: 'FAILURE'

@kaiyux kaiyux changed the title [None][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for 384 experts (MoE TRTLLM backend) [TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for 384 experts (MoE TRTLLM backend) Oct 14, 2025
@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch 2 times, most recently from 325fa13 to 52c02ef on October 16, 2025 04:20
Collaborator

@MatthiasKohl MatthiasKohl left a comment


LGTM.
I just had a question about the structure, but this should not hold the PR back, and can always be addressed in a future PR (if it even needs to be addressed).

@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from 52c02ef to 4dc2944 on October 16, 2025 14:28
@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21587 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21587 [ run ] completed with state DISABLED
L0 testing is limited to prioritized users. User ChristinaZ is not in the prioritized list. L0 testing cannot be triggered.

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21649 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21649 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16312 completed with status: 'FAILURE'

@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from f155ee3 to 7d74701 on October 17, 2025 09:17
@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21711 [ run ] completed with state SUCCESS. Commit: fc17459
/LLM/main/L0_MergeRequest_PR pipeline #16360 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21739 [ run ] triggered by Bot. Commit: fc17459

@tensorrt-cicd
Collaborator

PR_Github #21739 [ run ] completed with state FAILURE. Commit: fc17459
/LLM/main/L0_MergeRequest_PR pipeline #16383 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21743 [ run ] triggered by Bot. Commit: fc17459

@tensorrt-cicd
Collaborator

PR_Github #21743 [ run ] completed with state SUCCESS. Commit: fc17459
/LLM/main/L0_MergeRequest_PR pipeline #16386 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21747 [ run ] triggered by Bot. Commit: fc17459

@tensorrt-cicd
Collaborator

PR_Github #21747 [ run ] completed with state SUCCESS. Commit: fc17459
/LLM/main/L0_MergeRequest_PR pipeline #16389 completed with status: 'FAILURE'

…TLASS backend

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@ChristinaZ ChristinaZ force-pushed the feat_large_experts_moe_trtllm branch from fc17459 to d2202e8 on October 18, 2025 14:29
@ChristinaZ
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21764 [ run ] triggered by Bot. Commit: d2202e8

@tensorrt-cicd
Collaborator

PR_Github #21764 [ run ] completed with state SUCCESS. Commit: d2202e8
/LLM/main/L0_MergeRequest_PR pipeline #16403 completed with status: 'FAILURE'

@ChristinaZ
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21781 [ run ] triggered by Bot. Commit: d2202e8

@tensorrt-cicd
Collaborator

PR_Github #21781 [ run ] completed with state SUCCESS. Commit: d2202e8
/LLM/main/L0_MergeRequest_PR pipeline #16419 completed with status: 'FAILURE'

@kaiyux
Member

kaiyux commented Oct 19, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21797 [ run ] triggered by Bot. Commit: d2202e8

@tensorrt-cicd
Collaborator

PR_Github #21797 [ run ] completed with state SUCCESS. Commit: d2202e8
/LLM/main/L0_MergeRequest_PR pipeline #16431 completed with status: 'SUCCESS'

@ChristinaZ ChristinaZ merged commit c8b9998 into NVIDIA:main Oct 20, 2025
5 checks passed
govind-ramnarayan pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Oct 21, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Oct 24, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761)

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
