[TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for 384 experts (MoE TRTLLM backend) #7761
Conversation
/bot run

PR_Github #21230 [ run ] triggered by Bot
📝 Walkthrough

Refactors and generalizes MoE routing/top-K kernels to support larger expert counts and top-k up to 10, adds generic warp top-K reductions, and parameterizes launches by per-kernel MaxNumExperts. Public API changes include noAuxTc switching from group_scores/scores_with_bias to bias. Tests and PyTorch integration are updated accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant PT as PyTorch Op (noaux_tc_op)
    participant K as invokeNoAuxTc
    participant G as deepseek_v3_topk_kernel
    participant R as reduceTopK(…)
    participant HW as GPU Warps
    PT->>K: scores, bias, n_group, topk_group, topk, routed_scaling_factor
    K->>G: launch with params (groups, topk, bias)
    Note over G: Compute per-expert scores (+bias)<br/>Group mapping and partial reductions
    loop per warp/group
        G->>R: reduceTopK (values, indices, K)
        R->>HW: warp-level top-K selection
        HW-->>R: top-K per warp
    end
    G-->>K: topk_values, topk_indices
    K-->>PT: return tensors
```

```mermaid
sequenceDiagram
    autonumber
    participant Run as Routing::run()
    participant Sel as getMaxNumExperts()
    participant LM as LAUNCH_ROUTING_* macro
    participant KR as Kernel (Histogram/Offsets/Block/Cluster)
    Note over Run: RoutingDeepSeek/Llama4/Renormalize
    Run->>Sel: numExperts
    Sel-->>Run: MaxNumExperts (e.g., 128/256/384/512)
    Run->>LM: choose launch variant (cooperative/grouped/etc.)
    LM->>KR: launch __launch_bounds__(KernelParams::MaxNumExperts)
    Note over KR: Uses dynamic MaxNumExperts for indexing,<br/>shared mem, and reductions
```
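The routing flow in the first diagram can be mirrored host-side for intuition. Below is a minimal Python reference sketch, not the kernel code; the sigmoid-plus-bias selection, top-2-per-group group scoring, and renormalization details are assumptions pieced together from the walkthrough and DeepSeek-V3's published routing:

```python
import math

def noaux_tc_reference(logits, bias, n_group, topk_group, topk, routed_scaling_factor):
    """Host-side reference for DeepSeek-V3-style no-aux top-k routing (sketch only)."""
    num_experts = len(logits)
    group_size = num_experts // n_group
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]   # sigmoid of routing logits
    biased = [s + b for s, b in zip(scores, bias)]          # bias steers selection only

    # Rank groups by the sum of their top-2 biased scores; keep topk_group groups.
    def group_score(g):
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        return sum(members[:2])

    kept = sorted(range(n_group), key=group_score, reverse=True)[:topk_group]

    # Top-k experts within the kept groups, ranked by biased score.
    candidates = [e for g in kept for e in range(g * group_size, (g + 1) * group_size)]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:topk]

    # Final weights use the unbiased scores, renormalized and scaled.
    norm = sum(scores[e] for e in chosen) + 1e-20
    weights = [scores[e] * routed_scaling_factor / norm for e in chosen]
    return chosen, weights
```

With zero bias and all the signal concentrated in one group, the reference keeps that group and the returned weights sum to roughly routed_scaling_factor, which is the behavior the kernel-side renormalization is meant to preserve.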
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
575-581: Fix undefined variable in debug path (scores_with_bias).

scores_with_bias is used before being defined. Compute it from logits/bias in debug mode.
Apply this diff:
```diff
-        if enable_llm_debug():
-            has_nan = torch.isnan(scores_with_bias).any()
-            if has_nan:
-                warnings.warn(
-                    "Detected NAN in the tensor scores_with_bias. Please check if it matches the expectation."
-                )
+        if enable_llm_debug():
+            debug_scores_with_bias = torch.sigmoid(logits) + e_score_correction_bias
+            if torch.isnan(debug_scores_with_bias).any():
+                warnings.warn(
+                    "Detected NaN in scores_with_bias (debug check)."
+                )
```
1-27: Add NVIDIA Apache-2.0 header (retain attribution).

File lacks the required NVIDIA Apache-2.0 header. Keep the DeepSeek attribution block below it.
Apply this diff (update year as needed):
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # --------------------------------------------------
 # Portions of this code were derived from DeepSeek-V3:
 # https://github.com/deepseek-ai/DeepSeek-V3
```

As per coding guidelines
cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)
81-85: Use TORCH_CHECK instead of throwing exceptions

Avoid throwing across the extension boundary. Prefer TORCH_CHECK for invalid dtype.
```diff
-        throw std::invalid_argument("Invalid dtype, only supports float16, float32, and bfloat16");
+        TORCH_CHECK(false, "Invalid dtype, only supports float16, float32, and bfloat16");
```

As per coding guidelines
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (2)
65-69: Fix potential aliasing UB in unpack()

Reinterpreting a 64-bit value as 32-bit UnsignedBits by reference is unsafe. Use a width-matched temporary.
```diff
-        auto compactTmp = cmp >> kMoveBits;
-        auto valueBits
-            = cub::Traits<T>::TwiddleOut(reinterpret_cast<typename cub::Traits<T>::UnsignedBits&>(compactTmp));
-        value = reinterpret_cast<T&>(valueBits);
+        auto compactTmp = cmp >> kMoveBits;
+        using UBits = typename cub::Traits<T>::UnsignedBits;
+        UBits ubits = static_cast<UBits>(compactTmp);
+        UBits twiddled = cub::Traits<T>::TwiddleOut(ubits);
+        value = reinterpret_cast<T&>(twiddled);
```

As per coding guidelines
184-192: Misleading static_assert message

N is constrained to N < 5, but the message claims "<= 128". Fix the message to avoid confusion.
```diff
-    static_assert(N < 5, "Only support candidates number less than or equal to 128");
+    static_assert(N < 5, "Only support up to 4 candidates per thread (N <= 4)");
```
🧹 Nitpick comments (21)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)
118-130: Consider extracting magic numbers to named constants.

The validation checks use several magic number literals (8, 10, 4) for top_k and topk_group bounds. Per coding guidelines, prefer named constants over magic literals for comparisons and logic.
Consider refactoring to use named constants:
```cpp
namespace
{
constexpr int32_t kMAX_TOP_K_GROUPED = 8;
constexpr int32_t kMAX_TOP_K_NO_GROUP = 10;
constexpr int32_t kMAX_TOPK_GROUP = 4;
} // namespace
```

Then update the checks:
```diff
-        TORCH_CHECK(top_k <= 8 && top_k > 0, "Current routing kernel (with groups) only supports top_k<=8 && top_k>0.");
+        TORCH_CHECK(top_k <= kMAX_TOP_K_GROUPED && top_k > 0,
+            "Current routing kernel (with groups) only supports top_k<=8 && top_k>0.");
-        TORCH_CHECK(topk_group.value() <= 4 && topk_group.value() > 0,
-            "Current routing kernel only (with groups) supports topk_group<=4 && topk_group > 0.");
+        TORCH_CHECK(topk_group.value() <= kMAX_TOPK_GROUP && topk_group.value() > 0,
+            "Current routing kernel only (with groups) supports topk_group<=4 && topk_group > 0.");
     }
     else if (static_cast<RoutingMethodType>(routing_method_type) == RoutingMethodType::Renormalize
         || static_cast<RoutingMethodType>(routing_method_type) == RoutingMethodType::RenormalizeNaive)
     {
-        TORCH_CHECK(top_k <= 10 && top_k > 0,
+        TORCH_CHECK(top_k <= kMAX_TOP_K_NO_GROUP && top_k > 0,
             "Current routing kernel (no groups, renormalize) only supports top_k<=10 && top_k>0.");
```

As per coding guidelines.
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
2-2: Update copyright year to 2025.

The copyright header should include the current year (2025) per coding guidelines.
As per coding guidelines.
Apply this diff:
```diff
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
```

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
568-570: Prefer torch.sigmoid over F.sigmoid.

F.sigmoid is deprecated; use torch.sigmoid.
Apply this diff:
```diff
-        scores = F.sigmoid(logits)
+        scores = torch.sigmoid(logits)
```

tests/unittest/_torch/thop/parallel/test_noaux_tc.py (1)
77-80: Remove commented-out debug code.

The commented-out print statements should be removed entirely rather than left in the codebase, as they serve no purpose and reduce code cleanliness.
```diff
-    # print(sorted_selected_values)
-    # print(ref_sorted_selected_values)
-    # print(selected_indices)
-    # print(ref_selected_indices)
     # compare
```

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh (5)
37-38: Constant naming/style per coding guidelines

Rename constants to k-prefixed UPPER_SNAKE_CASE and consider placing them in a single header. Example: kMAX_NUM_EXPERTS_UNIT, kMAX_NUM_TOPK.
As per coding guidelines
189-190: Incorrect static_assert message for N constraint

The message says "<= 128" but the code requires N < 5 (per-thread candidates <= 4). Fix the message for clarity.
Apply this diff:
```diff
-    static_assert(N < 5, "Only support candidates number less than or equal to 128");
+    static_assert(N < 5, "Only support per-thread candidate count N <= 4");
```

Based on learnings
223-224: Inconsistent static_assert message for N limit

You assert N <= 16 but the message references 16*32=512. Clarify that N is per-thread.
Apply this diff:
```diff
-    static_assert(N <= 16, "Only support candidates number less than or equal to 16*32=512");
+    static_assert(N <= 16, "Only support per-thread candidates N <= 16");
```

Based on learnings
240-244: Fragile/unnecessary idx initialization pattern

Initializing topKBufferIdx with ii * WarpSize - 1 risks negative tie-breakers affecting compaction; use -1 directly for clarity.
Apply this diff:
```diff
-    for (int ii = 0; ii < numResults; ++ii)
-    {
-        topKBufferValue[ii] = minValue;
-        topKBufferIdx[ii] = ii * WarpSize - 1; //@todo: check if this is correct
-    }
+    for (int ii = 0; ii < numResults; ++ii)
+    {
+        topKBufferValue[ii] = minValue;
+        topKBufferIdx[ii] = -1;
+    }
```
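To make the concern concrete, here is a toy Python model of the compaction step such a buffer later feeds (the helper name is hypothetical; the real code is CUDA). With a uniform -1 sentinel an unfilled slot is unambiguous, whereas seeds like ii * 32 - 1 give slots 1 and up valid-looking indices (31, 63, ...):

```python
def compact_topk(values, indices):
    """Keep only filled top-k slots; a slot is unfilled iff its index is negative."""
    return [(v, i) for v, i in zip(values, indices) if i >= 0]

# Uniform -1 sentinel: the two unfilled slots are dropped cleanly.
filled = compact_topk([0.9, 0.4, -1e30, -1e30], [17, 3, -1, -1])

# Seeded as ii * 32 - 1 instead, slots 1..3 would carry 31, 63, 95, which are
# indistinguishable from real expert ids without also comparing the value
# against the minValue sentinel.
```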
163-180: actualK handling

Remove the TODO and clamp actualK to [1, K] to avoid surprises.
Apply this diff:
```diff
-    for (int kk = 0; kk < actualK; ++kk) //@todo: check if actualK is correct
+    actualK = max(0, min(actualK, K));
+    for (int kk = 0; kk < actualK; ++kk)
```

cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (1)
269-277: numTokens comment mismatch

Comment says 10, but the value is 1024. Align the comment or the value to avoid confusion.
cpp/tensorrt_llm/kernels/noAuxTcKernels.h (2)
1-16: Update copyright year

File headers should include the current year (2025).
Apply this diff:
```diff
- * Copyright (c) 2019-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2019-2025, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines
18-33: Prefer include guards over pragma once

Replace #pragma once with the include guard TRTLLM_NOAUXTCKERNELS_H for consistency.
As per coding guidelines
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (1)
539-549: BlockScan template parameterization

BlockScan<int32_t, KernelParams::MaxNumExperts> assumes MaxNumExperts is within CUB limits and matches blockDim.x. If future instantiations use 384+, this becomes expensive. Consider warp scans plus a CTA reduction if needed.
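The suggested alternative, per-warp scans combined through a scan over warp totals, can be sketched on the host. The 32-lane warp size and the function name are assumptions for illustration:

```python
WARP_SIZE = 32  # assumed lane count

def two_level_exclusive_scan(data):
    """Exclusive prefix sum via per-warp scans plus a scan over warp totals."""
    out = [0] * len(data)
    warp_totals = []
    for w in range(0, len(data), WARP_SIZE):
        running = 0
        for i in range(w, min(w + WARP_SIZE, len(data))):
            out[i] = running          # exclusive: sum of elements before data[i]
            running += data[i]
        warp_totals.append(running)
    # Second level: exclusive scan of warp totals, added back as per-warp offsets.
    offset = 0
    for w, total in enumerate(warp_totals):
        for i in range(w * WARP_SIZE, min((w + 1) * WARP_SIZE, len(data))):
            out[i] += offset
        offset += total
    return out

# e.g. one-token-per-expert counts across 384 experts -> expert offsets
prefix = two_level_exclusive_scan([1] * 384)
```

The appeal over a single block-wide scan is that the first level only ever touches a warp's worth of data, so the cost scales with the number of warps rather than a fixed MaxNumExperts-sized CUB instantiation.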
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (2)
41-66: Dead typedef and comment cleanup

The DataTypeVec alias is unused; remove it, or use it for vectorized loads later.
473-485: getMaxNumExperts() only supports <=128

This hard-caps the Llama4 path; that is fine if by design. Add a brief comment referencing the 128-expert limit to avoid confusion with the 384-expert support elsewhere.
cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)
55-55: Stream from scores is fine; consider const-correct data pointers

Not a blocker, but these inputs are read-only. If feasible, change kernel signatures to accept const T* and use data_ptr() here.
cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (1)
165-182: Single-value reduceTopK: tie-break relies on kMaxIdx; document or assert idx range

If idx can exceed 65535 in any caller, the tie-break fails. Consider a static_assert, or comment the constraint.
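The 65535 bound comes from the packing scheme: order-preserving value bits occupy the high part of a single comparable key, and a complemented 16-bit index occupies the low part, so one max comparison picks the larger value and, on value ties, the smaller index. A Python model of that idea (the bit widths here are illustrative assumptions, not the exact kernel layout, and value_bits is assumed already nonnegative and order-preserving, e.g. twiddled float bits):

```python
K_MAX_IDX = 0xFFFF  # 16-bit index budget; larger indices would corrupt the key

def pack(value_bits, idx):
    """Pack an order-preserving value and an index into one comparable key."""
    assert 0 <= idx <= K_MAX_IDX, "tie-break scheme breaks past 65535"
    # Complementing the index makes the SMALLER index win when values tie.
    return (value_bits << 16) | (K_MAX_IDX - idx)

def unpack(key):
    return key >> 16, K_MAX_IDX - (key & K_MAX_IDX)

# One max over packed keys implements argmax-with-smallest-index tie-break.
keys = [pack(v, i) for i, v in enumerate([10, 42, 42, 7])]
best_value, best_idx = unpack(max(keys))
```

An index past K_MAX_IDX would spill into the value bits, which is exactly why the review asks for the range to be documented or asserted.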
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2)
172-176: Intermediate top-K staging looks sound; minor clarity nits

Logic matches the generic reduce path. Consider renaming NumInterTopKPerThread to kNumInterTopKPerThread for the constant, and add a brief comment on the layout.
Also applies to: 184-226
631-636: Dynamic threads selection

numThreadsMain chooses 256 vs 384. Consider using getMaxNumExperts(data.mNumExperts) for consistency with the histogram threads.
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)
239-253: Avoid magic epsilon; name it

Replace 1e-20 with a named constant to document intent and avoid magic numbers.
```diff
-    auto finalScore = static_cast<OutputT>(scoreNorm * routedScalingFactor / (redNorm + 1e-20));
+    constexpr float kEPS = 1e-20f;
+    auto finalScore = static_cast<OutputT>(scoreNorm * routedScalingFactor / (redNorm + kEPS));
```

As per coding guidelines
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (1)
407-422: Macro name typo: LAUNCH_ROUTING_RENORNALIZE

Consider renaming to LAUNCH_ROUTING_RENORMALIZE for consistency and grep-ability across files.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh (4 hunks)
- cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1 hunk)
- cpp/tensorrt_llm/kernels/noAuxTcKernels.h (1 hunk)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (2 hunks)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (15 hunks)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (11 hunks)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (6 hunks)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh (3 hunks)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (10 hunks)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (9 hunks)
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (2 hunks)
- cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1 hunk)
- cpp/tensorrt_llm/thop/noAuxTcOp.cpp (2 hunks)
- cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (5 hunks)
- cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1 hunk)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
- tests/unittest/_torch/thop/parallel/test_moe.py (5 hunks)
- tests/unittest/_torch/thop/parallel/test_noaux_tc.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp, cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh, cpp/tensorrt_llm/kernels/noAuxTcKernels.h, cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/noAuxTcOp.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh, cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh, cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h, cpp/tensorrt_llm/kernels/noAuxTcKernels.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp, cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh, cpp/tensorrt_llm/kernels/noAuxTcKernels.h, cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/noAuxTcOp.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh, cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh, cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h, cpp/tensorrt_llm/kernels/noAuxTcKernels.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp, tests/unittest/_torch/thop/parallel/test_moe.py, cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh, tests/unittest/_torch/thop/parallel/test_noaux_tc.py, cpp/tensorrt_llm/kernels/noAuxTcKernels.h, cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/noAuxTcOp.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh, cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh, cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h, tensorrt_llm/_torch/models/modeling_deepseekv3.py, cpp/tensorrt_llm/kernels/noAuxTcKernels.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp, cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h, cpp/tensorrt_llm/kernels/noAuxTcKernels.h, cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/noAuxTcOp.cpp, cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp, tests/unittest/_torch/thop/parallel/test_moe.py, cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh, tests/unittest/_torch/thop/parallel/test_noaux_tc.py, cpp/tensorrt_llm/kernels/noAuxTcKernels.h, cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/noAuxTcOp.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh, cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh, cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h, tensorrt_llm/_torch/models/modeling_deepseekv3.py, cpp/tensorrt_llm/kernels/noAuxTcKernels.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tests/unittest/_torch/thop/parallel/test_moe.py, tests/unittest/_torch/thop/parallel/test_noaux_tc.py, tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h, cpp/tensorrt_llm/kernels/noAuxTcKernels.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh, cpp/tensorrt_llm/kernels/noAuxTcKernels.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh, cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
🧠 Learnings (2)
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
PR: NVIDIA/TensorRT-LLM#7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh, cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
🧬 Code graph analysis (11)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
top_k(233-233)
tests/unittest/_torch/thop/parallel/test_moe.py (5)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (8): param (47-152), param (47-47), param (154-165), param (154-154), param (167-176), param (167-167), param (203-209), param (203-204)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (6): param (39-144), param (39-39), param (146-152), param (146-146), param (184-190), param (184-185)
cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp (2): param (39-91), param (39-39)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1): RoutingMethodType (39-124)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1): RoutingMethodType (143-155)
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1)
cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp (8): param (47-152), param (47-47), param (154-165), param (154-154), param (167-176), param (167-167), param (203-209), param (203-204)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (6): routingTopKExperts (36-39), void (83-240), void (316-319), void (325-384), getMaxNumExperts (388-403), getMaxNumExperts (388-388)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (5): __launch_bounds__ (402-402), getMaxNumExperts (474-485), getMaxNumExperts (474-474), routingIndicesClusterKernel (312-390), routingIndicesClusterKernel (392-392)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (6): __launch_bounds__ (83-83), __launch_bounds__ (325-325), getMaxNumExperts (388-403), getMaxNumExperts (388-388), routingIndicesClusterKernel (245-314), routingIndicesClusterKernel (316-316)
cpp/tensorrt_llm/kernels/noAuxTcKernels.h (1)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (3)
void (44-258), invokeNoAuxTc (261-306), invokeNoAuxTc (261-263)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (2)
top_k (233-233), num_experts (226-226)
cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (2)
invokeNoAuxTc (261-306), invokeNoAuxTc (261-263)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (14): _ (59-75), _ (79-81), _ (84-87), _ (90-95), _ (106-107), _ (110-122), _ (125-130), _ (133-138), _ (141-146), _ (149-161), _ (164-170), _ (173-174), _ (177-180), _ (184-185)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (4): num_experts (226-226), n_group (234-234), top_k (233-233), topk_group (236-236)
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (3)
void (35-277), void (301-304), void (500-503)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (5): void (35-277), void (301-304), void (500-503), getMaxNumExperts (506-525), getMaxNumExperts (506-506)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (6): void (70-307), void (392-395), void (402-471), getMaxNumExperts (474-485), getMaxNumExperts (474-474), routingIndicesHistogramScoresKernel (402-402)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (24)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)
129-130: LGTM! Verify asymmetry is intentional.

The top_k limit increase from 8 to 10 for no-group Renormalize routing is correctly implemented, with both the condition and error message updated consistently.

Note that grouped routing (line 118) retains the top_k <= 8 limit while no-group routing now allows top_k <= 10. This asymmetry appears intentional per the PR objectives, but please confirm this is by design and that the underlying kernel implementations support these different limits.

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (2)
133-133: Define and use a named constant for max experts and verify kernel support

Introduce in fp4BlockScaleMoe.cpp:

```cpp
namespace
{
constexpr int32_t kMAX_NUM_EXPERTS = 512;
} // namespace
```

Then replace the check:

```diff
-    TORCH_CHECK(num_experts <= 512, "num_experts must be less than or equal to 512");
+    TORCH_CHECK(num_experts <= kMAX_NUM_EXPERTS,
+        "num_experts must be ≤ " + std::to_string(kMAX_NUM_EXPERTS));
```

No existing device-code limit for 512 was found; manually confirm that all KernelParams::MaxNumExperts instantiations and __launch_bounds__ accommodate 512 experts.
205-205: Replace magic literal 256 * 2 with a named constant and document its rationale

Define a kMinExpertCountHistogramSize constant and add a comment explaining why the fallback minimum remains 512 instead of scaling with num_experts * 2.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
632-634: Ignore custom op binding mismatch warning

Stub and C++ registrations both accept (scores, scores_with_bias, n_group, topk_group, topk, routed_scaling_factor), matching the call's six arguments; no change needed.

Likely an incorrect or invalid review comment.
582-590: Avoid mutating self.is_fused; use a per-call fused_enabled flag

- Mutating self.is_fused persists fusion state across calls; introduce a local fused_enabled in forward() and update it instead.
- Confirm whether the fused kernels now support top_k > 8 (e.g., up to 10) and adjust the threshold accordingly.

tests/unittest/_torch/thop/parallel/test_moe.py (3)
878-878: LGTM! Configuration aligns with PR objectives.

Adding the (384, 1, 1, 8) configuration properly expands test coverage for the 384-expert support mentioned in the PR objectives for KIMI K2 and Qwen-next models.
1061-1072: LGTM! Test coverage for Qwen-next routing added.

The new test case with 512 experts and top_k=10 correctly validates the expanded routing capabilities for the Renormalize method used by Qwen-next models.
1145-1156: LGTM! Consistent test coverage in no-autotune path.

The duplicate test case in the no-autotune section ensures consistent validation of 512 experts with top_k=10 across both autotuned and non-autotuned code paths.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (3)
120-134: LGTM! Llama4-specific optimization with fixed expert count.

The hardcoded 128 value for Llama4 routing is a valid compile-time optimization. The comment clearly indicates this is intentional and specific to the Llama4 routing method.
175-200: LGTM! Simplified macro variant without forceFloatInput.

The new LAUNCH_ROUTING_WITH_NUM_EXPERTS macro provides a cleaner interface for cases where forceFloatInput is not needed, properly threading the numExperts parameter through all dispatch paths.
136-171: Verified numExperts consistency – all LAUNCH_ROUTING_WITH_NUM_EXPERTS_FORCE_FLOAT_INPUT calls pass the exact compile-time constants (topk::MaxNumExpertsUnit, NumDeepseekExperts, NumKimiK2Experts) returned by getMaxNumExperts and matching the KernelParams::MaxNumExperts template parameter.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (2)
108-113: LGTM! Per-kernel expert count parameterization introduced.

The addition of MaxNumExperts_ as a template parameter enables compile-time specialization based on expert count, which should improve performance for fixed-size configurations like 384 experts.
183-184: LGTM! Consistent template parameter propagation.

All three routing kernel param specializations (routingDeepSeek, routingLlama4, routingRenormalize) correctly inherit the new MaxNumExperts_ parameter from KernelParamsBase and propagate it through their template hierarchies.

Also applies to: 242-243, 284-285
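A minimal host-side sketch of this pattern, assuming the names from the review (KernelParamsBase, MaxNumExperts_, getMaxNumExperts) and the 128/256/384/512 tiers mentioned in the walkthrough:

```cpp
#include <array>
#include <cassert>

// Compile-time parameterization sketch: each specialization fixes its buffer
// sizes from MaxNumExperts_, mirroring the reviewed KernelParamsBase change.
template <int MaxNumExperts_>
struct KernelParamsBase
{
    static constexpr int MaxNumExperts = MaxNumExperts_;
    std::array<int, MaxNumExperts_> expertCounts{}; // sized at compile time
};

// Host-side tier selection in the spirit of getMaxNumExperts(): round the
// runtime expert count up to a supported compile-time tier.
constexpr int getMaxNumExperts(int numExperts)
{
    if (numExperts <= 128) { return 128; }
    if (numExperts <= 256) { return 256; }
    if (numExperts <= 384) { return 384; }
    return 512;
}

static_assert(KernelParamsBase<384>::MaxNumExperts == 384, "tier check");
```

The launcher would then branch on getMaxNumExperts(numExperts) and instantiate the matching KernelParamsBase specialization, so each kernel variant sees a fixed expert count.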
tests/unittest/_torch/thop/parallel/test_noaux_tc.py (1)
11-11: LGTM! Test coverage for 384 experts added.

The addition of (384, 1, 1, 8) to the test parameterization properly validates the 384-expert support mentioned in the PR objectives.
cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (1)
305-336: LGTM! Comprehensive test coverage for large expert counts added.

The three new test cases (BlockLevelParallelizationLargeN, ClusterLevelParallelizationLargeN, DeviceLevelParallelizationLargeN) provide thorough validation of the 512-expert, top_k=10 configuration across different parallelization strategies, ensuring correctness at multiple scaling levels.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (2)
438-469: Block stride assumes blockDim.x == KernelParams::MaxNumExperts

gridBlockOffset/gridStride use KernelParams::MaxNumExperts. Ensure all launches set blockDim.x to match the KernelParams instantiation; otherwise, indexing is off.
Consider asserting at runtime in launcher or documenting this invariant.
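One way to enforce that invariant on the host before launching, as a sketch (the validator and its names are hypothetical, not the repository's API):

```cpp
#include <cassert>

// Sketch: validate the documented invariant blockDim.x == MaxNumExperts in the
// launcher instead of relying on every call site being correct.
template <int MaxNumExperts>
bool validateLaunchConfig(int blockDimX)
{
    // gridBlockOffset/gridStride index with MaxNumExperts, so a mismatched
    // block size would silently address the wrong experts.
    return blockDimX == MaxNumExperts;
}
```

A TLLM_CHECK-style runtime assertion (or at minimum a comment on the launch site) would make the coupling explicit.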
598-605: Last-warp election relies on exact warp count

The condition uses KernelParams::MaxNumExperts / WarpSize - 1. Ensure MaxNumExperts is a multiple of WarpSize (it appears intended). If not, guard with ceil division or an explicit assert.
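The ceil-division guard suggested above could be sketched as follows (kWarpSize and the helper names are assumptions, not the kernel's actual identifiers):

```cpp
#include <cassert>

// Sketch of a last-warp election that stays correct even if MaxNumExperts is
// not a multiple of the warp size.
constexpr int kWarpSize = 32;

constexpr int ceilDiv(int a, int b)
{
    return (a + b - 1) / b;
}

// Index of the warp that should perform the final reduction step.
constexpr int lastWarpIndex(int maxNumExperts)
{
    return ceilDiv(maxNumExperts, kWarpSize) - 1;
}

// With warp-aligned expert counts this matches MaxNumExperts / WarpSize - 1.
static_assert(lastWarpIndex(512) == 512 / kWarpSize - 1, "aligned case agrees");
```

Alternatively, a static_assert(MaxNumExperts % kWarpSize == 0) documents the alignment assumption without changing the arithmetic.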
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (1)
489-510: Explicit constraint checks look good

TopK <= 1, experts <= 128, experts % 4 == 0, padding < 8. LGTM.
Ensure LAUNCH_ROUTING_LLAMA4 instantiates KernelParams with MaxNumExperts==numThreadsHist (128) so launch_bounds and shared array sizes match the launch.
cpp/tensorrt_llm/kernels/noAuxTcKernels.h (1)
28-32: Public API change verified: all invokeNoAuxTc calls and explicit instantiations now include the new bias parameter and match the updated signature.

cpp/tensorrt_llm/thop/noAuxTcOp.cpp (1)
61-79: Casting pattern is OK; ensure contiguity

Ensure scores/bias/topk tensors are contiguous to match kernel assumptions.
Check at the call site, or add:

+ TORCH_CHECK(scores.is_contiguous() && bias.is_contiguous()
+     && topk_values.is_contiguous() && topk_indices.is_contiguous(),
+     "All tensors must be contiguous");

As per coding guidelines.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (1)
575-594: Top-K bounds checks

Good guardrails for MaxNumTopExperts and warp constraints. Confirm topk::MaxNumTopK aligns with test matrices (K up to 8).
cpp/tensorrt_llm/kernels/noAuxTcKernels.cu (1)
18-37: Static constants: scope and naming OK

WARP_SIZE/MaxNumTop* constants conform to the guidelines. LGTM.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (2)
24-34: Top-K increased to 10; ensure downstream buffers account for 10

Confirm all packed Top-K paths allocate 10 slots (weights/ids). Looks consistent here.
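A compile-time way to keep all packed buffers in sync with the new limit, as a sketch (MaxNumTopK matches the constant named in the review; the struct and its layout are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Size every packed top-k buffer from a single constant so the bump from
// 8 to 10 propagates to weights and ids together.
constexpr int MaxNumTopK = 10;

struct PackedTopK
{
    float weights[MaxNumTopK];
    int32_t ids[MaxNumTopK];
};

// Both arrays are 4-byte aligned, so the struct packs with no padding.
static_assert(sizeof(PackedTopK) == MaxNumTopK * (sizeof(float) + sizeof(int32_t)),
    "packed layout must cover every top-k slot");
```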
488-507: Histogram/offset launches use getMaxNumExperts

Good use of per-kernel MaxNumExperts. LGTM.
PR_Github #21230 [ run ] completed with state
325fa13 to 52c02ef (Compare)
MatthiasKohl left a comment
LGTM.
I just had a question about the structure, but this should not hold the PR back, and can always be addressed in a future PR (if it even needs to be addressed).
52c02ef to 4dc2944 (Compare)
/bot run
PR_Github #21587 [ run ] triggered by Bot
PR_Github #21587 [ run ] completed with state
/bot run
PR_Github #21649 [ run ] triggered by Bot
PR_Github #21649 [ run ] completed with state
f155ee3 to 7d74701 (Compare)
/bot run
PR_Github #21711 [ run ] completed with state
/bot run
PR_Github #21739 [ run ] triggered by Bot. Commit:
PR_Github #21739 [ run ] completed with state
/bot run
PR_Github #21743 [ run ] triggered by Bot. Commit:
PR_Github #21743 [ run ] completed with state
/bot run
PR_Github #21747 [ run ] triggered by Bot. Commit:
PR_Github #21747 [ run ] completed with state
…TLASS backend Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
fc17459 to d2202e8 (Compare)
/bot run --disable-fail-fast
PR_Github #21764 [ run ] triggered by Bot. Commit:
PR_Github #21764 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #21781 [ run ] triggered by Bot. Commit:
PR_Github #21781 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #21797 [ run ] triggered by Bot. Commit:
PR_Github #21797 [ run ] completed with state
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
…UTLASS backend); Add support for KimiK2 and Qwen-next (MoE TRTLLM backend) (NVIDIA#7761) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com> Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Summary by CodeRabbit
Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for KIMI K2 and Qwen-next (MoE TRTLLM backend)
We previously fused the PyTorch ops for the routing part of DeepSeek V3; this PR optimizes their performance, as was done for the MoE TRTLLM backend.
It also adds support for KIMI K2 and Qwen-next (MoE TRTLLM backend).
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.