
Conversation

@Wong4j
Collaborator

@Wong4j Wong4j commented Sep 24, 2025

Summary by CodeRabbit

  • New Features

    • Optional FP4 GEMM acceleration via cuBLASLt, with runtime availability detection.
    • New FP4 scaled matrix-multiply op exposed to PyTorch (torch.ops.trtllm.cublas_fp4_scaled_mm).
    • Linear layer option to use the cuBLASLt FP4 path; BF16 output supported.
    • Preserves existing paths when cuBLASLt is unavailable or the option is disabled.
    • (Update Oct 22) Autotune support added: the fastest kernel is selected from the algorithms returned by the cuBLASLt heuristic.
  • Performance

    • Potential speedups for FP4 matrix multiplications on supported GPUs.
  • Tests

    • Added unit and performance tests validating the cuBLASLt FP4 path across shapes and dtypes.

Description

  • New Features
    • Added an optional NVFP4 block-scaled GEMM path via cuBLASLt. It supports heuristic algorithm selection and BF16 output (see the call sketch below).
    • Added a corresponding argument in the linear layer to enable it; it is disabled by default, and the original behavior is preserved when disabled.
    • Added unit tests comparing against the existing CUTLASS implementation; results are consistent.
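
A minimal call sketch for the new op, based only on the registered signature and the [M, N] output described in this PR. The packed-FP4 and block-scale layouts below are illustrative assumptions (real tensors come from the NVFP4 quantization path), and running it requires a build with ENABLE_CUBLASLT_FP4_GEMM and an FP4-capable GPU:

```python
# Sketch only: the signature follows torch.ops.trtllm.cublas_fp4_scaled_mm as
# registered in this PR; tensor layouts are assumptions for illustration.
import torch

M, N, K = 128, 256, 512
# NVFP4 (e2m1) values packed two per byte; e4m3 block scales, one per 16
# elements along K (assumed layout, not necessarily what cuBLASLt expects).
act_fp4 = torch.randint(0, 256, (M, K // 2), dtype=torch.uint8, device="cuda")
weight_fp4 = torch.randint(0, 256, (N, K // 2), dtype=torch.uint8, device="cuda")
act_sf = torch.randint(0, 256, (M, K // 16), dtype=torch.uint8, device="cuda")
weight_sf = torch.randint(0, 256, (N, K // 16), dtype=torch.uint8, device="cuda")
alpha = torch.tensor(1.0, dtype=torch.float32, device="cuda")
beta = torch.tensor(0.0, dtype=torch.float32, device="cuda")

out = torch.ops.trtllm.cublas_fp4_scaled_mm(
    act_fp4, weight_fp4, act_sf, weight_sf, alpha, beta, torch.bfloat16)
print(out.shape, out.dtype)  # torch.Size([128, 256]) torch.bfloat16
```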

Test Coverage

op UT:
pytest -s -o log_cli=true tests/unittest/_torch/thop/parallel/test_fp4_linear.py -k "test_fp4_linear_cublaslt"

model UT:
pytest -s -o log_cli=true "tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4"

Perf

All measurements use autotune to traverse the candidate algorithms and select the fastest kernel (a measurement sketch follows the table).

| Shape (M×N×K) | CUTLASS (μs) | cuBLASLt (μs) | Speedup |
|---|---|---|---|
| 8192×8192×1024 | 57.34 | 60.77 | 0.94x |
| 8192×8192×2048 | 92.83 | 83.62 | 1.11x |
| 8192×8192×4096 | 151.17 | 140.06 | 1.08x |
| 8192×8192×8192 | 267.49 | 250.62 | 1.07x |
| 8192×8192×16384 | 484.86 | 459.97 | 1.05x |
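
A rough sketch of how numbers like those above could be collected (not the PR's exact test code): one warm-up call inside the autotune context lets the tuner pick the fastest cuBLASLt algorithm, then steady-state calls are timed with CUDA events.

```python
# Sketch, assuming tensorrt_llm._torch.autotuner.autotune is usable as a
# context manager (as the unit tests appear to do).
import torch
from tensorrt_llm._torch.autotuner import autotune

def time_cublaslt_fp4_mm(act_fp4, weight_fp4, act_sf, weight_sf, alpha, beta,
                         iterations=1000):
    with autotune():  # tune once so the fastest heuristic result is cached
        torch.ops.trtllm.cublas_fp4_scaled_mm(
            act_fp4, weight_fp4, act_sf, weight_sf, alpha, beta, torch.bfloat16)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iterations):
        torch.ops.trtllm.cublas_fp4_scaled_mm(
            act_fp4, weight_fp4, act_sf, weight_sf, alpha, beta, torch.bfloat16)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iterations  # average μs per call
```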

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch 3 times, most recently from 97b1318 to ccf6e76 Compare September 25, 2025 02:43
@Wong4j Wong4j marked this pull request as ready for review September 25, 2025 02:43
@Wong4j Wong4j requested review from a team as code owners September 25, 2025 02:43
@Wong4j Wong4j requested review from HuiGao-NV and hyukn September 25, 2025 02:43
@coderabbitai
Contributor

coderabbitai bot commented Sep 25, 2025

📝 Walkthrough

Walkthrough

Adds optional cuBLASLt FP4 GEMM support gated by a new CMake option. Wires compile definitions, implements FP4 GEMM in CublasMMWrapper, introduces a Torch extension (C++/Python) for FP4 scaled matmul, adds availability checks, integrates a selectable path into Linear, and provides unit/perf tests.

Changes

  • Build option toggle — cpp/CMakeLists.txt: Adds option ENABLE_CUBLASLT_FP4_GEMM (default ON).
  • Common target compile defs — cpp/tensorrt_llm/common/CMakeLists.txt: Propagates ENABLE_CUBLASLT_FP4_GEMM as a compile definition to common_src when enabled.
  • cuBLASLt FP4 GEMM core (impl) — cpp/tensorrt_llm/common/cublasMMWrapper.cpp: Adds FP4 descriptor handling and scale modes; introduces setFP4GemmConfig and Fp4Gemm under ENABLE_CUBLASLT_FP4_GEMM; updates setGemmConfig to handle FP4 compute/scale types.
  • cuBLASLt FP4 GEMM core (API) — cpp/tensorrt_llm/common/cublasMMWrapper.h: Declares Fp4Gemm and setFP4GemmConfig within ENABLE_CUBLASLT_FP4_GEMM.
  • THOP target wiring — cpp/tensorrt_llm/thop/CMakeLists.txt: Adds cublasFp4ScaledMM.cpp to th_common sources; defines ENABLE_CUBLASLT_FP4_GEMM publicly when enabled.
  • Torch extension: FP4 scaled MM (C++) — cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp, cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h: Implements FP4 scaled matmul via CublasMMWrapper (BF16 out), input validation, workspace/stream setup; registers Torch ops (fragment + CUDA impl); exposes out/inplace-style and factory variants.
  • Python: cuBLASLt availability — tensorrt_llm/_torch/cublaslt_utils.py: Adds IS_CUBLASLT_AVAILABLE flag set based on the presence of torch.ops.trtllm.cublas_fp4_scaled_mm.
  • Python: custom op (fake backend) — tensorrt_llm/_torch/custom_ops/torch_custom_ops.py: Registers fake op trtllm::cublas_fp4_scaled_mm returning an empty tensor of inferred shape/dtype.
  • Python: Linear integration — tensorrt_llm/_torch/modules/linear.py: Adds flag use_cublaslt_nvfp4_blockscaling_mm; creates beta tensor for NVFP4 path; routes to cublas_fp4_scaled_mm when IS_CUBLASLT_AVAILABLE and the flag is set.
  • Tests: FP4 cuBLASLt path — tests/unittest/_torch/thop/parallel/test_fp4_linear.py: Adds cuBLASLt FP4 correctness and perf tests (with nvtx), guarded by architecture checks; updates main to run perf shapes.
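
For reference, the whole path can be compiled out at configure time; assuming a standard CMake invocation, something like `cmake -DENABLE_CUBLASLT_FP4_GEMM=OFF ...` removes the cuBLASLt FP4 sources and compile definitions (the option defaults to ON, as noted above).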

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant Linear as Linear.apply()
  participant Utils as cublaslt_utils.IS_CUBLASLT_AVAILABLE
  participant TorchOp as torch.ops.trtllm.cublas_fp4_scaled_mm
  participant THOP as cublasFp4ScaledMM.cpp
  participant Wrapper as CublasMMWrapper
  participant cuBLASLt as cuBLASLt

  User->>Linear: call with FP4 weights/scales and flag
  Linear->>Utils: check IS_CUBLASLT_AVAILABLE
  alt available and flag True
    Linear->>TorchOp: cublas_fp4_scaled_mm(A,B,scale_a,scale_b,alpha,beta,out_dtype)
    TorchOp->>THOP: dispatch CUDA impl
    THOP->>Wrapper: setFP4GemmConfig(BF16)
    THOP->>Wrapper: Fp4Gemm(transA, transB, M,N,K, A,B,C, scales, alpha,beta)
    Wrapper->>cuBLASLt: create desc, select heuristic, matmul
    cuBLASLt-->>Wrapper: status
    Wrapper-->>THOP: result
    THOP-->>Linear: Tensor
  else fallback
    Linear-->>User: use existing NVFP4 paths
  end
```
```mermaid
sequenceDiagram
  autonumber
  participant Import as Python import
  participant Utils as cublaslt_utils
  participant Torch as torch

  Import->>Utils: import IS_CUBLASLT_AVAILABLE
  Utils->>Torch: check torch.ops.trtllm.cublas_fp4_scaled_mm
  alt op present
    Utils-->>Import: IS_CUBLASLT_AVAILABLE = True
  else
    Utils-->>Import: IS_CUBLASLT_AVAILABLE = False
  end
```
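
The same control flow as the two diagrams, condensed into a hedged Python sketch (attribute names follow the linear.py diff reviewed below; the fallback branch is elided):

```python
# Sketch of the dispatch shown above; not the verbatim linear.py code.
import torch

# Import-time gate (second diagram): True only if the extension was built
# with ENABLE_CUBLASLT_FP4_GEMM and registered the op.
IS_CUBLASLT_AVAILABLE = hasattr(torch.ops.trtllm, "cublas_fp4_scaled_mm")

def apply_nvfp4(module, act_fp4, act_sf):
    """First diagram: route to cuBLASLt only when available and opted in."""
    if IS_CUBLASLT_AVAILABLE and module.use_cublaslt_nvfp4_blockscaling_mm:
        return torch.ops.trtllm.cublas_fp4_scaled_mm(
            act_fp4, module.weight, act_sf, module.weight_scale,
            module.alpha, module.beta, module.dtype)
    # else-branch of the diagram: the pre-existing CUTLASS NVFP4 path
    # (torch.ops.trtllm.nvfp4_gemm); its argument order is omitted here.
    raise NotImplementedError("existing NVFP4 path elided in this sketch")
```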

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • liji-nv
  • yuxianq
  • Kefeng-Duan

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 19.23%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed — The title follows the required format and concisely summarizes the main change, using a valid NVBugs ID and the [feat] type to clearly indicate that cuBLASLt NVFP4 GEMM backend support is being added.
  • Description Check ✅ Passed — The pull request description is well structured and includes all major sections from the template. The Description section clearly explains the new NVFP4 block-scaled GEMM path, noting it is optional and disabled by default to preserve backward compatibility. The Test Coverage section is comprehensive, providing pytest commands for both the operator and model unit tests, screenshots of the results, and a performance comparison between CUTLASS and cuBLASLt. The PR Checklist mirrors the template and shows the submitter has reviewed each item. The only minor limitation is that the PR title format is not shown in the description body (though it appears in the PR objectives), which is not critical since the title is managed separately in GitHub's PR interface.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h (1)

1-34: Add project-required include guards (replace pragma once)

Headers must use named guards per guideline (TRTLLM_<FILE_NAME>_H). Replace pragma once with guards.

-#pragma once
+#ifndef TRTLLM_CUBLASFP4SCALEDMM_H
+#define TRTLLM_CUBLASFP4SCALEDMM_H
@@
 } // namespace torch_ext
+
+#endif // TRTLLM_CUBLASFP4SCALEDMM_H

Optionally, add brief Doxygen comments for the two declarations.

🧹 Nitpick comments (9)
tensorrt_llm/_torch/cublaslt_utils.py (2)

11-11: Remove redundant f-string.

No placeholders present.

-        logger.info(f"cuBLASLt FP4 GEMM is available")
+        logger.info("cuBLASLt FP4 GEMM is available")

10-12: Harden op-availability check against AttributeError.

Older/lite builds may not materialize the namespace; guard access.

-    if hasattr(torch.ops.trtllm, 'cublas_fp4_scaled_mm'):
+    trtllm_ns = getattr(torch.ops, "trtllm", None)
+    if trtllm_ns is not None and hasattr(trtllm_ns, "cublas_fp4_scaled_mm"):
         logger.info("cuBLASLt FP4 GEMM is available")
         IS_CUBLASLT_AVAILABLE = True
cpp/tensorrt_llm/common/cublasMMWrapper.h (1)

32-32: Add brief Doxygen for new FP4 APIs.

Meets doc requirements for headers.

 class CublasMMWrapper
 {
 public:
+    //! Configure GEMM types for FP4 inputs (A/B) and given C/output types.
+    //! \param outputType Output (C/D) data type, default BF16.
+    //! Note: Requires CUDA 12.8+ and cuBLASLt.
+#if defined(ENABLE_CUBLASLT_FP4_GEMM) && defined(ENABLE_FP4)
+    void setFP4GemmConfig(cudaDataType_t outputType /* = CUDA_R_16BF */);
+
+    //! Execute scaled FP4 GEMM (block-scales for A/B), using cuBLASLt.
+    //! A/B must be CUDA_R_4F_E2M1, alpha/beta are 32-bit scalars.
+    void Fp4Gemm(cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k,
+        void const* A, int lda, void const* B, int ldb, void* C, int ldc,
+        void const* a_sf, void const* b_sf, float const* alpha, float const* beta);
+#endif
cpp/tensorrt_llm/thop/CMakeLists.txt (1)

48-48: Move cublasFp4ScaledMM.cpp under the feature gate to avoid unnecessary/fragile builds

Compile the source only when ENABLE_CUBLASLT_FP4_GEMM is ON. This prevents accidental build/link failures on configurations where cuBLASLt FP4 isn’t enabled and reduces compilation surface.

Apply:

-  cublasFp4ScaledMM.cpp

And extend the feature gate below as:

 if(ENABLE_CUBLASLT_FP4_GEMM)
-  target_compile_definitions(th_common PUBLIC ENABLE_CUBLASLT_FP4_GEMM)
+  target_compile_definitions(th_common PUBLIC ENABLE_CUBLASLT_FP4_GEMM)
+  target_sources(th_common PRIVATE cublasFp4ScaledMM.cpp)
 endif()
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

477-493: Silence lint for unused fake-op parameters

Prefix unused args to avoid Ruff ARG001 while keeping the signature stable.

-@torch.library.register_fake("trtllm::cublas_fp4_scaled_mm")
+@torch.library.register_fake("trtllm::cublas_fp4_scaled_mm")
 def _(
     mat_a: torch.Tensor,
     mat_b: torch.Tensor,
-    scale_a: torch.Tensor,
-    scale_b: torch.Tensor,
-    alpha: torch.Tensor,
-    beta: torch.Tensor,
+    _scale_a: torch.Tensor,
+    _scale_b: torch.Tensor,
+    _alpha: torch.Tensor,
+    _beta: torch.Tensor,
     out_dtype: torch.dtype = torch.bfloat16,
 ) -> torch.Tensor:
     """Fake tensor implementation for cuBLASLt FP4 GEMM."""
     # Output shape: [M, N] where M = mat_a.size(0), N = mat_b.size(0)
     output_size = [mat_a.size(0), mat_b.size(0)]
     return mat_a.new_empty(output_size, dtype=out_dtype)
tensorrt_llm/_torch/modules/linear.py (1)

781-784: Use a stable dtype fallback to prevent behavior skew across backends

When module.dtype is None, explicitly fall back to input.dtype (matches patterns used elsewhere), avoiding silent BF16 default only on the cuBLASLt path.

-            output = torch.ops.trtllm.cublas_fp4_scaled_mm(
+            output = torch.ops.trtllm.cublas_fp4_scaled_mm(
                 act_fp4, module.weight, act_sf, module.weight_scale,
-                module.alpha, module.beta, module.dtype)
+                module.alpha, module.beta, module.dtype or input.dtype)
tests/unittest/_torch/thop/parallel/test_fp4_linear.py (3)

439-442: Set explicit tolerances for cross-backend comparisons

Small numerical diffs are expected; align with perf test tolerances.

-    torch.testing.assert_close(output_cublaslt, output_cutlass)
+    torch.testing.assert_close(output_cublaslt, output_cutlass, rtol=1e-2, atol=1e-2)

445-453: Unused argument in perf test signature

Prefix with underscore to silence linters or implement cold L2 like the CUTLASS path.

-def cublaslt_fp4_gemm_perf_test(
+def cublaslt_fp4_gemm_perf_test(
     dtype,
     SEQ_LEN,
     OUTPUT_SIZE,
     HIDDEN_SIZE,
     test_ref=True,
-    use_cold_l2_cache=True,
+    _use_cold_l2_cache=True,
     warmup_iterations=2,
     iterations=1000,
 ):

519-525: Rename unused loop variable

Minor lint fix.

-        for i in range(iterations):
+        for _i in range(iterations):
             output_cublaslt = torch.ops.trtllm.cublas_fp4_scaled_mm(
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb60671 and ccf6e76.

📒 Files selected for processing (11)
  • cpp/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/common/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp (4 hunks)
  • cpp/tensorrt_llm/common/cublasMMWrapper.h (2 hunks)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h (1 hunks)
  • tensorrt_llm/_torch/cublaslt_utils.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/modules/linear.py (5 hunks)
  • tests/unittest/_torch/thop/parallel/test_fp4_linear.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • tensorrt_llm/_torch/cublaslt_utils.py
  • tests/unittest/_torch/thop/parallel/test_fp4_linear.py
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • tensorrt_llm/_torch/modules/linear.py
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • tensorrt_llm/_torch/cublaslt_utils.py
  • tests/unittest/_torch/thop/parallel/test_fp4_linear.py
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • tensorrt_llm/_torch/modules/linear.py
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/cublaslt_utils.py
  • tests/unittest/_torch/thop/parallel/test_fp4_linear.py
  • tensorrt_llm/_torch/modules/linear.py
🧠 Learnings (1)
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • cpp/tensorrt_llm/thop/CMakeLists.txt
🧬 Code graph analysis (7)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h (1)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (4)
  • cublas_fp4_scaled_mm_out (103-135)
  • cublas_fp4_scaled_mm_out (103-104)
  • cublas_fp4_scaled_mm (137-149)
  • cublas_fp4_scaled_mm (137-138)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (15)
  • _ (13-56)
  • _ (60-62)
  • _ (65-68)
  • _ (71-76)
  • _ (79-84)
  • _ (87-99)
  • _ (102-107)
  • _ (110-115)
  • _ (118-123)
  • _ (126-138)
  • _ (141-147)
  • _ (150-151)
  • _ (154-157)
  • _ (161-162)
  • _ (165-176)
cpp/tensorrt_llm/common/cublasMMWrapper.h (1)
cpp/tensorrt_llm/common/cublasMMWrapper.cpp (4)
  • Fp4Gemm (529-587)
  • Fp4Gemm (529-531)
  • setFP4GemmConfig (282-285)
  • setFP4GemmConfig (282-282)
tests/unittest/_torch/thop/parallel/test_fp4_linear.py (3)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (2)
  • cublas_fp4_scaled_mm (137-149)
  • cublas_fp4_scaled_mm (137-138)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (11)
  • nvfp4_gemm (435-460)
  • _ (249-302)
  • _ (378-386)
  • _ (464-474)
  • _ (480-492)
  • _ (661-688)
  • _ (721-731)
  • _ (805-815)
  • _ (905-921)
  • _ (1002-1010)
  • _ (1043-1054)
tensorrt_llm/_torch/autotuner.py (1)
  • autotune (204-215)
cpp/tensorrt_llm/common/cublasMMWrapper.cpp (1)
cpp/tensorrt_llm/common/cublasMMWrapper.h (8)
  • mAType (38-38)
  • mBType (39-39)
  • mCType (40-40)
  • mComputeType (41-41)
  • mScaleType (42-42)
  • mADesc (45-158)
  • mBDesc (46-156)
  • mCDesc (47-154)
tensorrt_llm/_torch/modules/linear.py (2)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (2)
  • cublas_fp4_scaled_mm (137-149)
  • cublas_fp4_scaled_mm (137-138)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • nvfp4_gemm (435-460)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (2)
cpp/tensorrt_llm/common/cublasMMWrapper.cpp (3)
  • CublasMMWrapper (31-38)
  • CublasMMWrapper (40-40)
  • CublasMMWrapper (42-47)
cpp/tensorrt_llm/common/cublasMMWrapper.h (1)
  • getCublasHandle (145-148)
🪛 Ruff (0.13.1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

483-483: Unused function argument: scale_a

(ARG001)


484-484: Unused function argument: scale_b

(ARG001)


485-485: Unused function argument: alpha

(ARG001)


486-486: Unused function argument: beta

(ARG001)

tensorrt_llm/_torch/cublaslt_utils.py

11-11: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/unittest/_torch/thop/parallel/test_fp4_linear.py

450-450: Unused function argument: use_cold_l2_cache

(ARG001)


522-522: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
cpp/tensorrt_llm/common/cublasMMWrapper.cpp (2)

295-301: Guard FP4 branch in setGemmConfig to avoid referencing CUDA_R_4F_E2M1 on older toolkits.

Keeps builds green without FP4.

-    if (mAType == CUDA_R_4F_E2M1)
-    {
-        // for cublaslt nvfp4 gemm, fp32 compute type and fp32 scale type are required
-        mComputeType = CUBLAS_COMPUTE_32F;
-        mScaleType = CUDA_R_32F;
-    }
-    else if (isFp16ComputeType)
+#ifdef ENABLE_FP4
+    if (mAType == CUDA_R_4F_E2M1)
+    {
+        // for cublaslt nvfp4 gemm, fp32 compute type and fp32 scale type are required
+        mComputeType = CUBLAS_COMPUTE_32F;
+        mScaleType = CUDA_R_32F;
+    }
+    else
+#endif
+    if (isFp16ComputeType)
⛔ Skipped due to learnings
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

83-114: Guard FP4 scale-mode attributes with ENABLE_FP4.

Scale enums are only available with newer cuBLASLt; guard them.

-    // Set scaling modes for FP4 GEMM
-    if (mAType == CUDA_R_4F_E2M1)
-    {
-        // Set scaling mode - cuBLASLt requires e4m3 format scaling factors
-        cublasLtMatmulMatrixScale_t AScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3;
-        cublasLtMatmulMatrixScale_t BScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3;
-        cublasLtMatmulMatrixScale_t CScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
-        cublasLtMatmulMatrixScale_t DScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
-        cublasLtMatmulMatrixScale_t DOutScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
+    // Set scaling modes for FP4 GEMM
+#ifdef ENABLE_FP4
+    if (mAType == CUDA_R_4F_E2M1)
+    {
+        // cuBLASLt requires e4m3 scale vectors for FP4 inputs
+        cublasLtMatmulMatrixScale_t AScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3;
+        cublasLtMatmulMatrixScale_t BScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3;
+        cublasLtMatmulMatrixScale_t CScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
+        cublasLtMatmulMatrixScale_t DScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
+        cublasLtMatmulMatrixScale_t DOutScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
 
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_A_SCALE_MODE, &AScaleMode, sizeof(AScaleMode)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_B_SCALE_MODE, &BScaleMode, sizeof(BScaleMode)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_C_SCALE_MODE, &CScaleMode, sizeof(CScaleMode)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_D_SCALE_MODE, &DScaleMode, sizeof(DScaleMode)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_D_OUT_SCALE_MODE, &DOutScaleMode, sizeof(DOutScaleMode)));
 
         // Set C/D matrix scale pointers to nullptr
         void const* c_scale_ptr = nullptr;
         void const* d_scale_ptr = nullptr;
         void const* d_out_scale_ptr = nullptr;
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_C_SCALE_POINTER, &c_scale_ptr, sizeof(c_scale_ptr)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_D_SCALE_POINTER, &d_scale_ptr, sizeof(d_scale_ptr)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_D_OUT_SCALE_POINTER, &d_out_scale_ptr, sizeof(d_out_scale_ptr)));
     }
+#endif
⛔ Skipped due to learnings
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
tensorrt_llm/_torch/modules/linear.py (4)

26-26: Import of IS_CUBLASLT_AVAILABLE: LGTM

Availability gating import looks correct and localized.


750-752: Initialize beta to zero: LGTM

Zero-initializing beta guarantees that the (possibly uninitialized) C/output buffer never contributes to the result, which it would whenever beta != 0 under GEMM semantics.
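
For reference, cublasLtMatmul computes D = alpha * (A x B) + beta * C, so pinning beta to zero ensures the C buffer does not contribute to the result.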


1816-1818: New flag plumbing for cuBLASLt path: LGTM

Constructor flag is opt-in and defaults to False; preserves original behavior.


1836-1837: State wiring for cuBLASLt flag: LGTM

Field assignment matches constructor argument; consistent with other flags.

@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch from b526993 to 1a57fe8 Compare September 25, 2025 03:13
@Tracin
Collaborator

Tracin commented Sep 26, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20063 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20063 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15115 completed with status: 'SUCCESS'

@rosenrodt rosenrodt self-requested a review October 9, 2025 02:54
Collaborator

@rosenrodt rosenrodt left a comment


I left some comments inline. Thank you!

@Wong4j Wong4j requested a review from a team as a code owner October 17, 2025 04:57
@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch 2 times, most recently from 4edf269 to 05bc4b9 Compare October 17, 2025 06:48
@Wong4j
Collaborator Author

Wong4j commented Oct 17, 2025

/bot run

@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch from e21b5a3 to 59f1a5d Compare October 17, 2025 07:12
@Wong4j
Collaborator Author

Wong4j commented Oct 17, 2025

/bot run

@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch from 59f1a5d to f76ba94 Compare October 21, 2025 02:43
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch from b4a3a25 to d6be901 Compare October 21, 2025 02:51
Collaborator

@hyukn hyukn left a comment


LGTM. I suggest enabling the DEBUG logger in the UT to check that tuning happens as expected and that the AutoTuner cache is filled with the correct shapes.

Signed-off-by: Shijie Wang <jaywan@nvidia.com>
@rosenrodt
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21989 [ run ] triggered by Bot. Commit: e69138a

@rosenrodt rosenrodt self-requested a review October 21, 2025 07:01
Collaborator

@rosenrodt rosenrodt left a comment


PR looks good and shows perf improvement. Thanks a lot and good work!

@tensorrt-cicd
Collaborator

PR_Github #21989 [ run ] completed with state SUCCESS. Commit: e69138a
/LLM/main/L0_MergeRequest_PR pipeline #16579 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@MartinMarciniszyn MartinMarciniszyn enabled auto-merge (squash) October 23, 2025 07:45
Collaborator

@HuiGao-NV HuiGao-NV left a comment


LGTM

@MartinMarciniszyn MartinMarciniszyn merged commit 928247a into NVIDIA:main Oct 23, 2025
9 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Oct 24, 2025
NVIDIA#7943)

Signed-off-by: Shijie Wang <jaywan@nvidia.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
