
Conversation

@Wong4j
Collaborator

@Wong4j Wong4j commented Sep 24, 2025

Summary by CodeRabbit

  • New Features

    • Optional FP4 GEMM acceleration via cuBLASLt, with runtime availability detection.
    • New FP4 scaled matrix-multiply op exposed to PyTorch (torch.ops.trtllm.cublas_fp4_scaled_mm).
    • Linear layer option to use the cuBLASLt FP4 path; BF16 output supported.
    • Preserves existing paths when cuBLASLt is unavailable or the option is disabled.
    • (Update Oct 22) Autotune support added: the fastest kernel is selected from the algorithms returned by the cuBLASLt heuristic.
  • Performance

    • Potential speedups for FP4 matrix multiplications on supported GPUs.
  • Tests

    • Added unit and performance tests validating the cuBLASLt FP4 path across shapes and dtypes.

Description

  • New Features
    • Added an optional NVFP4 block-scaled GEMM path via cuBLASLt. It supports heuristic algorithm selection and BF16 output (see the call sketch below).
    • Added a corresponding argument in the linear layer to enable it; it is disabled by default, and the original behavior is preserved when disabled.
    • Added unit tests comparing against the existing CUTLASS implementation; results are consistent.
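
A minimal call sketch for the new op, based only on the registered signature and the [M, N] output described in this PR. The packed-FP4 and block-scale layouts below are illustrative assumptions (real tensors come from the NVFP4 quantization path), and running it requires a build with ENABLE_CUBLASLT_FP4_GEMM and an FP4-capable GPU:

```python
# Sketch only: the signature follows torch.ops.trtllm.cublas_fp4_scaled_mm as
# registered in this PR; tensor layouts are assumptions for illustration.
import torch

M, N, K = 128, 256, 512
# NVFP4 (e2m1) values packed two per byte; e4m3 block scales, one per 16
# elements along K (assumed layout, not necessarily what cuBLASLt expects).
act_fp4 = torch.randint(0, 256, (M, K // 2), dtype=torch.uint8, device="cuda")
weight_fp4 = torch.randint(0, 256, (N, K // 2), dtype=torch.uint8, device="cuda")
act_sf = torch.randint(0, 256, (M, K // 16), dtype=torch.uint8, device="cuda")
weight_sf = torch.randint(0, 256, (N, K // 16), dtype=torch.uint8, device="cuda")
alpha = torch.tensor(1.0, dtype=torch.float32, device="cuda")
beta = torch.tensor(0.0, dtype=torch.float32, device="cuda")

out = torch.ops.trtllm.cublas_fp4_scaled_mm(
    act_fp4, weight_fp4, act_sf, weight_sf, alpha, beta, torch.bfloat16)
print(out.shape, out.dtype)  # torch.Size([128, 256]) torch.bfloat16
```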

Test Coverage

op UT:
pytest -s -o log_cli=true tests/unittest/_torch/thop/parallel/test_fp4_linear.py -k "test_fp4_linear_cublaslt"

model UT:
pytest -s -o log_cli=true "tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4"

Perf

All measurements use autotune to traverse the candidate algorithms and select the fastest kernel (a measurement sketch follows the table).

| Shape (M×N×K) | CUTLASS (μs) | cuBLASLt (μs) | Speedup |
|---|---|---|---|
| 8192×8192×1024 | 57.34 | 60.77 | 0.94x |
| 8192×8192×2048 | 92.83 | 83.62 | 1.11x |
| 8192×8192×4096 | 151.17 | 140.06 | 1.08x |
| 8192×8192×8192 | 267.49 | 250.62 | 1.07x |
| 8192×8192×16384 | 484.86 | 459.97 | 1.05x |
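
A rough sketch of how numbers like those above could be collected (not the PR's exact test code): one warm-up call inside the autotune context lets the tuner pick the fastest cuBLASLt algorithm, then steady-state calls are timed with CUDA events.

```python
# Sketch, assuming tensorrt_llm._torch.autotuner.autotune is usable as a
# context manager (as the unit tests appear to do).
import torch
from tensorrt_llm._torch.autotuner import autotune

def time_cublaslt_fp4_mm(act_fp4, weight_fp4, act_sf, weight_sf, alpha, beta,
                         iterations=1000):
    with autotune():  # tune once so the fastest heuristic result is cached
        torch.ops.trtllm.cublas_fp4_scaled_mm(
            act_fp4, weight_fp4, act_sf, weight_sf, alpha, beta, torch.bfloat16)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iterations):
        torch.ops.trtllm.cublas_fp4_scaled_mm(
            act_fp4, weight_fp4, act_sf, weight_sf, alpha, beta, torch.bfloat16)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iterations  # average μs per call
```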

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch 3 times, most recently from 97b1318 to ccf6e76 Compare September 25, 2025 02:43
@Wong4j Wong4j marked this pull request as ready for review September 25, 2025 02:43
@Wong4j Wong4j requested review from a team as code owners September 25, 2025 02:43
@Wong4j Wong4j requested review from HuiGao-NV and hyukn September 25, 2025 02:43
@coderabbitai
Contributor

coderabbitai bot commented Sep 25, 2025

📝 Walkthrough

Walkthrough

Adds optional cuBLASLt FP4 GEMM support gated by a new CMake option. Wires compile definitions, implements FP4 GEMM in CublasMMWrapper, introduces a Torch extension (C++/Python) for FP4 scaled matmul, adds availability checks, integrates a selectable path into Linear, and provides unit/perf tests.

Changes

  • Build option toggle — cpp/CMakeLists.txt: Adds option ENABLE_CUBLASLT_FP4_GEMM (default ON).
  • Common target compile defs — cpp/tensorrt_llm/common/CMakeLists.txt: Propagates ENABLE_CUBLASLT_FP4_GEMM as a compile definition to common_src when enabled.
  • cuBLASLt FP4 GEMM core (impl) — cpp/tensorrt_llm/common/cublasMMWrapper.cpp: Adds FP4 descriptor handling and scale modes; introduces setFP4GemmConfig and Fp4Gemm under ENABLE_CUBLASLT_FP4_GEMM; updates setGemmConfig to handle FP4 compute/scale types.
  • cuBLASLt FP4 GEMM core (API) — cpp/tensorrt_llm/common/cublasMMWrapper.h: Declares Fp4Gemm and setFP4GemmConfig within ENABLE_CUBLASLT_FP4_GEMM.
  • THOP target wiring — cpp/tensorrt_llm/thop/CMakeLists.txt: Adds cublasFp4ScaledMM.cpp to th_common sources; defines ENABLE_CUBLASLT_FP4_GEMM publicly when enabled.
  • Torch extension: FP4 scaled MM (C++) — cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp, cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h: Implements FP4 scaled matmul via CublasMMWrapper (BF16 out), input validation, workspace/stream setup; registers Torch ops (fragment + CUDA impl); exposes out/inplace-style and factory variants.
  • Python: cuBLASLt availability — tensorrt_llm/_torch/cublaslt_utils.py: Adds IS_CUBLASLT_AVAILABLE flag set based on the presence of torch.ops.trtllm.cublas_fp4_scaled_mm.
  • Python: custom op (fake backend) — tensorrt_llm/_torch/custom_ops/torch_custom_ops.py: Registers fake op trtllm::cublas_fp4_scaled_mm returning an empty tensor of inferred shape/dtype.
  • Python: Linear integration — tensorrt_llm/_torch/modules/linear.py: Adds flag use_cublaslt_nvfp4_blockscaling_mm; creates beta tensor for NVFP4 path; routes to cublas_fp4_scaled_mm when IS_CUBLASLT_AVAILABLE and the flag is set.
  • Tests: FP4 cuBLASLt path — tests/unittest/_torch/thop/parallel/test_fp4_linear.py: Adds cuBLASLt FP4 correctness and perf tests (with nvtx), guarded by architecture checks; updates main to run perf shapes.
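
For reference, the whole path can be compiled out at configure time; assuming a standard CMake invocation, something like `cmake -DENABLE_CUBLASLT_FP4_GEMM=OFF ...` removes the cuBLASLt FP4 sources and compile definitions (the option defaults to ON, as noted above).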

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant Linear as Linear.apply()
  participant Utils as cublaslt_utils.IS_CUBLASLT_AVAILABLE
  participant TorchOp as torch.ops.trtllm.cublas_fp4_scaled_mm
  participant THOP as cublasFp4ScaledMM.cpp
  participant Wrapper as CublasMMWrapper
  participant cuBLASLt as cuBLASLt

  User->>Linear: call with FP4 weights/scales and flag
  Linear->>Utils: check IS_CUBLASLT_AVAILABLE
  alt available and flag True
    Linear->>TorchOp: cublas_fp4_scaled_mm(A,B,scale_a,scale_b,alpha,beta,out_dtype)
    TorchOp->>THOP: dispatch CUDA impl
    THOP->>Wrapper: setFP4GemmConfig(BF16)
    THOP->>Wrapper: Fp4Gemm(transA, transB, M,N,K, A,B,C, scales, alpha,beta)
    Wrapper->>cuBLASLt: create desc, select heuristic, matmul
    cuBLASLt-->>Wrapper: status
    Wrapper-->>THOP: result
    THOP-->>Linear: Tensor
  else fallback
    Linear-->>User: use existing NVFP4 paths
  end
```
```mermaid
sequenceDiagram
  autonumber
  participant Import as Python import
  participant Utils as cublaslt_utils
  participant Torch as torch

  Import->>Utils: import IS_CUBLASLT_AVAILABLE
  Utils->>Torch: check torch.ops.trtllm.cublas_fp4_scaled_mm
  alt op present
    Utils-->>Import: IS_CUBLASLT_AVAILABLE = True
  else
    Utils-->>Import: IS_CUBLASLT_AVAILABLE = False
  end
```
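
The same control flow as the two diagrams, condensed into a hedged Python sketch (attribute names follow the linear.py diff reviewed below; the fallback branch is elided):

```python
# Sketch of the dispatch shown above; not the verbatim linear.py code.
import torch

# Import-time gate (second diagram): True only if the extension was built
# with ENABLE_CUBLASLT_FP4_GEMM and registered the op.
IS_CUBLASLT_AVAILABLE = hasattr(torch.ops.trtllm, "cublas_fp4_scaled_mm")

def apply_nvfp4(module, act_fp4, act_sf):
    """First diagram: route to cuBLASLt only when available and opted in."""
    if IS_CUBLASLT_AVAILABLE and module.use_cublaslt_nvfp4_blockscaling_mm:
        return torch.ops.trtllm.cublas_fp4_scaled_mm(
            act_fp4, module.weight, act_sf, module.weight_scale,
            module.alpha, module.beta, module.dtype)
    # else-branch of the diagram: the pre-existing CUTLASS NVFP4 path
    # (torch.ops.trtllm.nvfp4_gemm); its argument order is omitted here.
    raise NotImplementedError("existing NVFP4 path elided in this sketch")
```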

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • liji-nv
  • yuxianq
  • Kefeng-Duan

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 19.23%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed — The title follows the required format and concisely summarizes the main change, using a valid NVBugs ID and the [feat] type to clearly indicate that cuBLASLt NVFP4 GEMM backend support is being added.
  • Description Check ✅ Passed — The pull request description is well structured and includes all major sections from the template. The Description section clearly explains the new NVFP4 block-scaled GEMM path, noting it is optional and disabled by default to preserve backward compatibility. The Test Coverage section is comprehensive, providing pytest commands for both the operator and model unit tests, screenshots of the results, and a performance comparison between CUTLASS and cuBLASLt. The PR Checklist mirrors the template and shows the submitter has reviewed each item. The only minor limitation is that the PR title format is not shown in the description body (though it appears in the PR objectives), which is not critical since the title is managed separately in GitHub's PR interface.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h (1)

1-34: Add project-required include guards (replace pragma once)

Headers must use named guards per guideline (TRTLLM_<FILE_NAME>_H). Replace pragma once with guards.

-#pragma once
+#ifndef TRTLLM_CUBLASFP4SCALEDMM_H
+#define TRTLLM_CUBLASFP4SCALEDMM_H
@@
 } // namespace torch_ext
+
+#endif // TRTLLM_CUBLASFP4SCALEDMM_H

Optionally, add brief Doxygen comments for the two declarations.

🧹 Nitpick comments (9)
tensorrt_llm/_torch/cublaslt_utils.py (2)

11-11: Remove redundant f-string.

No placeholders present.

-        logger.info(f"cuBLASLt FP4 GEMM is available")
+        logger.info("cuBLASLt FP4 GEMM is available")

10-12: Harden op-availability check against AttributeError.

Older/lite builds may not materialize the namespace; guard access.

-    if hasattr(torch.ops.trtllm, 'cublas_fp4_scaled_mm'):
+    trtllm_ns = getattr(torch.ops, "trtllm", None)
+    if trtllm_ns is not None and hasattr(trtllm_ns, "cublas_fp4_scaled_mm"):
         logger.info("cuBLASLt FP4 GEMM is available")
         IS_CUBLASLT_AVAILABLE = True
cpp/tensorrt_llm/common/cublasMMWrapper.h (1)

32-32: Add brief Doxygen for new FP4 APIs.

Meets doc requirements for headers.

 class CublasMMWrapper
 {
 public:
+    //! Configure GEMM types for FP4 inputs (A/B) and given C/output types.
+    //! \param outputType Output (C/D) data type, default BF16.
+    //! Note: Requires CUDA 12.8+ and cuBLASLt.
+#if defined(ENABLE_CUBLASLT_FP4_GEMM) && defined(ENABLE_FP4)
+    void setFP4GemmConfig(cudaDataType_t outputType /* = CUDA_R_16BF */);
+
+    //! Execute scaled FP4 GEMM (block-scales for A/B), using cuBLASLt.
+    //! A/B must be CUDA_R_4F_E2M1, alpha/beta are 32-bit scalars.
+    void Fp4Gemm(cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k,
+        void const* A, int lda, void const* B, int ldb, void* C, int ldc,
+        void const* a_sf, void const* b_sf, float const* alpha, float const* beta);
+#endif
cpp/tensorrt_llm/thop/CMakeLists.txt (1)

48-48: Move cublasFp4ScaledMM.cpp under the feature gate to avoid unnecessary/fragile builds

Compile the source only when ENABLE_CUBLASLT_FP4_GEMM is ON. This prevents accidental build/link failures on configurations where cuBLASLt FP4 isn’t enabled and reduces compilation surface.

Apply:

-  cublasFp4ScaledMM.cpp

And extend the feature gate below as:

 if(ENABLE_CUBLASLT_FP4_GEMM)
-  target_compile_definitions(th_common PUBLIC ENABLE_CUBLASLT_FP4_GEMM)
+  target_compile_definitions(th_common PUBLIC ENABLE_CUBLASLT_FP4_GEMM)
+  target_sources(th_common PRIVATE cublasFp4ScaledMM.cpp)
 endif()
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

477-493: Silence lint for unused fake-op parameters

Prefix unused args to avoid Ruff ARG001 while keeping the signature stable.

-@torch.library.register_fake("trtllm::cublas_fp4_scaled_mm")
+@torch.library.register_fake("trtllm::cublas_fp4_scaled_mm")
 def _(
     mat_a: torch.Tensor,
     mat_b: torch.Tensor,
-    scale_a: torch.Tensor,
-    scale_b: torch.Tensor,
-    alpha: torch.Tensor,
-    beta: torch.Tensor,
+    _scale_a: torch.Tensor,
+    _scale_b: torch.Tensor,
+    _alpha: torch.Tensor,
+    _beta: torch.Tensor,
     out_dtype: torch.dtype = torch.bfloat16,
 ) -> torch.Tensor:
     """Fake tensor implementation for cuBLASLt FP4 GEMM."""
     # Output shape: [M, N] where M = mat_a.size(0), N = mat_b.size(0)
     output_size = [mat_a.size(0), mat_b.size(0)]
     return mat_a.new_empty(output_size, dtype=out_dtype)
tensorrt_llm/_torch/modules/linear.py (1)

781-784: Use a stable dtype fallback to prevent behavior skew across backends

When module.dtype is None, explicitly fall back to input.dtype (matches patterns used elsewhere), avoiding silent BF16 default only on the cuBLASLt path.

-            output = torch.ops.trtllm.cublas_fp4_scaled_mm(
+            output = torch.ops.trtllm.cublas_fp4_scaled_mm(
                 act_fp4, module.weight, act_sf, module.weight_scale,
-                module.alpha, module.beta, module.dtype)
+                module.alpha, module.beta, module.dtype or input.dtype)
tests/unittest/_torch/thop/parallel/test_fp4_linear.py (3)

439-442: Set explicit tolerances for cross-backend comparisons

Small numerical diffs are expected; align with perf test tolerances.

-    torch.testing.assert_close(output_cublaslt, output_cutlass)
+    torch.testing.assert_close(output_cublaslt, output_cutlass, rtol=1e-2, atol=1e-2)

445-453: Unused argument in perf test signature

Prefix with underscore to silence linters or implement cold L2 like the CUTLASS path.

-def cublaslt_fp4_gemm_perf_test(
+def cublaslt_fp4_gemm_perf_test(
     dtype,
     SEQ_LEN,
     OUTPUT_SIZE,
     HIDDEN_SIZE,
     test_ref=True,
-    use_cold_l2_cache=True,
+    _use_cold_l2_cache=True,
     warmup_iterations=2,
     iterations=1000,
 ):

519-525: Rename unused loop variable

Minor lint fix.

-        for i in range(iterations):
+        for _i in range(iterations):
             output_cublaslt = torch.ops.trtllm.cublas_fp4_scaled_mm(
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb60671 and ccf6e76.

📒 Files selected for processing (11)
  • cpp/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/common/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp (4 hunks)
  • cpp/tensorrt_llm/common/cublasMMWrapper.h (2 hunks)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h (1 hunks)
  • tensorrt_llm/_torch/cublaslt_utils.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/modules/linear.py (5 hunks)
  • tests/unittest/_torch/thop/parallel/test_fp4_linear.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • tensorrt_llm/_torch/cublaslt_utils.py
  • tests/unittest/_torch/thop/parallel/test_fp4_linear.py
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • tensorrt_llm/_torch/modules/linear.py
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • tensorrt_llm/_torch/cublaslt_utils.py
  • tests/unittest/_torch/thop/parallel/test_fp4_linear.py
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • tensorrt_llm/_torch/modules/linear.py
  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/cublaslt_utils.py
  • tests/unittest/_torch/thop/parallel/test_fp4_linear.py
  • tensorrt_llm/_torch/modules/linear.py
🧠 Learnings (1)
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.h
  • cpp/tensorrt_llm/common/cublasMMWrapper.cpp
  • cpp/tensorrt_llm/thop/CMakeLists.txt
🧬 Code graph analysis (7)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.h (1)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (4)
  • cublas_fp4_scaled_mm_out (103-135)
  • cublas_fp4_scaled_mm_out (103-104)
  • cublas_fp4_scaled_mm (137-149)
  • cublas_fp4_scaled_mm (137-138)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (15)
  • _ (13-56)
  • _ (60-62)
  • _ (65-68)
  • _ (71-76)
  • _ (79-84)
  • _ (87-99)
  • _ (102-107)
  • _ (110-115)
  • _ (118-123)
  • _ (126-138)
  • _ (141-147)
  • _ (150-151)
  • _ (154-157)
  • _ (161-162)
  • _ (165-176)
cpp/tensorrt_llm/common/cublasMMWrapper.h (1)
cpp/tensorrt_llm/common/cublasMMWrapper.cpp (4)
  • Fp4Gemm (529-587)
  • Fp4Gemm (529-531)
  • setFP4GemmConfig (282-285)
  • setFP4GemmConfig (282-282)
tests/unittest/_torch/thop/parallel/test_fp4_linear.py (3)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (2)
  • cublas_fp4_scaled_mm (137-149)
  • cublas_fp4_scaled_mm (137-138)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (11)
  • nvfp4_gemm (435-460)
  • _ (249-302)
  • _ (378-386)
  • _ (464-474)
  • _ (480-492)
  • _ (661-688)
  • _ (721-731)
  • _ (805-815)
  • _ (905-921)
  • _ (1002-1010)
  • _ (1043-1054)
tensorrt_llm/_torch/autotuner.py (1)
  • autotune (204-215)
cpp/tensorrt_llm/common/cublasMMWrapper.cpp (1)
cpp/tensorrt_llm/common/cublasMMWrapper.h (8)
  • mAType (38-38)
  • mBType (39-39)
  • mCType (40-40)
  • mComputeType (41-41)
  • mScaleType (42-42)
  • mADesc (45-158)
  • mBDesc (46-156)
  • mCDesc (47-154)
tensorrt_llm/_torch/modules/linear.py (2)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (2)
  • cublas_fp4_scaled_mm (137-149)
  • cublas_fp4_scaled_mm (137-138)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • nvfp4_gemm (435-460)
cpp/tensorrt_llm/thop/cublasFp4ScaledMM.cpp (2)
cpp/tensorrt_llm/common/cublasMMWrapper.cpp (3)
  • CublasMMWrapper (31-38)
  • CublasMMWrapper (40-40)
  • CublasMMWrapper (42-47)
cpp/tensorrt_llm/common/cublasMMWrapper.h (1)
  • getCublasHandle (145-148)
🪛 Ruff (0.13.1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

483-483: Unused function argument: scale_a

(ARG001)


484-484: Unused function argument: scale_b

(ARG001)


485-485: Unused function argument: alpha

(ARG001)


486-486: Unused function argument: beta

(ARG001)

tensorrt_llm/_torch/cublaslt_utils.py

11-11: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/unittest/_torch/thop/parallel/test_fp4_linear.py

450-450: Unused function argument: use_cold_l2_cache

(ARG001)


522-522: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
cpp/tensorrt_llm/common/cublasMMWrapper.cpp (2)

295-301: Guard FP4 branch in setGemmConfig to avoid referencing CUDA_R_4F_E2M1 on older toolkits.

Keeps builds green without FP4.

-    if (mAType == CUDA_R_4F_E2M1)
-    {
-        // for cublaslt nvfp4 gemm, fp32 compute type and fp32 scale type are required
-        mComputeType = CUBLAS_COMPUTE_32F;
-        mScaleType = CUDA_R_32F;
-    }
-    else if (isFp16ComputeType)
+#ifdef ENABLE_FP4
+    if (mAType == CUDA_R_4F_E2M1)
+    {
+        // for cublaslt nvfp4 gemm, fp32 compute type and fp32 scale type are required
+        mComputeType = CUBLAS_COMPUTE_32F;
+        mScaleType = CUDA_R_32F;
+    }
+    else
+#endif
+    if (isFp16ComputeType)
⛔ Skipped due to learnings
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

83-114: Guard FP4 scale-mode attributes with ENABLE_FP4.

Scale enums are only available with newer cuBLASLt; guard them.

-    // Set scaling modes for FP4 GEMM
-    if (mAType == CUDA_R_4F_E2M1)
-    {
-        // Set scaling mode - cuBLASLt requires e4m3 format scaling factors
-        cublasLtMatmulMatrixScale_t AScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3;
-        cublasLtMatmulMatrixScale_t BScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3;
-        cublasLtMatmulMatrixScale_t CScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
-        cublasLtMatmulMatrixScale_t DScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
-        cublasLtMatmulMatrixScale_t DOutScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
+    // Set scaling modes for FP4 GEMM
+#ifdef ENABLE_FP4
+    if (mAType == CUDA_R_4F_E2M1)
+    {
+        // cuBLASLt requires e4m3 scale vectors for FP4 inputs
+        cublasLtMatmulMatrixScale_t AScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3;
+        cublasLtMatmulMatrixScale_t BScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3;
+        cublasLtMatmulMatrixScale_t CScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
+        cublasLtMatmulMatrixScale_t DScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
+        cublasLtMatmulMatrixScale_t DOutScaleMode = CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F;
 
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_A_SCALE_MODE, &AScaleMode, sizeof(AScaleMode)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_B_SCALE_MODE, &BScaleMode, sizeof(BScaleMode)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_C_SCALE_MODE, &CScaleMode, sizeof(CScaleMode)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_D_SCALE_MODE, &DScaleMode, sizeof(DScaleMode)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_D_OUT_SCALE_MODE, &DOutScaleMode, sizeof(DOutScaleMode)));
 
         // Set C/D matrix scale pointers to nullptr
         void const* c_scale_ptr = nullptr;
         void const* d_scale_ptr = nullptr;
         void const* d_out_scale_ptr = nullptr;
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_C_SCALE_POINTER, &c_scale_ptr, sizeof(c_scale_ptr)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_D_SCALE_POINTER, &d_scale_ptr, sizeof(d_scale_ptr)));
         check_cuda_error(cublasLtMatmulDescSetAttribute(
             mOperationDesc, CUBLASLT_MATMUL_DESC_D_OUT_SCALE_POINTER, &d_out_scale_ptr, sizeof(d_out_scale_ptr)));
     }
+#endif
⛔ Skipped due to learnings
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
tensorrt_llm/_torch/modules/linear.py (4)

26-26: Import of IS_CUBLASLT_AVAILABLE: LGTM

Availability gating import looks correct and localized.


750-752: Initialize beta to zero: LGTM

Zero-initializing beta guarantees that the (possibly uninitialized) C/output buffer never contributes to the result, which it would whenever beta != 0 under GEMM semantics.
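
For reference, cublasLtMatmul computes D = alpha * (A x B) + beta * C, so pinning beta to zero ensures the C buffer does not contribute to the result.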


1816-1818: New flag plumbing for cuBLASLt path: LGTM

Constructor flag is opt-in and defaults to False; preserves original behavior.


1836-1837: State wiring for cuBLASLt flag: LGTM

Field assignment matches constructor argument; consistent with other flags.

@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch from b526993 to 1a57fe8 Compare September 25, 2025 03:13
@Tracin
Collaborator

Tracin commented Sep 26, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20063 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20063 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15115 completed with status: 'SUCCESS'

@rosenrodt rosenrodt self-requested a review October 9, 2025 02:54
Collaborator

@rosenrodt rosenrodt left a comment


I left some comments inline. Thank you!

@Wong4j Wong4j requested a review from a team as a code owner October 17, 2025 04:57
@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch 2 times, most recently from 4edf269 to 05bc4b9 Compare October 17, 2025 06:48
@Wong4j
Collaborator Author

Wong4j commented Oct 17, 2025

/bot run

@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch from e21b5a3 to 59f1a5d Compare October 17, 2025 07:12
@Wong4j
Collaborator Author

Wong4j commented Oct 17, 2025

/bot run

@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch from 59f1a5d to f76ba94 Compare October 21, 2025 02:43
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
@Wong4j Wong4j force-pushed the integrate_cublaslt_nvfp4_gemm branch from b4a3a25 to d6be901 Compare October 21, 2025 02:51
Collaborator

@hyukn hyukn left a comment


LGTM. I suggest enabling the DEBUG logger in the UT to check that tuning happens as expected and that the AutoTuner cache is filled with the correct shapes.

Signed-off-by: Shijie Wang <jaywan@nvidia.com>
@rosenrodt
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21989 [ run ] triggered by Bot. Commit: e69138a

@rosenrodt rosenrodt self-requested a review October 21, 2025 07:01
Collaborator

@rosenrodt rosenrodt left a comment


PR looks good and shows perf improvement. Thanks a lot and good work!

@tensorrt-cicd
Collaborator

PR_Github #21989 [ run ] completed with state SUCCESS. Commit: e69138a
/LLM/main/L0_MergeRequest_PR pipeline #16579 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@MartinMarciniszyn MartinMarciniszyn enabled auto-merge (squash) October 23, 2025 07:45
Collaborator

@HuiGao-NV HuiGao-NV left a comment


LGTM

@MartinMarciniszyn MartinMarciniszyn merged commit 928247a into NVIDIA:main Oct 23, 2025
9 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Oct 24, 2025
NVIDIA#7943)

Signed-off-by: Shijie Wang <jaywan@nvidia.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
