
Conversation

yuxianq (Collaborator) commented Oct 20, 2025

Summary by CodeRabbit

  • Chores
    • Updated compiled kernel binaries for multi-head attention operations across various data type configurations (BF16, FP16, E4M3, FP32) and GPU architectures (SM80, SM89, SM90). Binary pointer metadata refreshed with new references and sizes.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
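
For example, here are a few illustrative invocations built only from the options documented above (the stage and GPU names are examples, not an exhaustive list):

/bot run --disable-fail-fast
/bot run --stage-list "A10-PyTorch-1"
/bot run --gpu-type "A30, H100_PCIe"
/bot run --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"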

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq yuxianq requested a review from a team as a code owner October 20, 2025 09:35
yuxianq (Collaborator, Author) commented Oct 20, 2025

/bot run --disable-fail-fast

@yuxianq yuxianq requested a review from litaotju October 20, 2025 09:36
coderabbitai bot (Contributor) commented Oct 20, 2025

📝 Walkthrough

This pull request updates Git LFS pointers for 92 compiled CUDA binary files (cubin) used in TensorRT-LLM's fused multi-head attention kernels. Each file's metadata—specifically the blob identifier (oid sha256) and size—has been updated. No source code, logic, or public API changes are present.
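
For reference, each of these .cubin.cpp files is stored as a Git LFS pointer, a three-line text stub of the following form (the digest and size shown here are placeholders, not values from this PR):

version https://git-lfs.github.com/spec/v1
oid sha256:<64-character hexadecimal digest>
size <size in bytes>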

Changes

Cohort / File(s) and Summary

• BF16 Attention Kernels
  Files: cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_bf16_64_128_S_q_kv_128_softmax_tma_ws_sm90.cubin.cpp, fmha_v2_flash_attention_bf16_64_128_S_q_kv_128_tma_ws_sm90.cubin.cpp, fmha_v2_flash_attention_bf16_64_128_S_q_paged_kv_128_*.cubin.cpp (3 variants), fmha_v2_flash_attention_bf16_64_128_S_qkv_128_*.cubin.cpp (4 variants), fmha_v2_flash_attention_bf16_64_32_S_qkv_128_*.cubin.cpp (2 variants)
  Summary: LFS pointer updates with new oid hashes and adjusted file sizes for bf16 precision attention kernels with various configurations (softmax, alibi, softcapping, tma_ws)

• E4M3 SM90 Kernels
  Files: cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_64_256_S_*.cubin.cpp (8 files), fmha_v2_flash_attention_e4m3_64_256_S_qkv_128_sage_*.cubin.cpp (1 file)
  Summary: LFS pointer updates for e4m3 precision kernels targeting the SM90 architecture with various head dimensions (64×256) and feature combinations

• E4M3 SM89 128-bit Kernels
  Files: cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_128_128_S_*.cubin.cpp (9 files)
  Summary: LFS pointer updates for e4m3 fp32 128×128 configuration kernels on SM89 with various paged KV head dimensions and configurations

• E4M3 SM89 64-bit Kernels (Group 1)
  Files: cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_q_kv_*.cubin.cpp (2 files), fmha_v2_flash_attention_e4m3_fp32_64_32_S_q_paged_kv_*.cubin.cpp (6 files)
  Summary: LFS pointer updates for the e4m3 fp32 64×32 configuration with multiple paged KV variants on SM89

• E4M3 SM89 64-bit Kernels (Group 2)
  Files: cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_qkv_*.cubin.cpp (7 files), fmha_v2_flash_attention_e4m3_fp32_64_32_S_qkv_80_sage_*.cubin.cpp (2 files)
  Summary: LFS pointer updates for e4m3 fp32 64×32 qkv kernels with various dimensions and sage output configurations

• FP16 Attention Kernels
  Files: cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_fp16_64_128_S_q_kv_128_softmax_tma_ws_sm90.cubin.cpp, fmha_v2_flash_attention_fp16_64_128_S_q_kv_128_tma_ws_sm90.cubin.cpp, fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_*.cubin.cpp (3 variants), fmha_v2_flash_attention_fp16_64_128_S_qkv_128_*.cubin.cpp (4 variants), fmha_v2_flash_attention_fp16_64_32_S_qkv_128_*.cubin.cpp (2 variants)
  Summary: LFS pointer updates for fp16 precision kernels targeting SM90 with various attention configurations

• FP16+FP32 Hybrid Kernels (64×128)
  Files: cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_fp16_fp32_64_128_S_q_kv_128_softmax_tma_ws_sm90.cubin.cpp, fmha_v2_flash_attention_fp16_fp32_64_128_S_q_kv_128_tma_ws_sm90.cubin.cpp, fmha_v2_flash_attention_fp16_fp32_64_128_S_q_paged_kv_128_*.cubin.cpp (3 variants), fmha_v2_flash_attention_fp16_fp32_64_128_S_qkv_128_*.cubin.cpp (4 variants)
  Summary: LFS pointer updates for hybrid fp16/fp32 kernels with 64×128 tile dimensions

• FP16+FP32 Hybrid Kernels (64×32)
  Files: cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_fp16_fp32_64_32_S_qkv_128_*.cubin.cpp (2 files)
  Summary: LFS pointer updates for hybrid fp16/fp32 kernels with 64×32 tile dimensions

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Rationale: While this PR contains 92 file changes, they are entirely homogeneous—each consists solely of updating Git LFS pointer metadata (oid sha256 and file size) with no code logic, control flow, or functional changes. The repetitive nature of these binary asset updates significantly reduces review complexity. A reviewer can verify the pattern on a few representative files and spot-check that all updates are syntactically correct LFS pointer format changes.

Possibly related PRs

  • NVIDIA/TensorRT-LLM#8364: Updates similar Git LFS pointers for fmha_v2 cubin files including overlapping kernel variants (e.g., fmha_v2_flash_attention_bf16_64_128_S_q_kv_128_softmax_tma_ws_sm90)—likely part of the same kernel rebuild or optimization cycle.

Suggested reviewers

  • PerkzZheng
  • lowsfer
  • Wanli-Jiang

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

• Description Check (⚠️ Warning): The pull request description contains the repository's template structure but lacks critical content in key sections. While the Description and Test Coverage sections are present in the template, they remain empty and are not populated with specific implementation details, the rationale for the changes, or information about test cases. Only a single placeholder checkbox ("Please check this after reviewing the above items as appropriate for this PR") is marked as complete, while the substantive checklist items (such as confirming the PR explanation, coding guidelines compliance, test case provision, and documentation updates) are not addressed. This leaves insufficient information for reviewers to understand what is being changed and why, or what testing validates the upgrade.

✅ Passed checks (2 passed)

• Title Check (✅ Passed): The pull request title "[https://nvbugs/5569081][fix] Upgrade fmha_v2. (cherry-pick from #8364)" follows the required format with an NVBugs ticket ID, a type indicator ([fix]), and a clear summary. The title directly relates to the changeset, which consists entirely of updates to Git LFS pointers for FMHA v2 flash attention kernel binary files across multiple configurations and data types. The stated intent to "upgrade fmha_v2" aligns well with the pattern of version updates observed across all modified files, making the title both specific and accurate.
• Docstring Coverage (✅ Passed): No functions found in the changes. Docstring coverage check skipped.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (13)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_q_paged_kv_96_sm89.cubin.cpp (2)

1-3: Exclude LFS .cpp pointer stubs from compilation and clang analysis to avoid false errors.

Static analysis errors are expected here since these files aren’t C++ sources. Exclude them from compilation/tidy.

Examples:

  • CMake
# Mark LFS pointer stubs as non-compilable sources
file(GLOB CUBIN_POINTERS
  "${CMAKE_SOURCE_DIR}/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/*.cubin.cpp")
set_source_files_properties(${CUBIN_POINTERS} PROPERTIES HEADER_FILE_ONLY ON)
  • Or exclude the cubin directory from your clang-tidy/run-clang-tidy target (path filter/globs).
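
If the build can require CMake 3.27 or newer, the SKIP_LINTING source property is another option. A minimal sketch, reusing the CUBIN_POINTERS glob from the snippet above (the version requirement and glob path are assumptions, not verified against this repository's CMake):

# Skip clang-tidy/cpplint/cppcheck/IWYU for the LFS pointer stubs (requires CMake >= 3.27)
set_source_files_properties(${CUBIN_POINTERS} PROPERTIES SKIP_LINTING ON)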

1-1: License header on .cpp (optional).

These .cpp pointer stubs lack the NVIDIA Apache-2.0 header. Either add a brief header comment or formally exempt this path from the header check to satisfy policy. As per coding guidelines.

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_q_kv_128_sm89.cubin.cpp (1)

1-3: Optional: add a CMake guard to fail fast when LFS content isn’t present.

To avoid hard-to-read compiler errors, add a small configure-time check that errors if a .cubin.cpp still contains the LFS header.

Example snippet you can adapt:

# In the CMake for this target:
function(check_lfs_resolved src)
  file(READ "${src}" _contents LIMIT 64)
  string(FIND "${_contents}" "git-lfs.github.com/spec/v1" _pos)
  if(NOT _pos EQUAL -1)
    message(FATAL_ERROR "Unresolved Git LFS pointer detected in ${src}. Run: git lfs fetch --all && git lfs checkout")
  endif()
endfunction()

# Call for each *.cubin.cpp you add
check_lfs_resolved(${CMAKE_CURRENT_SOURCE_DIR}/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_q_kv_128_sm89.cubin.cpp)
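
A possible follow-up, assuming the same CMake list file and glob path, is to run the guard over every stub at configure time rather than listing files one by one:

# Sketch: apply check_lfs_resolved to all cubin pointer stubs
file(GLOB _cubin_stubs
  "${CMAKE_CURRENT_SOURCE_DIR}/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/*.cubin.cpp")
foreach(_stub IN LISTS _cubin_stubs)
  check_lfs_resolved("${_stub}")
endforeach()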
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_bf16_64_128_S_qkv_128_tma_ws_sm90.cubin.cpp (1)

1-3: Optional: document style exception for generated cubin sources and exclude from linters.

File naming deviates from lowerCamelCase and content is machine‑generated; consider:

  • Documenting an exception for cubin-generated .cpp under cubin/ in coding guidelines.
  • Excluding these from clang‑tidy/format checks and static analysis to reduce noise and build time.
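
One possible way to exclude them from formatting, assuming the CI toolchain ships a clang-format release recent enough to honor a .clang-format-ignore file (gitignore-style patterns; the path pattern below is an assumption):

# .clang-format-ignore (hypothetical entry)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/*.cubin.cpp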
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_fp16_64_128_S_q_kv_128_softmax_tma_ws_sm90.cubin.cpp (1)

2-3: Binary blob pointer bumped — verify Git LFS smudge in CI.

No functional/code changes. Please ensure CI smudges LFS before compilation and fails if pointer text is present; clang errors shown by static analysis are false positives when analyzing the pointer file.

Use the validation script provided in a sibling comment to gate builds on missing LFS smudge.

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_fp16_fp32_64_128_S_qkv_128_softcapping_sm90.cubin.cpp (1)

2-3: LFS-only update confirmed.

Good to go. Please keep the CI guard to detect any lingering LFS pointers pre-build and confirm .gitattributes tracks these files under LFS.
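
For the .gitattributes check, the expected LFS tracking rule would be a line of this form (the pattern is an assumption; verify against the repository's actual .gitattributes):

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/*.cubin.cpp filter=lfs diff=lfs merge=lfs -text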

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_qkv_128_sage_64_32_32_output_fp16_sm89.cubin.cpp (1)

2-3: Pointer refresh only; ensure packaging picks up new blobs.

Approve. In addition to the LFS CI guard, verify wheel/package/install rules still include this path so the updated cubin ships with release/1.1.

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_bf16_64_128_S_q_paged_kv_128_softcapping_tma_ws_sm90.cubin.cpp (1)

2-3: Paged‑KV cubin pointer updated — run a quick smoke test.

Change is LFS-only. Please run a minimal paged‑KV FMHA smoke test (sm90) to confirm loadability and no version skew with the runtime loader, in addition to the LFS CI guard.

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_qkv_80_sage_64_32_32_output_bf16_sm89.cubin.cpp (1)

2-3: Ada (sm89) variant pointer bump — LGTM.

Approve the LFS pointer bump. Keep the LFS placeholder check in CI to prevent accidental compilation of pointer text.

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_64_256_S_qkv_128_alibi_tma_ws_sm90.cubin.cpp (1)

2-3: Add a pre-build guard to catch LFS pointer text early.

Consider a CI step that fails if any .cubin.cpp begins with the LFS header to avoid false clang errors.

#!/bin/bash
# Fail if any .cubin.cpp still contains unresolved Git LFS pointer text.
set -e
files=$(rg -l '^\s*version https://git-lfs.github.com/spec/v1' --glob '**/*.cubin.cpp' || true)
test -z "$files" || { echo "LFS pointers detected in:"; echo "$files"; exit 1; }
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_fp16_64_128_S_qkv_128_sm90.cubin.cpp (1)

2-3: Approved — LFS metadata only.

No code/API change. Ensure CI pulls LFS to avoid analyzing stubs.

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_bf16_64_128_S_q_paged_kv_128_tma_ws_sm90.cubin.cpp (1)

2-3: LGTM: metadata-only change.

No runtime/control-flow changes. Ensure CI ignores .cubin.cpp files for C++ compilation and license headers to avoid false positives.

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_128_128_S_qkv_32_sm89.cubin.cpp (1)

2-2: Silence static-analysis false positives for LFS stubs.

The clang errors are expected for LFS pointer text. Exclude path cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/*.cubin.cpp from compilation/analysis or mark as HEADER_FILE_ONLY in CMake to keep CI green.

tensorrt-cicd (Collaborator) commented:
PR_Github #21887 [ run ] triggered by Bot. Commit: e9e227a

tensorrt-cicd (Collaborator) commented:
PR_Github #21887 [ run ] completed with state SUCCESS. Commit: e9e227a
/LLM/release-1.1/L0_MergeRequest_PR pipeline #196 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@yuxianq yuxianq merged commit 4faa515 into NVIDIA:release/1.1 Oct 21, 2025
8 of 9 checks passed