[https://nvbugs/5569081][fix] Upgrade fmha_v2. (cherry-pick from https://github.com/NVIDIA/TensorRT-LLM/pull/8364) #8499
Conversation
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
/bot run --disable-fail-fast
📝 Walkthrough

This pull request updates Git LFS pointers for 92 compiled CUDA binary (cubin) files used in TensorRT-LLM's fused multi-head attention kernels. Each file's metadata, specifically the blob identifier (oid sha256) and size, has been updated. No source code, logic, or public API changes are present.
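For context, each of these files is stored as a Git LFS pointer, so the diff for every file touches only the two metadata fields below (the oid and size here are placeholders, not values from this PR):

```
version https://git-lfs.github.com/spec/v1
oid sha256:<new 64-hex digest>
size <new byte count>
```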
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

Rationale: While this PR contains 92 file changes, they are entirely homogeneous: each consists solely of updating Git LFS pointer metadata (oid sha256 and file size), with no code logic, control flow, or functional changes. The repetitive nature of these binary asset updates significantly reduces review complexity. A reviewer can verify the pattern on a few representative files and spot-check that all updates are syntactically correct LFS pointer-format changes.
Pre-merge checks and finishing touches: ❌ 1 check failed (warning), ✅ 2 checks passed.
Actionable comments posted: 2
🧹 Nitpick comments (13)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_q_paged_kv_96_sm89.cubin.cpp (2)
1-3: Silence Clang compile/tidy for LFS .cpp pointers to avoid false errors.

Static analysis errors are expected here since these files aren't C++ sources. Exclude them from compilation/tidy.
Examples:
- CMake:

```cmake
# Mark LFS pointer stubs as non-compilable sources
file(GLOB CUBIN_POINTERS
     "${CMAKE_SOURCE_DIR}/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/*.cubin.cpp")
set_source_files_properties(${CUBIN_POINTERS} PROPERTIES HEADER_FILE_ONLY ON)
```
- Or exclude the cubin directory from your clang-tidy/run-clang-tidy target (path filter/globs).
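A possible shape for that filter (a sketch, not the repo's actual lint setup; assumes clang-tidy runs against a compile database in ./build):

```bash
# Lint only real C++ sources; skip the generated LFS pointer stubs under cubin/.
git ls-files 'cpp/**/*.cpp' \
  | grep -v '/cubin/.*\.cubin\.cpp$' \
  | xargs clang-tidy -p build
```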
1-1: License header on .cpp (optional).

These .cpp pointer stubs lack the NVIDIA Apache-2.0 header. Either add a brief header comment or formally exempt this path from the header check to satisfy policy. As per coding guidelines.
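If the header check is script-driven, the exemption can be a simple path filter (illustrative only; the SPDX string and globs are assumptions about how the check is implemented):

```bash
# Flag C++ sources missing the license tag, exempting generated cubin stubs.
missing=$(git ls-files 'cpp/**/*.cpp' \
  | grep -v '/cubin/.*\.cubin\.cpp$' \
  | xargs grep -L 'SPDX-License-Identifier: Apache-2.0' || true)
test -z "$missing" || { echo "Missing license headers:"; echo "$missing"; exit 1; }
```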
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_q_kv_128_sm89.cubin.cpp (1)
1-3: Optional: add a CMake guard to fail fast when LFS content isn't present.

To avoid hard-to-read compiler errors, add a small configure-time check that errors out if a .cubin.cpp still contains the LFS header.
Example snippet you can adapt:
```cmake
# In the CMake for this target:
function(check_lfs_resolved src)
  file(READ "${src}" _contents LIMIT 64)
  string(FIND "${_contents}" "git-lfs.github.com/spec/v1" _pos)
  if(NOT _pos EQUAL -1)
    message(FATAL_ERROR "Unresolved Git LFS pointer detected in ${src}. Run: git lfs fetch --all && git lfs checkout")
  endif()
endfunction()

# Call for each *.cubin.cpp you add
check_lfs_resolved(${CMAKE_CURRENT_SOURCE_DIR}/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_q_kv_128_sm89.cubin.cpp)
```

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_bf16_64_128_S_qkv_128_tma_ws_sm90.cubin.cpp (1)
1-3: Optional: document a style exception for generated cubin sources and exclude them from linters.

File naming deviates from lowerCamelCase and the content is machine-generated; consider:
- Documenting an exception for cubin-generated .cpp under cubin/ in coding guidelines.
- Excluding these from clang‑tidy/format checks and static analysis to reduce noise and build time.
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_fp16_64_128_S_q_kv_128_softmax_tma_ws_sm90.cubin.cpp (1)
2-3: Binary blob pointer bumped; verify Git LFS smudge in CI.

No functional/code changes. Please ensure CI smudges LFS before compilation and fails if pointer text is present; clang errors shown by static analysis are false positives when analyzing the pointer file.
Use the validation script provided in a sibling comment to gate builds on missing LFS smudge.
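A minimal version of that gate might look like this (assumes git-lfs is installed on the CI runner):

```bash
# Materialize LFS content, then fail if any pointer text survived the smudge.
git lfs fetch --all
git lfs checkout
if grep -rl --include='*.cubin.cpp' 'git-lfs.github.com/spec/v1' cpp/; then
  echo "Unresolved Git LFS pointers found; aborting build." >&2
  exit 1
fi
```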
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_fp16_fp32_64_128_S_qkv_128_softcapping_sm90.cubin.cpp (1)
2-3: LFS-only update confirmed.

Good to go. Please keep the CI guard to detect any lingering LFS pointers pre-build, and confirm .gitattributes tracks these files under LFS.
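The tracking rule should look roughly like this (verify against the repo's actual .gitattributes):

```
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/*.cubin.cpp filter=lfs diff=lfs merge=lfs -text
```

Running git check-attr filter -- <one of the files> should report lfs.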
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_qkv_128_sage_64_32_32_output_fp16_sm89.cubin.cpp (1)
2-3: Pointer refresh only; ensure packaging picks up new blobs.

Approve. In addition to the LFS CI guard, verify wheel/package/install rules still include this path so the updated cubin ships with release/1.1.
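One way to spot-check, after a build, that the refreshed translation unit was actually compiled in (the build-tree layout below is an assumption; adjust to your generator):

```bash
# An object file for the updated stub should exist somewhere in the build tree.
find build -name '*output_fp16_sm89*.o' -print
```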
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_bf16_64_128_S_q_paged_kv_128_softcapping_tma_ws_sm90.cubin.cpp (1)
2-3: Paged-KV cubin pointer updated; run a quick smoke test.

Change is LFS-only. Please run a minimal paged-KV FMHA smoke test (sm90) to confirm loadability and no version skew with the runtime loader, in addition to the LFS CI guard.
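As a cheap pre-test sanity check, the smudged stub can be syntax-compiled in isolation (a sketch; if the generated file pulls in project headers, add the repo's include flags):

```bash
# Confirm the smudged file is valid C++ before the full build/smoke test.
g++ -fsyntax-only -std=c++17 \
  cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_bf16_64_128_S_q_paged_kv_128_softcapping_tma_ws_sm90.cubin.cpp
```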
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_qkv_80_sage_64_32_32_output_bf16_sm89.cubin.cpp (1)
2-3: Ada (sm89) variant pointer bump; LGTM.

Approve the LFS pointer bump. Keep the LFS placeholder check in CI to prevent accidental compilation of pointer text.
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_64_256_S_qkv_128_alibi_tma_ws_sm90.cubin.cpp (1)
2-3: Add a pre-build guard to catch LFS pointer text early.

Consider a CI step that fails if any .cubin.cpp begins with the LFS header, to avoid false clang errors.
```bash
#!/bin/bash
set -e
files=$(rg -l '^\s*version https://git-lfs.github.com/spec/v1' --glob '**/*.cubin.cpp' || true)
test -z "$files" || { echo "LFS pointers detected in:"; echo "$files"; exit 1; }
```

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_fp16_64_128_S_qkv_128_sm90.cubin.cpp (1)
2-3: Approved; LFS metadata only.

No code/API change. Ensure CI pulls LFS to avoid analyzing stubs.
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_bf16_64_128_S_q_paged_kv_128_tma_ws_sm90.cubin.cpp (1)
2-3: LGTM: metadata-only change.

No runtime/control-flow changes. Ensure CI ignores .cubin.cpp files for C++ compilation and license-header checks to avoid false positives.

cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_128_128_S_qkv_32_sm89.cubin.cpp (1)
2-2: Silence static-analysis false positives for LFS stubs.

The clang errors are expected for LFS pointer text. Exclude the path cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/*.cubin.cpp from compilation/analysis, or mark the files as HEADER_FILE_ONLY in CMake, to keep CI green.
...edMultiHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_128_128_S_qkv_32_sm89.cubin.cpp
...iHeadAttention/cubin/fmha_v2_flash_attention_e4m3_fp32_64_32_S_q_paged_kv_128_sm89.cubin.cpp
PR_Github #21887 [ run ] triggered by Bot. Commit:
PR_Github #21887 [ run ] completed with state
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail-fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages, and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
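For example, combining the flags documented above, a pipeline restricted to PyTorch-backend stages on H100 with fail-fast disabled could be triggered by commenting:

```
/bot run --disable-fail-fast --gpu-type "H100_PCIe" --test-backend "pytorch"
```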
kill
kill: Kill all running builds associated with the pull request.
skip
skip --comment COMMENT: Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.