
Conversation

@chang-l
Collaborator

@chang-l chang-l commented Oct 30, 2025

Perf impact:

Without this PR:
concurrency=64; ISL/OSL=1K/2K; DEP=8:

Request Throughput (req/sec):                     0.9469
Total Output Throughput (tokens/sec):             1939.3511
Total Token Throughput (tokens/sec):              2926.8559
Total Latency (ms):                               67585.4940
Average request latency (ms):                     67518.0704
Per User Output Throughput [w/ ctx] (tps/user):   30.3326
Per GPU Output Throughput (tps/gpu):              242.4189

With this PR (~1.02X):
concurrency=64; ISL/OSL=1K/2K; DEP=8:

Request Throughput (req/sec):                     0.9670
Total Output Throughput (tokens/sec):             1980.3425
Total Token Throughput (tokens/sec):              2988.7198
Total Latency (ms):                               66186.5319
Average request latency (ms):                     66121.9604
Per User Output Throughput [w/ ctx] (tps/user):   30.9731
Per GPU Output Throughput (tps/gpu):              247.5428
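(As a sanity check, the ~1.02X figure follows directly from the numbers above: total output throughput 1980.3425 / 1939.3511 ≈ 1.021, and likewise request throughput 0.9670 / 0.9469 ≈ 1.021.)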

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for DeepseekV3.2 attention variant.
  • Refactor

    • Optimized sparse attention backend with improved indexer-based attention pathway for enhanced performance.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
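
For example, the flags documented above can be combined in a single comment to rerun only one stage with fail-fast disabled:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast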

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without care and validation can break the top of tree.

@chang-l chang-l requested review from a team as code owners October 30, 2025 00:15
@chang-l chang-l requested review from hlu1 and symphonylyh October 30, 2025 00:15
@chang-l chang-l changed the title [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) [TRTLLM-8768][perf] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) Oct 30, 2025
@chang-l chang-l requested a review from lfr-0531 October 30, 2025 00:16
@chang-l chang-l marked this pull request as draft October 30, 2025 00:17
@chang-l chang-l changed the title [TRTLLM-8768][perf] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) Oct 30, 2025
@chang-l chang-l changed the title [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM for FP4 ckpt Oct 30, 2025
@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

📝 Walkthrough

The changes refactor the sparse attention backend to externalize indexer_k and indexer_weights parameters, removing corresponding internal weight projection layers from the Indexer class. A new DeepseekV32Attention class is introduced to handle the updated architecture, and the attention module is updated to process the new parameters through the indexer pipeline.

Changes

  • DSA sparse attention backend refactoring — tensorrt_llm/_torch/attention_backend/sparse/dsa.py
    Removes the internal self.wk and self.weights_proj layer initializations from the Indexer class, and updates the DSATrtllmAttention.forward and Indexer.forward signatures to accept external indexer_k and indexer_weights tensors, replacing the previous internal weight-projection pathway.
  • DeepseekV3 model attention architecture — tensorrt_llm/_torch/models/modeling_deepseekv3.py
    Introduces a new DeepseekV32Attention class whose expanded kv_a_proj_with_mqa projection includes the indexer.wk and indexer.weights_proj components. Updates DeepseekV3DecoderLayer.__init__ to instantiate DeepseekV32Attention when config.model_type equals "deepseek_v32", and extends the weight-loading logic for v32 to fuse the indexer-related projections.
  • Attention module integration — tensorrt_llm/_torch/modules/attention.py
    Modifies forward_impl_with_dsa to split the additional tensors (indexer_k, indexer_weights) out of the kv_a_proj_with_mqa output, applies k_norm to indexer_k, and replaces the parallel latent-cache projection with an explicit top-k indexer search using the new external parameters (see the sketch below).
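As a rough illustration of that split, here is a minimal PyTorch sketch under assumed DeepseekV3-style dimensions (all sizes below are placeholders, not values read from the config; the real split sizes live in forward_impl_with_dsa):

```python
import torch

# Hypothetical dimensions for illustration only.
hidden_size = 7168
q_lora_rank = 1536        # q down-projection width
kv_lora_rank = 512        # compressed_kv width
qk_rope_head_dim = 64     # k_pe width
indexer_head_dim = 128    # indexer.wk output width
indexer_n_heads = 64      # indexer.weights_proj output width

# One fused GEMM whose weight is the row-wise concatenation of the four
# original projections (q/kv down-proj, indexer.wk, indexer.weights_proj).
split_sizes = [q_lora_rank, kv_lora_rank, qk_rope_head_dim,
               indexer_head_dim, indexer_n_heads]
fused_a = torch.nn.Linear(hidden_size, sum(split_sizes), bias=False)

hidden_states = torch.randn(4, hidden_size)  # [num_tokens, hidden_size]
q, compressed_kv, k_pe, indexer_k, indexer_weights = \
    fused_a(hidden_states).split(split_sizes, dim=-1)
# indexer_k is then normalized (k_norm) before the top-k indexer search.
```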

Sequence Diagram

```mermaid
sequenceDiagram
    participant Attention as Attention Module
    participant KVProj as kv_a_proj_with_mqa
    participant Indexer as Indexer
    participant QBProj as q_b_proj

    rect rgb(200, 220, 240)
    Note over Attention,KVProj: Before (internal projections)
    Attention->>KVProj: Input hidden_states
    KVProj-->>Attention: q, compressed_kv, k_pe, (internal wk, weights_proj)
    end

    rect rgb(240, 220, 200)
    Note over Attention,Indexer: After (externalized parameters)
    Attention->>KVProj: Input hidden_states
    KVProj-->>Attention: q, compressed_kv, k_pe, indexer_k, indexer_weights
    Attention->>Indexer: indexer_k, indexer_weights, qr, hidden_states
    Indexer-->>Attention: topk_indices
    Attention->>QBProj: q
    QBProj-->>Attention: projected q
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Interconnected changes across three files: The refactoring requires understanding how DSA backend changes integrate with DeepseekV3 model updates and the attention module's forward pass logic.
  • Weight loading fusion logic: The v32 weight assembly in kv_a_proj_with_mqa requires careful verification that the indexer projections are correctly fused with proper dtype consistency (see the sketch after this list).
  • Tensor dimension calculations: The split sizes and concatenation operations in attention.py depend on correct dimension tracking across indexer.head_dim and indexer.n_heads.
  • Conditional model instantiation: DeepseekV3DecoderLayer's new branching logic based on config.model_type requires verification that both branches handle weight loading consistently.
  • Method signature changes: Public-facing forward methods now accept new parameters; verify all call sites pass the correct tensors.
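
To make the fusion concern above concrete, here is a minimal load-time sketch, assuming plain unquantized tensors for readability (the PR targets NVFP4 checkpoints, where each piece must also carry compatible quantization scales; all names and shapes below are illustrative, not the real checkpoint keys):

```python
import torch

hidden_size = 7168  # illustrative hidden size

# Stand-ins for the four per-layer checkpoint tensors; shapes are hypothetical.
w_q_a = torch.randn(1536, hidden_size)          # q down-projection
w_kv_a = torch.randn(512 + 64, hidden_size)     # compressed_kv + k_pe down-projection
w_indexer_k = torch.randn(128, hidden_size)     # indexer.wk
w_weights_proj = torch.randn(64, hidden_size)   # indexer.weights_proj

# Guard against the dtype-consistency pitfall the review calls out: with
# NVFP4, all pieces must be in the same quantized format before fusing.
assert len({t.dtype for t in (w_q_a, w_kv_a, w_indexer_k, w_weights_proj)}) == 1

# Row-wise concatenation builds the fused kv_a_proj_with_mqa weight; the
# output order here must match the split order used in the forward pass.
fused_weight = torch.cat([w_q_a, w_kv_a, w_indexer_k, w_weights_proj], dim=0)
```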

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 12.50%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check ⚠️ Warning — The pull request description is largely incomplete and does not follow the required template. The performance metrics show an approximately 1.02X throughput improvement, but the critical template sections are unfilled: the "Description" section lacks an explanation of what the PR does and why, the "Test Coverage" section lists no test cases, and most "PR Checklist" items are unchecked or unaddressed. Resolution: fill in all required template sections — explain the issue being addressed and the solution implemented, list the test cases that safeguard the changes, address the checklist items, and fold the performance metrics into the description to support the rationale for the changes.

✅ Passed checks (1 passed)

  • Title Check ✅ Passed — The PR title "[TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt" accurately reflects the main change: the indexer K and weight_proj projections are externalized and fused into the forward computation path across the three modified files, particularly for the DeepseekV3.2 attention mechanism. The title is specific and concise without being misleading.


@chang-l chang-l marked this pull request as ready for review October 30, 2025 06:21
@chang-l
Collaborator Author

chang-l commented Oct 30, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #22997 [ run ] triggered by Bot. Commit: 8629342

@tensorrt-cicd
Collaborator

PR_Github #22997 [ run ] completed with state SUCCESS. Commit: 8629342
/LLM/main/L0_MergeRequest_PR pipeline #17338 completed with status: 'FAILURE'

@lfr-0531
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23027 [ run ] triggered by Bot. Commit: eda9f00

@tensorrt-cicd
Collaborator

PR_Github #23027 [ run ] completed with state SUCCESS. Commit: eda9f00
/LLM/main/L0_MergeRequest_PR pipeline #17362 completed with status: 'FAILURE'

@chang-l
Collaborator Author

chang-l commented Oct 30, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23075 [ run ] triggered by Bot. Commit: 67e4bb6

@chang-l chang-l changed the title [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM for FP4 ckpt [TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt Oct 31, 2025
@tensorrt-cicd
Collaborator

PR_Github #23075 [ run ] completed with state SUCCESS. Commit: 67e4bb6
/LLM/main/L0_MergeRequest_PR pipeline #17401 completed with status: 'SUCCESS'

@chang-l
Collaborator Author

chang-l commented Oct 31, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23098 [ run ] triggered by Bot. Commit: 5428e12

@lfr-0531 lfr-0531 requested a review from QiJune October 31, 2025 05:52
@tensorrt-cicd
Collaborator

PR_Github #23098 [ run ] completed with state SUCCESS. Commit: 5428e12
/LLM/main/L0_MergeRequest_PR pipeline #17421 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@chang-l chang-l force-pushed the fuse_fusedA_wlk_wproj_dsv32 branch from 5428e12 to adc7948 on November 3, 2025 18:35
@chang-l
Collaborator Author

chang-l commented Nov 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23431 [ run ] triggered by Bot. Commit: a4be257

@tensorrt-cicd
Collaborator

PR_Github #23431 [ run ] completed with state SUCCESS. Commit: a4be257
/LLM/main/L0_MergeRequest_PR pipeline #17646 completed with status: 'FAILURE'

@chang-l
Collaborator Author

chang-l commented Nov 4, 2025

/bot run

Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
@chang-l chang-l force-pushed the fuse_fusedA_wlk_wproj_dsv32 branch from 059d1d7 to 0665d62 on November 4, 2025 16:48
@chang-l
Collaborator Author

chang-l commented Nov 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23527 [ run ] triggered by Bot. Commit: 0665d62

@tensorrt-cicd
Collaborator

PR_Github #23527 [ run ] completed with state SUCCESS. Commit: 0665d62
/LLM/main/L0_MergeRequest_PR pipeline #17707 completed with status: 'FAILURE'

Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
@chang-l
Collaborator Author

chang-l commented Nov 5, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23610 [ run ] triggered by Bot. Commit: faa5719

@tensorrt-cicd
Collaborator

PR_Github #23610 [ run ] completed with state SUCCESS. Commit: faa5719
/LLM/main/L0_MergeRequest_PR pipeline #17764 completed with status: 'SUCCESS'

@chang-l chang-l merged commit e57d83c into NVIDIA:main Nov 5, 2025
5 checks passed
