
Conversation

@chang-l
Collaborator

@chang-l chang-l commented Oct 30, 2025

Perf impact:

Without this PR:
concurrency=64; ISL/OSL=1K/2K; DEP=8:

Request Throughput (req/sec):                     0.9469
Total Output Throughput (tokens/sec):             1939.3511
Total Token Throughput (tokens/sec):              2926.8559
Total Latency (ms):                               67585.4940
Average request latency (ms):                     67518.0704
Per User Output Throughput [w/ ctx] (tps/user):   30.3326
Per GPU Output Throughput (tps/gpu):              242.4189

With this PR (~1.02X):
concurrency=64; ISL/OSL=1K/2K; DEP=8:

Request Throughput (req/sec):                     0.9670
Total Output Throughput (tokens/sec):             1980.3425
Total Token Throughput (tokens/sec):              2988.7198
Total Latency (ms):                               66186.5319
Average request latency (ms):                     66121.9604
Per User Output Throughput [w/ ctx] (tps/user):   30.9731
Per GPU Output Throughput (tps/gpu):              247.5428
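(As a sanity check, the ~1.02X figure follows directly from the numbers above: total output throughput 1980.3425 / 1939.3511 ≈ 1.021, and likewise request throughput 0.9670 / 0.9469 ≈ 1.021.)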

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for DeepseekV3.2 attention variant.
  • Refactor

    • Optimized sparse attention backend with improved indexer-based attention pathway for enhanced performance.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
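
For example, the flags documented above can be combined in a single comment to rerun only one stage with fail-fast disabled:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast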

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without care and validation can break the top of tree.

@chang-l chang-l requested review from a team as code owners October 30, 2025 00:15
@chang-l chang-l requested review from hlu1 and symphonylyh October 30, 2025 00:15
@chang-l chang-l changed the title [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) [TRTLLM-8768][perf] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) Oct 30, 2025
@chang-l chang-l requested a review from lfr-0531 October 30, 2025 00:16
@chang-l chang-l marked this pull request as draft October 30, 2025 00:17
@chang-l chang-l changed the title [TRTLLM-8768][perf] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) Oct 30, 2025
@chang-l chang-l changed the title [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM (NVFP4 only) [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM for FP4 ckpt Oct 30, 2025
@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

📝 Walkthrough

The changes refactor the sparse attention backend to externalize indexer_k and indexer_weights parameters, removing corresponding internal weight projection layers from the Indexer class. A new DeepseekV32Attention class is introduced to handle the updated architecture, and the attention module is updated to process the new parameters through the indexer pipeline.

Changes

  • DSA sparse attention backend refactoring — tensorrt_llm/_torch/attention_backend/sparse/dsa.py
    Removes the internal self.wk and self.weights_proj layer initializations from the Indexer class, and updates the DSATrtllmAttention.forward and Indexer.forward signatures to accept external indexer_k and indexer_weights tensors, replacing the previous internal weight-projection pathway.
  • DeepseekV3 model attention architecture — tensorrt_llm/_torch/models/modeling_deepseekv3.py
    Introduces a new DeepseekV32Attention class whose expanded kv_a_proj_with_mqa projection includes the indexer.wk and indexer.weights_proj components. Updates DeepseekV3DecoderLayer.__init__ to instantiate DeepseekV32Attention when config.model_type equals "deepseek_v32", and extends the weight-loading logic for v32 to fuse the indexer-related projections.
  • Attention module integration — tensorrt_llm/_torch/modules/attention.py
    Modifies forward_impl_with_dsa to split the additional tensors (indexer_k, indexer_weights) out of the kv_a_proj_with_mqa output, applies k_norm to indexer_k, and replaces the parallel latent-cache projection with an explicit top-k indexer search using the new external parameters (see the sketch below).
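As a rough illustration of that split, here is a minimal PyTorch sketch under assumed DeepseekV3-style dimensions (all sizes below are placeholders, not values read from the config; the real split sizes live in forward_impl_with_dsa):

```python
import torch

# Hypothetical dimensions for illustration only.
hidden_size = 7168
q_lora_rank = 1536        # q down-projection width
kv_lora_rank = 512        # compressed_kv width
qk_rope_head_dim = 64     # k_pe width
indexer_head_dim = 128    # indexer.wk output width
indexer_n_heads = 64      # indexer.weights_proj output width

# One fused GEMM whose weight is the row-wise concatenation of the four
# original projections (q/kv down-proj, indexer.wk, indexer.weights_proj).
split_sizes = [q_lora_rank, kv_lora_rank, qk_rope_head_dim,
               indexer_head_dim, indexer_n_heads]
fused_a = torch.nn.Linear(hidden_size, sum(split_sizes), bias=False)

hidden_states = torch.randn(4, hidden_size)  # [num_tokens, hidden_size]
q, compressed_kv, k_pe, indexer_k, indexer_weights = \
    fused_a(hidden_states).split(split_sizes, dim=-1)
# indexer_k is then normalized (k_norm) before the top-k indexer search.
```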

Sequence Diagram

```mermaid
sequenceDiagram
    participant Attention as Attention Module
    participant KVProj as kv_a_proj_with_mqa
    participant Indexer as Indexer
    participant QBProj as q_b_proj

    rect rgb(200, 220, 240)
    Note over Attention,KVProj: Before (internal projections)
    Attention->>KVProj: Input hidden_states
    KVProj-->>Attention: q, compressed_kv, k_pe, (internal wk, weights_proj)
    end

    rect rgb(240, 220, 200)
    Note over Attention,Indexer: After (externalized parameters)
    Attention->>KVProj: Input hidden_states
    KVProj-->>Attention: q, compressed_kv, k_pe, indexer_k, indexer_weights
    Attention->>Indexer: indexer_k, indexer_weights, qr, hidden_states
    Indexer-->>Attention: topk_indices
    Attention->>QBProj: q
    QBProj-->>Attention: projected q
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Interconnected changes across three files: The refactoring requires understanding how DSA backend changes integrate with DeepseekV3 model updates and the attention module's forward pass logic.
  • Weight loading fusion logic: The v32 weight assembly in kv_a_proj_with_mqa requires careful verification that the indexer projections are correctly fused with proper dtype consistency (see the sketch after this list).
  • Tensor dimension calculations: The split sizes and concatenation operations in attention.py depend on correct dimension tracking across indexer.head_dim and indexer.n_heads.
  • Conditional model instantiation: DeepseekV3DecoderLayer's new branching logic based on config.model_type requires verification that both branches handle weight loading consistently.
  • Method signature changes: Public-facing forward methods now accept new parameters; verify all call sites pass the correct tensors.
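
To make the fusion concern above concrete, here is a minimal load-time sketch, assuming plain unquantized tensors for readability (the PR targets NVFP4 checkpoints, where each piece must also carry compatible quantization scales; all names and shapes below are illustrative, not the real checkpoint keys):

```python
import torch

hidden_size = 7168  # illustrative hidden size

# Stand-ins for the four per-layer checkpoint tensors; shapes are hypothetical.
w_q_a = torch.randn(1536, hidden_size)          # q down-projection
w_kv_a = torch.randn(512 + 64, hidden_size)     # compressed_kv + k_pe down-projection
w_indexer_k = torch.randn(128, hidden_size)     # indexer.wk
w_weights_proj = torch.randn(64, hidden_size)   # indexer.weights_proj

# Guard against the dtype-consistency pitfall the review calls out: with
# NVFP4, all pieces must be in the same quantized format before fusing.
assert len({t.dtype for t in (w_q_a, w_kv_a, w_indexer_k, w_weights_proj)}) == 1

# Row-wise concatenation builds the fused kv_a_proj_with_mqa weight; the
# output order here must match the split order used in the forward pass.
fused_weight = torch.cat([w_q_a, w_kv_a, w_indexer_k, w_weights_proj], dim=0)
```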

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 12.50%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check ⚠️ Warning — The pull request description is largely incomplete and does not follow the required template. The performance metrics show an approximately 1.02X throughput improvement, but the critical template sections are unfilled: the "Description" section lacks an explanation of what the PR does and why, the "Test Coverage" section lists no test cases, and most "PR Checklist" items are unchecked or unaddressed. Resolution: fill in all required template sections — explain the issue being addressed and the solution implemented, list the test cases that safeguard the changes, address the checklist items, and fold the performance metrics into the description to support the rationale for the changes.

✅ Passed checks (1 passed)

  • Title Check ✅ Passed — The PR title "[TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt" accurately reflects the main change: the indexer K and weight_proj projections are externalized and fused into the forward computation path across the three modified files, particularly for the DeepseekV3.2 attention mechanism. The title is specific and concise without being misleading.


@chang-l chang-l marked this pull request as ready for review October 30, 2025 06:21
@chang-l
Collaborator Author

chang-l commented Oct 30, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #22997 [ run ] triggered by Bot. Commit: 8629342

@tensorrt-cicd
Collaborator

PR_Github #22997 [ run ] completed with state SUCCESS. Commit: 8629342
/LLM/main/L0_MergeRequest_PR pipeline #17338 completed with status: 'FAILURE'

@lfr-0531
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23027 [ run ] triggered by Bot. Commit: eda9f00

@tensorrt-cicd
Collaborator

PR_Github #23027 [ run ] completed with state SUCCESS. Commit: eda9f00
/LLM/main/L0_MergeRequest_PR pipeline #17362 completed with status: 'FAILURE'

@chang-l
Collaborator Author

chang-l commented Oct 30, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #23075 [ run ] triggered by Bot. Commit: 67e4bb6

@chang-l chang-l changed the title [TRTLLM-8768][chore] Fuse QK down_proj GEMM with indexer K and weight_proj GEMM for FP4 ckpt [TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt Oct 31, 2025
@tensorrt-cicd
Collaborator

PR_Github #23075 [ run ] completed with state SUCCESS. Commit: 67e4bb6
/LLM/main/L0_MergeRequest_PR pipeline #17401 completed with status: 'SUCCESS'

@chang-l
Collaborator Author

chang-l commented Oct 31, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23098 [ run ] triggered by Bot. Commit: 5428e12

@lfr-0531 lfr-0531 requested a review from QiJune October 31, 2025 05:52
@tensorrt-cicd
Collaborator

PR_Github #23098 [ run ] completed with state SUCCESS. Commit: 5428e12
/LLM/main/L0_MergeRequest_PR pipeline #17421 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@chang-l chang-l force-pushed the fuse_fusedA_wlk_wproj_dsv32 branch from 5428e12 to adc7948 on November 3, 2025 18:35
@chang-l
Collaborator Author

chang-l commented Nov 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23431 [ run ] triggered by Bot. Commit: a4be257

@tensorrt-cicd
Collaborator

PR_Github #23431 [ run ] completed with state SUCCESS. Commit: a4be257
/LLM/main/L0_MergeRequest_PR pipeline #17646 completed with status: 'FAILURE'

@chang-l
Collaborator Author

chang-l commented Nov 4, 2025

/bot run

Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
@chang-l chang-l force-pushed the fuse_fusedA_wlk_wproj_dsv32 branch from 059d1d7 to 0665d62 on November 4, 2025 16:48
@chang-l
Collaborator Author

chang-l commented Nov 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23527 [ run ] triggered by Bot. Commit: 0665d62

@tensorrt-cicd
Collaborator

PR_Github #23527 [ run ] completed with state SUCCESS. Commit: 0665d62
/LLM/main/L0_MergeRequest_PR pipeline #17707 completed with status: 'FAILURE'

Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
@chang-l
Collaborator Author

chang-l commented Nov 5, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23610 [ run ] triggered by Bot. Commit: faa5719

@tensorrt-cicd
Collaborator

PR_Github #23610 [ run ] completed with state SUCCESS. Commit: faa5719
/LLM/main/L0_MergeRequest_PR pipeline #17764 completed with status: 'SUCCESS'

@chang-l chang-l merged commit e57d83c into NVIDIA:main Nov 5, 2025
5 checks passed
