[None][feat] Add customized topk and related unit tests for DSA #8882

ChristinaZ · 2025-11-03T12:20:30Z

Summary by CodeRabbit

Release Notes

New Features
- Added custom CUDA kernel implementation for efficient Top-K indexing operations in sparse attention
- Introduced use_custom_topk parameter enabling optimized Top-K selection path during attention computation
- Supports both prefill and decode phases with specialized kernel implementations
Tests
- Added comprehensive test coverage comparing custom kernel implementations against PyTorch fallback paths across multiple scenarios

Description

Add the customized topk kernels and related unit tests for DSA

Test Coverage

pytest -v -s tests/unittest/_torch/thop/parallel/test_indexer_topk.py
pytest -v -s tests/unittest/_torch/attention/sparse/test_dsa_indexer.py

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-11-03T12:29:33Z

📝 Walkthrough

This PR introduces a custom CUDA top-K indexer kernel for sparse attention in TensorRT-LLM. It includes kernel implementation, Torch extension bindings, integration into the sparse attention backend with optional fallback, and comprehensive test coverage validating the custom kernel against PyTorch fallback paths across prefill and decode scenarios.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CUDA Kernel Implementation `cpp/tensorrt_llm/kernels/IndexerTopK.h`, `cpp/tensorrt_llm/kernels/indexerTopK.cu` | Declares and implements two public top-K indexer kernels: `invokeIndexerTopKDecode` for decoding with sequence lengths, and `invokeIndexerTopKPrefill` for prefilling with explicit row ranges. Kernel uses histogram binning, prefix-sum, and radix-sort operations to compute top-K indices per row with fixed width of 2048. |
| Torch Extension Binding `cpp/tensorrt_llm/thop/IndexerTopKOp.cpp` | Exposes two CUDA top-K operations as Torch extensions: `indexer_topk_decode_op` and `indexer_topk_prefill_op`, validating input tensors, extracting CUDA stream, and invoking corresponding C++ kernels. |
| Build Configuration `cpp/tensorrt_llm/thop/CMakeLists.txt` | Adds `IndexerTopKOp.cpp` to the `th_common` shared library source list. |
| Python Integration `tensorrt_llm/_torch/attention_backend/sparse/dsa.py` | Adds `use_custom_topk` parameter (default True) to `sparse_attn_indexer` method, branching to custom CUDA kernels (`indexer_topk_prefill_op` / `indexer_topk_decode_op`) during prefill/decode when enabled, with PyTorch fallback as alternative. |
| Test Suite `tests/unittest/_torch/attention/sparse/test_dsa_indexer.py` | Enhanced DSA indexer tests with Jaccard similarity-based comparison metrics replacing exact-match assertions, CPU/CUDA fallback handling for sequence lengths, and new test cases validating custom vs. fallback paths across chunked/single-pass prefill and decode scenarios. |
| Top-K Kernel Unit Tests `tests/unittest/_torch/thop/parallel/test_indexer_topk.py` | New test module with parametrized tests for `indexer_topk_decode_op` and `indexer_topk_prefill_op`, including helper functions for logits generation, per-row top-K validation, and cross-validation against PyTorch reference implementations. |
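
The selection strategy named in the kernel summary (histogram binning, a running sum over bin counts, then a sort restricted to the boundary bin) can be illustrated on the CPU. The following is a minimal pure-Python sketch under stated assumptions, not the kernel's code: the name `histogram_topk`, the linear bucketing over the row's min/max, and the 256-bin count are all hypothetical choices made for illustration, since the summary does not describe how the kernel extracts its bins.

```python
import random


def histogram_topk(row, k, num_bins=256):
    """Return indices of the k largest values in `row` (order unspecified).

    Illustrative only: bucketing scheme and bin count are assumptions.
    """
    n = len(row)
    if n <= k:
        # Short rows: every index is selected.
        return list(range(n))
    lo, hi = min(row), max(row)
    if lo == hi:
        # All values tie, so any k indices are a valid answer.
        return list(range(k))

    # 1) Histogram: map each value to a bin; larger values get larger bins.
    scale = (num_bins - 1) / (hi - lo)
    bins = [min(num_bins - 1, int((v - lo) * scale)) for v in row]
    hist = [0] * num_bins
    for b in bins:
        hist[b] += 1

    # 2) Running sum from the top bin down to find the bin holding the k-th value.
    above = 0  # number of elements in bins strictly above the threshold bin
    threshold = num_bins - 1
    for b in range(num_bins - 1, -1, -1):
        if above + hist[b] >= k:
            threshold = b
            break
        above += hist[b]

    # 3) Everything above the threshold bin is selected outright; only the
    #    threshold bin itself needs sorting to fill the remaining slots.
    selected = [i for i, b in enumerate(bins) if b > threshold]
    boundary = [i for i, b in enumerate(bins) if b == threshold]
    boundary.sort(key=lambda i: row[i], reverse=True)
    return selected + boundary[: k - len(selected)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(5000)]
top = histogram_topk(logits, k=2048)
print(len(top), min(logits[i] for i in top))
```

The point of the histogram pass is that only the elements in one bin are ever sorted, which is what makes this shape of algorithm attractive for long rows.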

Sequence Diagram(s)

sequenceDiagram
    actor Python as Python Code
    participant DSA as sparse_attn_indexer<br/>(dsa.py)
    participant Torch as Torch Op<br/>(IndexerTopKOp.cpp)
    participant CUDA as CUDA Kernel<br/>(indexerTopK.cu)

    Python->>DSA: sparse_attn_indexer(metadata, ..., use_custom_topk=True)
    
    alt use_custom_topk == True
        DSA->>Torch: torch.ops.trtllm.indexer_topk_decode_op<br/>(logits, seq_lens, indices, ...)
        Torch->>CUDA: invokeIndexerTopKDecode(logits, seqLens,<br/>outIndices, numRows, ...)
        CUDA-->>Torch: Top-K indices computed
        Torch-->>DSA: Return
    else use_custom_topk == False
        DSA->>DSA: Use PyTorch topk() fallback
    end
    
    DSA-->>Python: topk_indices_buffer
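
The dispatch in the diagram amounts to a flag-guarded branch with a reference implementation as the alternative. A minimal sketch of that shape, in plain Python so it runs without a GPU: `indexer_topk`, `reference_topk`, and the `custom_kernel` callable are hypothetical stand-ins, not the signatures used in `dsa.py` or the registered Torch ops.

```python
import heapq


def reference_topk(row, k):
    """Fallback path: indices of the k largest values, largest first."""
    k = min(k, len(row))
    return heapq.nlargest(k, range(len(row)), key=row.__getitem__)


def indexer_topk(rows, k, use_custom_topk=True, custom_kernel=None):
    """Select top-k indices per row, using the custom path when enabled.

    `custom_kernel` is a placeholder for an optimized implementation; when it
    is absent or the flag is off, the reference path is used instead.
    """
    if use_custom_topk and custom_kernel is not None:
        return [custom_kernel(row, k) for row in rows]
    return [reference_topk(row, k) for row in rows]


rows = [[0.1, 0.9, 0.5], [2.0, 1.0, 3.0]]
print(indexer_topk(rows, k=2, use_custom_topk=False))
```

Keeping both paths behind one entry point is what lets a test call the same function twice, once per flag value, and compare the results.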

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

- CUDA kernel implementation (indexerTopK.cu): Dense algorithmic logic involving histogram binning with custom bin extraction, prefix-sum computation via cub::BlockScan, threshold determination, and final radix-sort. Requires careful validation of edge cases (rows ≤ top-K, bin overflow, shared memory usage).
- Control flow branching (dsa.py): New conditional path on use_custom_topk affects prefill (chunked and non-chunked) and decode stages; careful attention needed to ensure equivalent behavior between branches.
- Test integration (test_dsa_indexer.py): Shift from exact-match to Jaccard similarity metrics introduces new validation semantics; verify similarity threshold appropriateness (95%) and low-similarity diagnostics.
- Cross-layer integration: Torch bindings, C++ kernel invocation, and Python dispatch all depend on correct parameter passing and tensor memory layout.
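
For reference, a Jaccard-based comparison of two top-K index sets might look like the sketch below. Exact-match assertions are presumably too strict here because two correct implementations can resolve ties at the K-th value differently. The helper names and the choice to apply the 95% threshold per row are assumptions; the actual test code may aggregate differently.

```python
def jaccard_similarity(indices_a, indices_b):
    """|A ∩ B| / |A ∪ B| over two collections of indices (order ignored)."""
    a, b = set(indices_a), set(indices_b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def compare_topk(custom, reference, threshold=0.95):
    """Compare per-row index sets.

    Returns (passed, similarities, low_rows), where low_rows lists the rows
    whose similarity falls below `threshold` for diagnostics.
    """
    sims = [jaccard_similarity(c, r) for c, r in zip(custom, reference)]
    low_rows = [i for i, s in enumerate(sims) if s < threshold]
    return (not low_rows), sims, low_rows


passed, sims, low_rows = compare_topk([[1, 2], [3, 4]], [[2, 1], [3, 5]])
print(passed, sims, low_rows)
```

Note that Jaccard similarity penalizes a single differing index twice (once in the intersection, once in the union), so with K = 2048 a 95% threshold tolerates roughly 50 differing indices per row.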

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ❓ Inconclusive | The pull request title uses '[None]' to indicate no linked JIRA ticket, GitHub issue, or NVBugs ID. While this is valid according to the template, it means there is no external issue or requirement reference to provide context or traceability for this feature. The PR description does not reference any issue or document why these customized kernels are needed. | Consider linking a relevant GitHub issue or JIRA ticket if this PR addresses a tracked requirement or performance improvement. If no external issue exists, the PR description could be enhanced by explaining the motivation and benefits of adding these custom kernels (e.g., performance improvements, reduced latency, specific use-case support for DSA). |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title '[None][feat] Add customized topk and related unit tests for DSA' follows the required template format with [None] for no ticket and [feat] for feature type. It clearly describes the main change: adding customized topk kernels and related unit tests for DSA (DeepSeek Sparse Attention). The title is specific, concise, and accurately reflects the primary purpose of the changeset. |
| Description check | ✅ Passed | The pull request description includes a clear summary of what is being added ('Add the customized topk kernels and related unit tests for DSA'), provides specific test coverage commands for the two main test modules added, and marks the PR checklist as complete. However, it lacks detail on the 'why' aspect: the rationale for why these customized kernels are needed compared to alternatives, and how they improve performance or functionality. Despite this gap, the core required sections are present and the description adequately communicates the main changes. |
| Out of Scope Changes check | ✅ Passed | The changeset is focused and well-scoped: it adds new CUDA kernels (IndexerTopK.h/.cu) for top-K indexing, Torch bindings (IndexerTopKOp.cpp), integrates them into the DSA sparse attention backend (dsa.py with a new use_custom_topk parameter), and includes comprehensive unit tests (test_indexer_topk.py and test_dsa_indexer.py updates). CMakeLists.txt was minimally updated to include the new source file. All changes are directly related to the stated objective of adding customized topk kernels and tests for DSA. |

coderabbitai

Actionable comments posted: 9

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d717676 and 8f4f53b.

📒 Files selected for processing (7)

cpp/tensorrt_llm/kernels/IndexerTopK.h (1 hunks)
cpp/tensorrt_llm/kernels/indexerTopK.cu (1 hunks)
cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (1 hunks)
tensorrt_llm/_torch/attention_backend/sparse/dsa.py (4 hunks)
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (3 hunks)
tests/unittest/_torch/thop/parallel/test_indexer_topk.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}