[TRTLLM-8160][feat] Add draft token tree runtime on CDL #8586
Conversation
b0c522d to 040bc9a Compare

/bot run --disable-fail-fast

PR_Github #22532 [ run ] triggered by Bot. Commit:
📝 Walkthrough

This pull request introduces comprehensive tree-based speculative decoding enhancements across TensorRT-LLM. The changes refactor attention metadata interfaces, drafter model logic, and spec-decoding parameter handling to support dynamic tree-based token generation with explicit batch-level and resource management, including new buffer allocation strategies, tree-aware position tracking, and revised sampling paths.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Engine as PyTorchModelEngine
participant AttMeta as AttentionMetadata
participant SpecMeta as Eagle3SpecMetadata
participant TreeMgr as SpecTreeManager
participant Drafter as ChainDrafter
rect rgb(240, 248, 255)
Note over Engine,Drafter: Tree-Based Speculative Decoding Flow (New)
end
Engine->>Engine: _prepare_tp_inputs(resource_manager)
Engine->>SpecMeta: Set request_accepted_path
Engine->>AttMeta: update_spec_dec_param(batch_size, spec_metadata, spec_tree_manager, ...)
AttMeta->>TreeMgr: Retrieve tree structure & buffers
alt Static Tree Path
TreeMgr->>AttMeta: Copy spec_dec_packed_mask, position_offsets
AttMeta->>AttMeta: Populate kv_lens_cuda, seq_lens
else Dynamic Tree Path
AttMeta->>AttMeta: Initialize placeholders for dynamic updates
end
Engine->>Drafter: forward() with spec_tree_manager
Drafter->>TreeMgr: get_generation_lengths(), get_masks(), get_offsets()
Drafter->>Drafter: sample(draft_layer_idx, logits, spec_tree_manager)
alt Tree Sampling
Drafter->>TreeMgr: Retrieve per-layer top_k_list
Drafter->>Drafter: Top-k sampling with tree constraints
else Linear Sampling
Drafter->>Drafter: Greedy/standard sampling
end
Drafter-->>Engine: Draft tokens with tree indices
Engine->>Engine: prepare_for_generation_with_tree_decoding()
Engine->>AttMeta: Update position_ids, masks, and indices per layer
Engine->>SpecMeta: Update gather indices and hidden-state read/write offsets
```
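Read as code, the drafting flow in the diagram is roughly the sketch below; the method names mirror the diagram, but the control flow, arguments, and helper attributes are condensed and partly hypothetical:

```python
# Condensed sketch of one tree-based drafting step (illustrative, not the engine code).
def tree_draft_step(engine, drafter, spec_tree_manager, batch):
    attn_metadata = engine._prepare_tp_inputs(engine.resource_manager)
    attn_metadata.update_spec_dec_param(
        batch_size=batch.batch_size,
        spec_metadata=engine.spec_metadata,
        spec_tree_manager=spec_tree_manager,  # static tree: copy masks/offsets; dynamic: placeholders
    )
    draft_tokens = []
    for draft_layer_idx in range(spec_tree_manager.max_draft_len):
        logits = drafter.forward(batch, attn_metadata)
        # Tree sampling consults the per-layer top_k_list; linear trees fall back to greedy sampling.
        tokens = drafter.sample(draft_layer_idx, logits, spec_tree_manager)
        draft_tokens.append(tokens)
        engine.prepare_for_generation_with_tree_decoding(attn_metadata)
    return draft_tokens
```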
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes

Areas requiring extra attention:
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 23
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
773-799: Fix device/dtype mismatch in advanced indexing, convert scalar tensor to int, and ensure tensor-to-list conversion.

Three bugs confirmed:
- Indexing dtype mismatch (lines 792–798): `eagle_paths` is int32; advanced indexing of CPU tensors requires int64 indices.
- Tensor-to-bool ambiguity (line 801): `cur_accepted_len` is a 0-D tensor; it must be converted to a Python int before comparison.
- Tensor-to-list assignment (lines 825–826): assigning an int32 tensor slice to a list slice stores tensor objects instead of integers; `.tolist()` must be called.

Apply the diff:
```diff
-        all_draft_tokens = torch.tensor(request.py_draft_tokens)  # [max_total_draft_tokens]
-        all_target_tokens = new_tokens_tensor[:, seq_slot, :].squeeze(
-            -1
-        )  # [max_total_draft_tokens]
+        # Host-side CPU tensors, ensure long dtype for indexing
+        all_draft_tokens = torch.as_tensor(request.py_draft_tokens, dtype=torch.long, device="cpu")
+        all_target_tokens = new_tokens_tensor[:, seq_slot, :].squeeze(-1).to(dtype=torch.long, device="cpu")  # [max_total_draft_tokens + 1]
@@
-        for path_idx, path in enumerate(eagle_paths):
-            path_exclude_root = (
-                path[1:] - 1
-            )  # [max_draft_len], '[1:]' since the new_tokens does not contain the root node.
-            # '-1' is the index shift after exclude the root node.
-            draft_tokens_indices = path_exclude_root[path_exclude_root >= 0]  # [max_draft_len]
-            target_tokens_indices = path[path >= 0]  # [max_draft_len + 1]
+        for path_idx, path in enumerate(eagle_paths):
+            # Convert to long for CPU advanced indexing
+            path_long = path.to(dtype=torch.long)
+            path_exclude_root = path_long[1:] - 1  # exclude root; -1 index shift
+            draft_tokens_indices = path_exclude_root[path_exclude_root >= 0]
+            target_tokens_indices = path_long[path_long >= 0]
@@
-            cur_draft_tokens = all_draft_tokens[draft_tokens_indices]
-            cur_target_tokens = all_target_tokens[target_tokens_indices]
+            cur_draft_tokens = all_draft_tokens.index_select(0, draft_tokens_indices)
+            cur_target_tokens = all_target_tokens.index_select(0, target_tokens_indices)
@@
-            cur_accepted_len = torch.cumprod(
-                (cur_draft_tokens == cur_target_tokens[:-1]).int(), dim=-1
-            ).sum()
-
-            # Accepted one more token from the target model.
-            cur_accepted_len += 1
-
-            if cur_accepted_len > longest_accepted_len:
+            cur_accepted_len = int(torch.cumprod(
+                (cur_draft_tokens == cur_target_tokens[:-1]).to(torch.int32), dim=-1
+            ).sum().item()) + 1  # +1 accounts for root
+
+            if cur_accepted_len > longest_accepted_len:
                 longest_accepted_len = cur_accepted_len
                 longest_match_path_idx = path_idx
@@
-        request.py_num_accepted_draft_tokens_indices[: num_accepted_draft_tokens - 1] = (
-            eagle_paths[longest_match_path_idx][1:longest_accepted_len]
-        )  # exclude the root node
+        accepted_indices = eagle_paths[longest_match_path_idx][1:longest_accepted_len].tolist()
+        request.py_num_accepted_draft_tokens_indices[: num_accepted_draft_tokens - 1] = accepted_indices  # exclude root
```
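As a side note, the first two pitfalls are easy to reproduce standalone; a minimal sketch with toy tensors (none of these values come from the PR):

```python
import torch

# A tree path stored as int32, padded with -1 (toy values).
path = torch.tensor([0, 2, 3, -1], dtype=torch.int32)
tokens = torch.tensor([11, 22, 33, 44])

# CPU advanced indexing expects int64 ("long") indices, hence the conversion.
idx = path[path >= 0].to(torch.long)
selected = tokens[idx]  # tensor([11, 33, 44])

# Reductions return 0-D tensors; extract a Python int before comparing or storing.
acc_len = int((selected[:-1] == selected[1:]).int().cumprod(dim=-1).sum().item()) + 1
assert acc_len == 1
```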
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (1)

1-4: Missing NVIDIA Apache-2.0 header (2025)

Per repo guidelines, prepend the standard header.
Apply at file start:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
```

As per coding guidelines.
tensorrt_llm/_torch/speculative/drafting_loops.py (1)
1-1: Missing NVIDIA Apache-2.0 header

Add the required NVIDIA Apache-2.0 header (year 2025).
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1-1: Missing NVIDIA Apache-2.0 header

Add the required NVIDIA Apache-2.0 header (year 2025).
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
1-1: Add required NVIDIA Apache-2.0 header (2025).

File is missing the mandatory license header. Please prepend it.
Apply this diff:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
```

tensorrt_llm/_torch/speculative/spec_tree_manager.py (1)
310-313: The dynamic path indexing bug is confirmed. The code attempts 3D indexing (`[:, i, :]`) on a 2D tensor (`eagle_paths[tree_idx]` has shape `[max_total_draft_tokens + 1, max_draft_len + 1]`), which causes a shape mismatch at assignment. The proposed fix correctly reshapes the nonzero indices to 1D and assigns them row-wise to the 2D tensor.
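A minimal sketch of that row-wise assignment with toy shapes (the `path_mask` name and values are hypothetical; the real buffers live in `SpecTreeManager`):

```python
import torch

eagle_paths_tree = torch.full((4, 3), -1, dtype=torch.long)  # 2-D, like eagle_paths[tree_idx]
path_mask = torch.tensor([[1, 1, 0],
                          [1, 0, 1],
                          [0, 1, 1],
                          [1, 1, 1]])  # one row per path

for i in range(path_mask.shape[0]):
    node_idx = path_mask[i].nonzero().reshape(-1)  # 1-D indices instead of 3-D [:, i, :] indexing
    eagle_paths_tree[i, :node_idx.numel()] = node_idx
```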
🧹 Nitpick comments (13)
tensorrt_llm/_torch/speculative/drafter.py (1)
67-67: Remove useless expression.

`self.max_total_draft_tokens` is a no-op here. Drop it to satisfy linters and avoid confusion.
```diff
-        self.max_total_draft_tokens
```

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
584-585: Ensure contiguous int32 slice for KV lengths before passing to C++ op.

Slicing returns a view; make it contiguous and int32 to match extension expectations.
```diff
-        past_key_value_lengths = attn_metadata.kv_lens_cuda[:len(requests)]
+        past_key_value_lengths = (
+            attn_metadata.kv_lens_cuda.narrow(0, 0, len(requests)).to(torch.int32).contiguous()
+        )
```

Confirm `torch.ops.tensorrt_llm.update_kv_cache_draft_token_location` expects int32 on the same device as other KV tensors.
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (3)
15-25: Make DummyModel.forward fail fast

Use explicit NotImplementedError to surface unintended calls during refactors.
```diff
 class DummyModel(torch.nn.Module):
@@
-    def forward(self, *args, **kwargs) -> torch.Tensor:
-        pass
+    def forward(self, *args, **kwargs) -> torch.Tensor:
+        raise NotImplementedError("DummyModel.forward should not be called in this unit test")
```
54-60: Decouple from external model roots to keep unit test hermetic

Avoid requiring llm_models_root for a path that is not used. Consider a benign default like `os.environ.get("LLM_MODELS_ROOT", "/tmp")` or pass a dummy path.
```diff
-    spec_config = EagleDecodingConfig(
+    spec_config = EagleDecodingConfig(
         max_draft_len=max_draft_len,
         max_total_draft_tokens=max_total_draft_tokens,
-        speculative_model_dir=eagle_model_dir,
+        speculative_model_dir=os.environ.get("LLM_MODELS_ROOT", "/tmp"),
```
61-67: Assertion style and device selection nits
- Prefer torch.equal(output_tokens, ref_new_tokens) for clarity.
- Derive ref tensor device from logits.device to avoid hardcoding CUDA.
```diff
-    assert torch.all(output_tokens == ref_new_tokens)
+    assert torch.equal(output_tokens, ref_new_tokens)
```

And when constructing ref_new_tokens:
```diff
-    ref_new_tokens = torch.tensor([...], device='cuda')
+    ref_new_tokens = torch.tensor([...], device=logits.device)
```

tests/integration/defs/test_e2e.py (1)
2060-2093: Refactor eagle_choices string construction for clarity; remove memory-guard suggestion

The `--eagle_choices` flag is confirmed as supported in `quickstart_advanced.py` (type=str, default=None). However, refactor the eagle_choices JSON construction using `json.dumps()` for consistency with existing codebase patterns (e.g., test_e2e.py line 709) and to reduce manual JSON string errors.

The memory-guard suggestion (skipif marker) is unnecessary: the test suite consistently validates memory requirements post-execution via `_check_mem_usage()`, which is already present and correct in this test (`_check_mem_usage(running_log, [27, 0, 0, 0])`).
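A minimal sketch of the `json.dumps()` construction (the choice values here are illustrative only):

```python
import json

# Hypothetical Eagle tree choices: each inner list is a path of child indices.
eagle_choices = [[0], [0, 0], [0, 1], [1]]

# Build the CLI argument instead of hand-writing the JSON string.
cmd_arg = f"--eagle_choices={json.dumps(eagle_choices)}"
```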
tensorrt_llm/_torch/speculative/eagle3.py (2)

174-178: Ensure paired iterables are the same length

Add an assertion before the loop to guarantee `request_ids` and `seq_lens` have equal length (useful under Py3.8, where `zip(strict=...)` is unavailable).

```diff
@@
 if not self.is_draft_model:
-    for req_id, seq_len in zip(self.request_ids, self.seq_lens):
+    assert len(self.request_ids) == len(self.seq_lens), \
+        "request_ids and seq_lens must be the same length"
+    for req_id, seq_len in zip(self.request_ids, self.seq_lens):
```
197-201: Replace fullwidth parenthesis in comment

Use ASCII `)` to avoid lint failures (RUF003).

```diff
-    # 2）is_first_draft
+    # 2) is_first_draft
```

tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py (3)
17-17: Avoid mutating sys.path in tests

This path hack is brittle in CI. Prefer relying on the test runner's import paths.
```diff
-sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
+# Avoid mutating sys.path; rely on test runner configuration.
```
6-8: Remove unused model path plumbing
`llm_models_root()` / `eagle_model_dir` are not needed; pass a dummy string to `EagleDecodingConfig` to decouple from local assets.

```diff
-from utils.llm_data import llm_models_root
@@
-    models_path = llm_models_root()
-    eagle_model_dir = f"{models_path}/EAGLE3-LLaMA3.1-Instruct-8B"  # It will not actually be used.
+    eagle_model_dir = "unused"
```

Also applies to: 22-24
662-663: Unnecessary unittest entrypoint

This is a pytest-style function test. `unittest.main()` won't discover it; safe to drop to avoid confusion.

```diff
-if __name__ == "__main__":
-    unittest.main()
+# Intentionally no unittest entrypoint; use pytest discovery.
```

tensorrt_llm/_torch/speculative/drafting_loops.py (1)
145-147: Typo in comment

"toshift" → "to shift".
```diff
- 1] - 1  # shape: [next_layer_gen_len_per_req]. -1 is toshift the root node
+ 1] - 1  # shape: [next_layer_gen_len_per_req]. -1 is to shift the root node
```

tensorrt_llm/_torch/speculative/spec_tree_manager.py (1)
324-364: Optional: simplify packed-mask computation with bitshifts.

Avoid pow on int tensors and repeated reshape rebinds; use bit operations for clarity and speed.
Apply this diff:
```diff
-        num_blocks = math.ceil((self.max_total_draft_tokens + 1) / 32)
-        int_tensor = mask_matrix.reshape(
-            -1, num_process_tokens
-        )  # shape: [num_trees * num_process_tokens, num_process_tokens]
-        packed_mask = packed_mask.reshape(
-            -1,
-            num_blocks)  # shape: [num_trees * num_process_tokens, num_blocks]
-
-        for block_idx in range(num_blocks):
-            start_idx = block_idx * 32
-            end_idx = min(start_idx + 32, num_process_tokens)
-            if end_idx < start_idx:
-                break
-            block_bits = int_tensor[:, start_idx:end_idx]
-            weight = torch.pow(
-                2,
-                torch.arange(end_idx - start_idx,
-                             dtype=torch.int32,
-                             device=int_tensor.device))
-            block_value = torch.sum(block_bits * weight, dim=-1)
-            packed_mask[:, block_idx] = block_value
-
-        packed_mask = packed_mask.reshape(num_trees, num_process_tokens,
-                                          num_blocks)
+        num_blocks = math.ceil((self.max_total_draft_tokens + 1) / 32)
+        rows = mask_matrix.reshape(-1, num_process_tokens)
+        out = packed_mask.reshape(-1, num_blocks)
+        for block_idx in range(num_blocks):
+            start = block_idx * 32
+            end = min(start + 32, num_process_tokens)
+            if end <= start:
+                break
+            span = end - start
+            weights = (torch.ones(span, dtype=torch.int32, device=rows.device)
+                       << torch.arange(span, dtype=torch.int32, device=rows.device))
+            out[:, block_idx] = (rows[:, start:end].to(torch.int32) * weights).sum(dim=-1)
```
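A quick self-contained check that the bitshift weighting matches the original `torch.pow(2, arange)` packing (toy mask row, not the manager's buffers):

```python
import torch

bits = torch.tensor([[1, 0, 1, 1]], dtype=torch.int32)  # one toy mask row, LSB first
span = bits.shape[1]
weights = torch.ones(span, dtype=torch.int32) << torch.arange(span, dtype=torch.int32)
packed = (bits * weights).sum(dim=-1)
assert packed.item() == 0b1101  # 1 + 4 + 8, same as the pow(2, arange) weighting
```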
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- cpp/tensorrt_llm/thop/attentionOp.cpp (2 hunks)
- tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
- tensorrt_llm/_torch/attention_backend/trtllm.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (18 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafting_loops.py (3 hunks)
- tensorrt_llm/_torch/speculative/eagle3.py (4 hunks)
- tensorrt_llm/_torch/speculative/interface.py (1 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
- tensorrt_llm/_torch/speculative/spec_tree_manager.py (7 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
- tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py (1 hunks)
- tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py
- tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py
- tensorrt_llm/_torch/attention_backend/interface.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tests/integration/defs/test_e2e.py
- tensorrt_llm/_torch/speculative/spec_tree_manager.py
- cpp/tensorrt_llm/thop/attentionOp.cpp
- tensorrt_llm/_torch/speculative/drafting_loops.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/speculative/eagle3.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py
- tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py
- tensorrt_llm/_torch/attention_backend/interface.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tests/integration/defs/test_e2e.py
- tensorrt_llm/_torch/speculative/spec_tree_manager.py
- cpp/tensorrt_llm/thop/attentionOp.cpp
- tensorrt_llm/_torch/speculative/drafting_loops.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/speculative/eagle3.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py
- tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py
- tensorrt_llm/_torch/attention_backend/interface.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tests/integration/defs/test_e2e.py
- tensorrt_llm/_torch/speculative/spec_tree_manager.py
- cpp/tensorrt_llm/thop/attentionOp.cpp
- tensorrt_llm/_torch/speculative/drafting_loops.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/speculative/eagle3.py
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/thop/attentionOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/thop/attentionOp.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/thop/attentionOp.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
🧬 Code graph analysis (11)
tensorrt_llm/_torch/speculative/interface.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
TrtllmAttention(1172-1609)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
attn_metadata(124-125)
tests/unittest/_torch/speculative/test_draft_token_tree_sampling.py (2)
tensorrt_llm/_torch/speculative/drafting_loops.py (3)
ChainDrafter (289-476), forward (300-430), sample (432-469)

tensorrt_llm/_torch/speculative/spec_tree_manager.py (1)
SpecTreeManager(7-395)
tests/unittest/_torch/speculative/test_draft_token_prepare_for_generation.py (3)
tensorrt_llm/_torch/speculative/drafting_loops.py (1)
prepare_for_generation_with_tree_decoding (110-286)

tensorrt_llm/_torch/speculative/eagle3.py (2)

Eagle3ResourceManager (23-109), Eagle3SpecMetadata (113-266)

tensorrt_llm/_torch/speculative/spec_tree_manager.py (1)
SpecTreeManager(7-395)
tensorrt_llm/_torch/attention_backend/interface.py (2)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (1)
batch_size (167-167)

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
spec_metadata(116-117)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (3)
llm_root (192-193), llm_venv (702-719), llm_models_root (80-94)
tensorrt_llm/_torch/speculative/drafting_loops.py (4)
tensorrt_llm/_torch/attention_backend/interface.py (10)
AttentionMetadata (43-347), num_seqs (249-253), seq_lens (171-172), seq_lens (175-196), seq_lens_cuda (219-220), on_update (158-168), num_contexts (199-200), num_contexts (203-206), num_tokens (271-272), forward (605-628)

tensorrt_llm/_torch/speculative/eagle3.py (2)

Eagle3SpecMetadata (113-266), forward (362-484)

tensorrt_llm/_torch/speculative/interface.py (1)

SpecMetadata (168-256)

tensorrt_llm/_torch/speculative/spec_tree_manager.py (1)
SpecTreeManager(7-395)
tensorrt_llm/_torch/speculative/model_drafter.py (1)
tensorrt_llm/runtime/generation.py (1)
max_draft_tokens(1319-1322)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
tensorrt_llm/_torch/speculative/eagle3.py (2)
Eagle3ResourceManager (23-109), Eagle3SpecMetadata (113-266)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)

ResourceManager (1203-1246), get_resource_manager (1215-1216), KVCacheManager (151-1154)

tensorrt_llm/llmapi/llm_args.py (1)

is_linear_tree (646-649)

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)

spec_metadata (116-117), attn_metadata (124-125)

tensorrt_llm/_torch/pyexecutor/scheduler.py (2)

ScheduledRequests (20-41), batch_size (37-38)

tensorrt_llm/_torch/attention_backend/interface.py (1)
AttentionMetadata(43-347)
tensorrt_llm/_torch/attention_backend/trtllm.py (3)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
spec_metadata (116-117)

tensorrt_llm/_utils.py (1)

get_sm_version (732-734)

tensorrt_llm/_torch/speculative/interface.py (1)
is_eagle3(38-39)
tensorrt_llm/_torch/speculative/eagle3.py (2)
tensorrt_llm/_torch/attention_backend/interface.py (2)
seq_lens (171-172), seq_lens (175-196)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
get_slot(1164-1165)
🪛 Ruff (0.14.1)
tensorrt_llm/_torch/speculative/drafter.py
67-67: Found useless expression. Either assign it to a variable or remove it.
(B018)
tensorrt_llm/_torch/speculative/spec_tree_manager.py
16-16: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
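For reference, the RUF012-style fix looks like this (the attribute name here is hypothetical):

```python
from typing import ClassVar, List

class SpecTreeManagerExample:
    # Shared, mutable class-level default; annotate with ClassVar so Ruff (RUF012) is satisfied.
    default_tree_choices: ClassVar[List[List[int]]] = [[0], [0, 0], [1]]
```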
226-227: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
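Note that `strict=` exists only on Python 3.10+, which is why the eagle3.py comment above suggests a length assertion instead; on 3.10+ the fix is simply:

```python
a = [1, 2, 3]
b = ["x", "y", "z"]
pairs = list(zip(a, b, strict=True))  # raises ValueError if the lengths ever diverge
```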
tensorrt_llm/_torch/attention_backend/trtllm.py
1062-1062: Found useless expression. Either assign it to a variable or remove it.
(B018)
1063-1063: Found useless expression. Either assign it to a variable or remove it.
(B018)
1064-1064: Found useless expression. Either assign it to a variable or remove it.
(B018)
1118-1118: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
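The rationale: `python -O` strips `assert` statements, so an `assert False` guard silently disappears; raising directly does not:

```python
def reject_unsupported_mode() -> None:
    # `assert False, "..."` is removed under `python -O`; raise explicitly instead.
    raise AssertionError("unsupported spec-dec configuration")
```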
tensorrt_llm/_torch/speculative/eagle3.py
186-186: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
197-197: Comment contains ambiguous ) (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?
(RUF003)
214-214: Consider [0, *accepted_path] instead of concatenation
(RUF005)
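That is, prefer unpacking over concatenation:

```python
accepted_path = [3, 1, 4]
path_with_root = [0, *accepted_path]  # preferred over [0] + accepted_path (RUF005)
```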
🔇 Additional comments (2)
tensorrt_llm/_torch/speculative/model_drafter.py (2)
576-581: Using max_total_draft_tokens for static tree is correct

Looping to max_total_draft_tokens aligns with tree semantics and buffer sizes.
170-173: Confirmed: field is properly initialized on all paths

Verification shows `py_num_accepted_draft_tokens_indices` is initialized in `LlmRequest.__init__()` at line 485 as `self.py_num_accepted_draft_tokens_indices = []`. This initialization applies to all instance creation paths:

- Direct instantiation: `LlmRequest(request_id=...)` goes through `__init__`
- Child copy: `LlmRequest(llm_request=child)` also goes through `__init__`

Since `_create_draft_request()` creates new requests via the constructor, all instances get the field initialized. The assignment at line 172 safely copies from the source request, which is guaranteed to have the field initialized. No `AttributeError` risk exists.
PR_Github #22532 [ run ] completed with state

b17a837 to 655723b Compare

/bot run --disable-fail-fast

PR_Github #22626 [ run ] triggered by Bot. Commit:

PR_Github #22626 [ run ] completed with state

/bot run

PR_Github #22749 [ run ] triggered by Bot. Commit:

PR_Github #22749 [ run ] completed with state

803b357 to e24bce4 Compare

/bot run --disable-fail-fast

PR_Github #25342 [ kill ] triggered by Bot. Commit:

PR_Github #25316 [ run ] completed with state

PR_Github #25342 [ kill ] completed with state

PR_Github #25345 [ run ] triggered by Bot. Commit:

PR_Github #25345 [ run ] completed with state

e24bce4 to 1cefc7b Compare

/bot run

PR_Github #25378 [ run ] triggered by Bot. Commit:

PR_Github #25378 [ run ] completed with state

/bot run

PR_Github #25409 [ run ] triggered by Bot. Commit:

PR_Github #25409 [ run ] completed with state
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
1cefc7b to 8977e4e Compare

/bot run --disable-fail-fast

PR_Github #25411 [ run ] triggered by Bot. Commit:

PR_Github #25411 [ run ] completed with state

/bot run

PR_Github #25420 [ run ] triggered by Bot. Commit:

PR_Github #25420 [ run ] completed with state

/bot run

PR_Github #25426 [ run ] triggered by Bot. Commit:

PR_Github #25426 [ run ] completed with state
(NVIDIA#9649) Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> [None][fix] fix a bug: deepseek_fp8_block_scales in TRTLLMGEN-MoE use 2D x_sf instead of 1D (NVIDIA#9658) Signed-off-by: xxi <xxi@nvidia.com> [TRTLLM-9372][feat] Enable CuteDSL MoE with Large EP (NVIDIA#9592) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> [TRTLLM-9522][chore] implement default `attach_multimodal_embeddings` (NVIDIA#9664) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com> [TRTLLM-9660][feat] Convert cuteDSL GEMM to opt-in feature (NVIDIA#9682) Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> [None][fix] enable hmac in RPC (NVIDIA#9745) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [https://nvbugs/5703953][fix] Preserving ip:port for trtllm-serve before initializing llm (NVIDIA#9646) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> [None][infra] Waive failed cases for main branch on 12/07 (NVIDIA#9769) Signed-off-by: qqiao <qqiao@nvidia.com> [None][fix] Several minor fixes to CI setting (NVIDIA#9765) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> [OMNIML-3036][doc] Re-branding TensorRT-Model-Optimizer as Nvidia Model-Optimizer (NVIDIA#9679) Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> [None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce (NVIDIA#9314) Signed-off-by: Ludwig Schneider <lschneider@nvidia.com> [TRTLLM-9000][feat] Add multi-node Perf Tests into CI (NVIDIA#8800) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> [None][test] add ntp tolerance in time metrics verification (NVIDIA#9741) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com> [TRTLLM-9603][feat] Enable ConfigurableMoE test in the CI (NVIDIA#9645) [https://nvbugs/5422621][test] Add GB 200 WIDEEP test case for RCCA 5422621 (NVIDIA#9506) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com> [None][fix] Fix two tuning cache miss issues. (NVIDIA#9743) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> [None][infra] Check in most recent lock file from nightly pipeline Signed-off-by: TensorRT LLM <90828364+tensorrt-cicd@users.noreply.github.com> [TRTLLM-9706] [doc] Update wide EP documents (NVIDIA#9724) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> [https://nvbugs/5666804][test] only adding sampler config for limited models (NVIDIA#9512) Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com> Co-authored-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com> Co-authored-by: Larry Xu <197874197+LarryXFly@users.noreply.github.com> [None][infra] Waive failed cases for main on 12/08 (NVIDIA#9773) Signed-off-by: qqiao <qqiao@nvidia.com> [None][chore] Move the rocketkv e2e test to post-merge (NVIDIA#9768) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> [None][chore] Enable tvm_ffi for cute dsl nvfp4_gemm to reduce host overhead. 
(NVIDIA#9690) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com> [TRTLLM-9431][perf] Enable multistream for Linear Attention in Qwen3-… (NVIDIA#9696) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> [None][chore] Remove closed bugs (NVIDIA#9770) Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> [None][infra] update mooncake in docker images (NVIDIA#9584) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com> Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com> [None][test] Add Kimi k2 WIDEEP perf and accuracy cases (NVIDIA#9686) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com> Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> [https://nvbugs/5527655][test] Add test case for RCCA 5527655 (NVIDIA#9511) Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com> [http://nvbugs/5649010][fix] fix test_auto_scaling.py::test_worker_restart timeout (NVIDIA#9775) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> [None][fix] Switch AutoDeploy's default allreduce strategy to NCCL (NVIDIA#9666) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com> [TRTLLM-9506][fix] Fix AR for DeepSeek-R1 2 model path (NVIDIA#9661) Signed-off-by: qgai <qgai@nvidia.com> ray + updatew works trtllm works in async env trtllm works in sync and async env ray + updatew works rebase to the updated verl server mode still cherry pick still cherry pick still cherry pick integrated http interface hang at RyExecutor create workers ray.remote clean code use tensorrt_llm.rlhf_utils Signed-off-by: Liwei Ma <liweim@nvidia.com> placement, asyncllm, and basic tests Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> connect sleep and wakeup; Add support to pass None to update_weights Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Batching ctx for IFB scheduler Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> accuracy WAR for TP>1: always use AllReduceStrategy.NCCL, refactored Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> fix e2e integration Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> update asyncllm, other nits Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> fix init setup Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Fix TRTLLMSampler logprobs perf Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> fix and cleanup Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> fix server Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Revert "Batching ctx for IFB scheduler" This reverts commit b51aac0 Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> update & address comments Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Description
In this PR, we implement the runtime logic for the draft token tree. Given the improved performance of capturable drafting loops (CDL), our implementation is also based on CDL. A non-CDL draft PR is available for reference, but it is not being considered for merging: #8109
Key idea:
For each draft layer, we take `max_total_draft_token + 1` draft tokens as input (using `self.draft_tokens_buffer` in `TreeDraftingLoopWrapper`). This keeps the input shape fixed for compatibility with CUDA graphs. Although it introduces some redundant computation, the overhead should be relatively small, since the computational load of a draft layer is not large.
Not all of these draft tokens are meaningful initially. As the draft layers execute, we continuously update their values (using `TreeDraftingLoopWrapper::extract_real_draft_tokens`).
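To make the fixed-size buffer idea concrete, here is a minimal PyTorch sketch. Everything beyond the names `draft_tokens_buffer`, `max_total_draft_token`, and `extract_real_draft_tokens` (the shapes, the `top_k`/`write_offset` parameters, and the sampling step) is an assumption for illustration, not the PR's actual implementation.

```python
import torch

# Minimal sketch of the fixed-size draft-token buffer (assumed shapes/params).
class TreeDraftingLoopSketch:

    def __init__(self, max_total_draft_tokens: int, batch_size: int,
                 device: str = "cpu"):  # the real runtime uses CUDA tensors
        # Fixed shape so a captured CUDA graph sees the same tensor on every
        # iteration; slot 0 holds the root token, the rest hold draft tokens.
        self.draft_tokens_buffer = torch.zeros(
            batch_size, max_total_draft_tokens + 1,
            dtype=torch.long, device=device)

    def extract_real_draft_tokens(self, layer_logits: torch.Tensor,
                                  top_k: int, write_offset: int) -> int:
        # Only `top_k` of this layer's sampled tokens are meaningful; write
        # them into the persistent buffer and return the next free offset.
        new_tokens = torch.topk(layer_logits, top_k, dim=-1).indices
        self.draft_tokens_buffer[:, write_offset:write_offset + top_k] = new_tokens
        return write_offset + top_k

loop = TreeDraftingLoopSketch(max_total_draft_tokens=6, batch_size=1)
fake_logits = torch.randn(1, 32000)  # stand-in for draft-layer-0 logits
next_offset = loop.extract_real_draft_tokens(fake_logits, top_k=2, write_offset=1)
```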
What we have done in this PR:
Introduced `LinearDraftingLoopWrapper` and `TreeDraftingLoopWrapper` (…).
Limitations
The program currently has some bugs that can cause illegal memory access during key-value cache rewinding. Therefore, the static tree is not yet fully ready, but we are close.
Next TODO:
Workflow
Changes to the type of attention kernel used (see the figure and sketch below).

Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.
See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`
Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
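As a concrete example, a hypothetical invocation that runs a single test stage without fail-fast might look like the following (the stage name is a placeholder and must match one of the real stage names from the CI mapping referenced below):

```
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
```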
For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.
kill
`kill`
Kill all running builds associated with the pull request.
skip
`skip --comment COMMENT`
Skip testing for the latest commit on the pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
`reuse-pipeline`
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.