
Conversation

@ziyixiong-nv (Collaborator) commented Sep 17, 2025

Summary by CodeRabbit

  • New Features

    • Added support for chunked context requests in attention, including scenarios with draft tokens.
  • Performance

    • Improved efficiency and throughput by optimizing cache updates when mixing context extension and generation.
  • Bug Fixes

    • Ensured consistent behavior between CUDA graph capture and normal execution paths.
    • Prevented unintended cache updates for unsupported attention metadata types, improving stability.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from the specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
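
For illustration, two typical invocations using the flags documented above (the stage name and pipeline ID are placeholders):

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
/bot run --reuse-test 18916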

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping tests without careful validation can break the top of tree.
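
An illustrative example (the comment text is arbitrary):

/bot skip --comment "Docs-only change; no tests affected"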

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing stale results without careful validation can break the top of tree.

@ziyixiong-nv requested a review from a team as a code owner September 17, 2025 06:42
@ziyixiong-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #18916 [ run ] triggered by Bot

@coderabbitai bot (Contributor) commented Sep 17, 2025

📝 Walkthrough

Adds conditional KV-cache lens updates for chunked-context requests in both preprocess and postprocess paths, applied only when attention metadata is TrtllmAttentionMetadata. Introduces num_chunked_ctx_requests and updates kv_lens_cuda slices accordingly to maintain consistency for CUDA graph capture and normal runs.

Changes

Cohort / File(s): KV-cache lens handling for chunked context
tensorrt_llm/_torch/pyexecutor/model_engine.py

Summary of changes:
  • Add num_chunked_ctx_requests from attn_metadata
  • Conditionally update kv_lens_cuda slices based on num_chunked_ctx_requests and num_ctx_requests/num_seqs
  • Apply identical logic in _preprocess_inputs and _postprocess_inputs
  • Guard updates to TrtllmAttentionMetadata only; leave other metadata paths unchanged
  • Note: generation with draft_tokens is considered chunked when extend_ctx is True

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Caller
  participant ME as ModelEngine
  participant Pre as _preprocess_inputs
  participant Meta as TrtllmAttentionMetadata
  participant GPU as kv_lens_cuda

  C->>ME: run(...)
  ME->>Pre: preprocess(inputs, attn_metadata)
  Pre->>Meta: inspect num_ctx_requests, num_seqs,<br/>num_gen_requests, num_chunked_ctx_requests
  alt attn_metadata is TrtllmAttentionMetadata
    alt num_chunked_ctx_requests > 0
      Pre->>GPU: update kv_lens_cuda[num_ctx - num_chunked : num_ctx]<br/>= prev_offsets[:num_chunked]
    else No chunked context
      Pre->>GPU: update kv_lens_cuda[num_ctx : num_seqs]<br/>= prev_offsets[:num_gen]
    end
  else Other metadata types
    Note over Pre,Meta: No kv_lens_cuda updates
  end
  ME-->>C: execute model step(s)

  rect rgba(220,240,255,0.5)
  note over ME: Postprocess path mirrors updates
  ME->>ME: _postprocess_inputs(...)
  ME->>Meta: re-evaluate counts
  alt TrtllmAttentionMetadata
    alt num_chunked_ctx_requests > 0
      ME->>GPU: update same chunked slice
    else No chunked context
      ME->>GPU: update generation slice
    end
  else Other metadata
    Note over ME,Meta: No kv_lens_cuda updates
  end
  end
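As a rough, self-contained Python sketch of the branch logic diagrammed above. Names such as prev_offsets mirror the diagram, and the assignment semantics follow the arrows shown; the actual implementation in tensorrt_llm/_torch/pyexecutor/model_engine.py may differ in detail (for example, it may add offsets rather than assign them):

# Sketch only; attribute names are taken from the walkthrough and diagram above.
from tensorrt_llm._torch.attention_backend.trtllm import TrtllmAttentionMetadata

def update_kv_lens(attn_metadata, prev_offsets):
    # Only TrtllmAttentionMetadata exposes kv_lens_cuda; other backends are left untouched.
    if not isinstance(attn_metadata, TrtllmAttentionMetadata):
        return
    num_ctx = attn_metadata.num_ctx_requests
    num_seqs = attn_metadata.num_seqs
    num_chunked = attn_metadata.num_chunked_ctx_requests
    if num_chunked > 0:
        # Chunked-context requests occupy the tail of the context slice.
        attn_metadata.kv_lens_cuda[num_ctx - num_chunked:num_ctx] = prev_offsets[:num_chunked]
    else:
        # No chunked context: update only the generation requests.
        num_gen = num_seqs - num_ctx
        attn_metadata.kv_lens_cuda[num_ctx:num_seqs] = prev_offsets[:num_gen]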

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description Check: ⚠️ Warning
    Explanation: The PR body contains only the repository template and the "@coderabbitai summary" marker, but lacks a concrete Description, Test Coverage details, and an explanation of the bug, code changes, or expected behavior. Required information such as the root cause, affected files/functions, steps to reproduce, and which tests were added or updated is missing, so the description is insufficient for review and must be completed.
    Resolution: Populate the Description with a concise bug summary, the root cause, and the exact code paths changed (e.g., the model_engine.py kv_lens_cuda updates and TrtllmAttentionMetadata/chunked-context behavior), and explain why the change is needed. In Test Coverage, list the specific tests added or modified, how to run them, and the expected outcomes, or attach a failing reproducer if available. Link the NVBugs ID and any related commits or benchmarks, note any performance/regression considerations, and complete the PR checklist (including CODEOWNERS/reviewer suggestions) before requesting review.
  • Docstring Coverage: ⚠️ Warning
    Explanation: Docstring coverage is 50.00%, below the required threshold of 80.00%.
    Resolution: Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title Check: ✅ Passed
    Explanation: The title concisely and accurately summarizes the primary change (correcting the logic that updates kv_lens_cuda) and follows the repository template by including the NVBugs ID and the [fix] type. It directly matches the changes described in the diff (kv_lens_cuda update logic for TrtllmAttentionMetadata and chunked-context handling in model_engine.py) and is specific enough for a reviewer scanning PR history, so it meets the repository's title requirements.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1234-1243: Add sanity checks to prevent slice length mismatches

Add cheap debug assertions before updating kv_lens_cuda to catch off‑by‑one/shape drift. Apply the diff below at this site and add the same check at the other similar site in tensorrt_llm/_torch/pyexecutor/model_engine.py (around lines 1185–1189 and 1226–1230). Note: attn_metadata.num_chunked_ctx_requests is set at ~lines 1698 and 1702.

                 # Only TrtllmAttentionMetadata has kv_lens_cuda.
                 if isinstance(inputs['attn_metadata'], TrtllmAttentionMetadata):
+                    # Debug sanity checks; safe to remove once proven stable.
+                    if num_chunked_ctx_requests > 0:
+                        assert 0 <= num_chunked_ctx_requests <= num_ctx_requests
+                    else:
+                        assert num_gen_requests == (num_seqs - num_ctx_requests)
                     if num_chunked_ctx_requests > 0:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 523a17d and 909cf18.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
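
A brief illustrative snippet applying several of the Python conventions quoted above; all identifiers here are invented for demonstration:

# Demonstration only; these names are hypothetical.
G_CALL_COUNT = 0          # global variable: upper SNAKE_CASE with 'G' prefix
DEFAULT_TIMEOUT_S = 30    # constant: upper SNAKE_CASE

class LatencyTracker:     # class name: PascalCase
    """Tracks request latencies.

    Attributes:
        latencies_ms: Recorded latencies in milliseconds.
    """

    def __init__(self):
        # Externally visible members are initialized in the constructor.
        self.latencies_ms = []

    def record(self, latency_ms):  # method name: snake_case
        self.latencies_ms.append(latency_ms)

    def percentile_99(self):
        """Returns the 99th-percentile latency, or 0.0 when empty."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        # A local that would start with a number takes the 'k' prefix.
        k_99th_percentile = ordered[int(0.99 * (len(ordered) - 1))]
        return k_99th_percentile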
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • num_ctx_tokens (263-264)
  • num_seqs (245-249)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
  • TrtllmAttentionMetadata (532-1107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd (Collaborator)

PR_Github #18916 [ run ] completed with state FAILURE

@ziyixiong-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #18926 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #18926 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #14185 completed with status: 'FAILURE'

@ziyixiong-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #19059 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #19059 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #14297 completed with status: 'FAILURE'

@ziyixiong-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #19074 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #19074 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14309 completed with status: 'FAILURE'

@ziyixiong-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #19132 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #19132 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14357 completed with status: 'FAILURE'

@ziyixiong-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #19163 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #19163 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14382 completed with status: 'FAILURE'

@ziyixiong-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #19189 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #19189 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14402 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@ziyixiong-nv merged commit 420f0fb into NVIDIA:main Sep 19, 2025
4 of 5 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 19, 2025 (NVIDIA#7790)
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025 (NVIDIA#7790)
MrGeva pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Sep 21, 2025 (NVIDIA#7790)