
Conversation

@QiJune (Collaborator) commented on Sep 5, 2025

Summary by CodeRabbit

  • New Features
    • Enhanced CUDA Graph execution with support for speculative decoding and draft tokens.
    • More configurable graph behavior (batch sizing, padding, memory pool) with improved multi-GPU awareness.
  • Refactor
    • Decoupled the CUDA graph runner from the model engine with explicit, injectable dependencies.
    • Streamlined graph capture/replay logic and caching with draft-length awareness.
  • Tests
    • Updated unit tests to use a mock graph-runner helper and the new capture/replay API (extra boolean flag).

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from the specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail-fast behavior on build/test/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
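
For example, /bot run --disable-fail-fast --stage-list "A10-PyTorch-1" launches a pipeline that runs only the A10-PyTorch-1 test stage with fail-fast disabled.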

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing results without careful validation can break the top of tree.

@QiJune (Collaborator, Author) commented on Sep 5, 2025

/bot run

@tensorrt-cicd (Collaborator)
PR_Github #17812 [ run ] triggered by Bot

@coderabbitai bot (Contributor) commented on Sep 5, 2025

📝 Walkthrough

Refactors CUDAGraphRunner to a dependency-injected, engine-agnostic component with expanded APIs carrying speculative-decoding context and metadata. Updates ModelEngine to wire new parameters and draft-token CUDA buffers. Migrates unit tests to a new create_mock_cuda_graph_runner helper and adapts capture/replay call signatures with an added boolean flag.

Changes

  • CUDA Graph Runner Refactor (tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py): Replaces the engine-centric constructor with explicit keyword arguments; expands methods to pass/return speculative-decoding context and metadata; introduces per-(batch_size, draft_len) graph keys; revises capture/replay, padding, and memory-pool handling; integrates Mapping/MPIDist/ResourceManager dependencies.
  • Model Engine Integration (tensorrt_llm/_torch/pyexecutor/model_engine.py): Wires the new CUDAGraphRunner constructor parameters; adds and initializes draft-token CUDA buffers; propagates is_spec_decode, attention/spec metadata, and draft tokens through maybe_get_cuda_graph, needs_capture, capture, and replay.
  • Test Helper Update (tests/unittest/_torch/helpers.py): Removes the mock engine; adds create_mock_cuda_graph_runner(batch_size), which returns a configured CUDAGraphRunner using Mapping and kv_cache_manager_key; updates defaults (no padding, beam_width=1, max_draft_len=0).
  • Modeling Tests Migration (tests/unittest/_torch/modeling/test_modeling_exaone4.py, .../test_modeling_llama.py, .../test_modeling_llama_min_latency.py, .../test_modeling_mistral.py, .../test_modeling_mixtral.py, .../test_modeling_mllama.py, .../test_modeling_nemotron.py, .../test_modeling_phi3.py, .../test_modeling_qwen.py, .../test_modeling_qwen_moe.py): Replaces CUDAGraphRunner plus mock engine with create_mock_cuda_graph_runner; gates usage by scenario.use_cuda_graph; updates capture/replay signatures to include the extra boolean argument; minor import adjustments.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant ME as ModelEngine
  participant GR as CUDAGraphRunner
  participant FW as forward_fn
  participant Attn as AttentionMetadata
  participant Spec as SpecMetadata

  Note over ME,GR: New flow passes is_spec_decode, metadata, draft tokens
  ME->>GR: maybe_get_cuda_graph(batch, iter_counter, is_spec_decode, Attn, Spec?, draft_tokens_cuda)
  GR-->>ME: (can_use_graph, Attn', Spec')

  alt needs capture
    ME->>GR: needs_capture(batch_size, is_spec_decode)
    GR-->>ME: bool
    alt true
      ME->>GR: capture(batch_size, is_spec_decode, FW, initial_inputs)
      GR->>FW: forward(**initial_inputs) during capture
      FW-->>GR: outputs (captured)
      GR-->>ME: capture complete
    end
  end

  alt can_use_graph
    ME->>GR: replay(batch_size, is_spec_decode, current_inputs)
    GR->>GR: update static tensors, set draft_len from Spec'
    GR-->>ME: logits (replayed)
  else not eligible
    ME->>FW: forward(**current_inputs)
    FW-->>ME: logits
  end
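
To make the new control flow concrete, here is a minimal Python sketch of how a driver such as ModelEngine might use the refactored runner. Method names and argument order follow the sequence diagram above; the runner, forward_fn, and batch objects are illustrative placeholders rather than the actual TensorRT-LLM classes, so the real signatures may differ in detail.

def run_one_step(runner, forward_fn, batch, iter_counter, is_spec_decode,
                 attn_metadata, spec_metadata, draft_tokens_cuda, inputs):
    # Ask the runner whether this batch is eligible for CUDA graph execution.
    # When it is, graph-specific attention/spec metadata are returned.
    can_use_graph, graph_attn, graph_spec = runner.maybe_get_cuda_graph(
        batch, iter_counter, is_spec_decode, attn_metadata, spec_metadata,
        draft_tokens_cuda)

    if can_use_graph:
        batch_size = batch.batch_size  # assumed attribute for this sketch
        if runner.needs_capture(batch_size, is_spec_decode):
            # First time this (batch_size, draft_len) is seen: record the graph.
            runner.capture(batch_size, is_spec_decode, forward_fn, inputs)
        # Copy the current inputs into the static buffers and replay the graph.
        return runner.replay(batch_size, is_spec_decode, inputs)

    # Not eligible (e.g. unsupported batch size or ranks out of sync): run eagerly.
    return forward_fn(**inputs)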

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • byshiue
  • hypdeb
  • amukkara

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)

416-423: Fix int64→int32 dtype mismatch in CUDA-graph path (will crash at replay).

cuda_graph_runner allocates static input_ids/position_ids as int32; current inputs use default int (arange → int64). torch.Tensor.copy_ requires matching dtypes and will error at replay. Cast both to int32 before capture/replay.

Apply this diff:

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 429-429

tests/unittest/_torch/modeling/test_modeling_phi3.py (1)

322-329: Align dtypes with CUDA-graph static buffers (int32).

position_ids from arange default to int64; runner uses int32 buffers, causing copy_ dtype mismatch at replay. Cast inputs before capture/replay.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 335-335

tests/unittest/_torch/modeling/test_modeling_mllama.py (1)

429-436: Prevent dtype mismatch with CUDA-graph int32 inputs.

Ensure input_ids/position_ids are int32 before capture/replay to match runner’s static tensors.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 442-442

tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (1)

327-334: Cast position_ids to int32 for CUDA-graph replay.

Runner’s static tensors are int32; arange yields int64. Cast before capture to avoid copy_ errors.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 340-340

tests/unittest/_torch/modeling/test_modeling_mixtral.py (1)

322-329: Match int32 expectations in CUDA-graph path.

Cast inputs to int32 to align with cuda_graph_runner’s buffers.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 335-335

tests/unittest/_torch/modeling/test_modeling_qwen.py (1)

85-93: Python 3.8 compatibility: avoid built-in generics

dict[str, Any] requires Python 3.9+. Tests target 3.8+. Use typing.Dict.

-from typing import Any
+from typing import Any, Dict
@@
-def reduce_qwen_config(mem_for_full_model: int, config_dict: dict[str, Any]):
+def reduce_qwen_config(mem_for_full_model: int, config_dict: Dict[str, Any]):
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

400-406: Fix dtype of index tensors (must be torch.long).

gather_ids_cuda and previous_pos_indices_cuda are later used for tensor indexing (e.g., logits[gather_ids]), which requires LongTensor indices in PyTorch. Using torch.int (int32) risks runtime errors.

Apply:

-            self.gather_ids_cuda = torch.empty((self.max_num_tokens, ),
-                                               dtype=torch.int,
-                                               device='cuda')
-            self.previous_pos_indices_cuda = torch.empty(
-                (self.max_num_tokens, ), dtype=torch.int, device='cuda')
+            self.gather_ids_cuda = torch.empty((self.max_num_tokens, ),
+                                               dtype=torch.long,
+                                               device='cuda')
+            self.previous_pos_indices_cuda = torch.empty(
+                (self.max_num_tokens, ), dtype=torch.long, device='cuda')

Note: previous_batch_indices_cuda (Line 444) is also used as an index and should be torch.long for consistency. See additional snippet below.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

225-225: Inconsistent return value from replay method

The replay method returns output_ref, which is a callable weak reference, but the return type annotation suggests it should return Optional[torch.Tensor]. The method should either call the reference or update the return type.

Call the weak reference to get the actual tensor:

-        return output_ref
+        return output_ref() if output_ref else None
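
For context, this mirrors standard Python weakref semantics: a weakref.ref object must be called to dereference it, and the call returns None once the target has been collected. A tiny, self-contained illustration (independent of how the runner actually builds output_ref):

import weakref

import torch

output = torch.zeros(4)
output_ref = weakref.ref(output)   # store a weak reference, not the tensor itself
assert output_ref() is output      # calling the ref yields the live tensor
del output                         # once the last strong reference is gone...
assert output_ref() is None        # ...the ref dereferences to None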
🧹 Nitpick comments (11)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)

406-408: Release CUDA graph resources after test.

Call graph_runner.clear() after use to free the graph’s memory pool.

Example (near test end):

if graph_runner is not None:
    graph_runner.clear()
tests/unittest/_torch/modeling/test_modeling_phi3.py (1)

312-314: Exercise the CUDA-graph path in this test.

scenario.use_cuda_graph is never set to True here, so the CUDA-graph code isn’t exercised. Consider parameterizing to run both paths.

tests/unittest/_torch/helpers.py (1)

171-186: Helper factory looks good; add return type and optional knobs

Annotate the return type and expose use_mrope/enable_attention_dp to avoid future test drift.

-def create_mock_cuda_graph_runner(batch_size: int):
+def create_mock_cuda_graph_runner(batch_size: int) -> CUDAGraphRunner:
     return CUDAGraphRunner(
         use_cuda_graph=True,
         cuda_graph_padding_enabled=False,
         supported_batch_sizes=[batch_size],
         max_supported_batch_size=batch_size,
         max_batch_size=batch_size,
         max_beam_width=1,
         max_draft_len=0,
         use_mrope=False,
         spec_config=None,
         cuda_graph_mem_pool=None,
         enable_attention_dp=False,
         mapping=Mapping(),
         dist=None,
         kv_cache_manager_key=ResourceManagerType.KV_CACHE_MANAGER)

Optionally:

-def create_mock_cuda_graph_runner(batch_size: int) -> CUDAGraphRunner:
+def create_mock_cuda_graph_runner(
+    batch_size: int,
+    *,
+    use_mrope: bool = False,
+    enable_attention_dp: bool = False,
+) -> CUDAGraphRunner:
@@
-        use_mrope=False,
+        use_mrope=use_mrope,
@@
-        enable_attention_dp=False,
-        mapping=Mapping(),
+        enable_attention_dp=enable_attention_dp,
+        mapping=Mapping(enable_attention_dp=enable_attention_dp),
tests/unittest/_torch/modeling/test_modeling_exaone4.py (1)

25-25: Fix E402: keep imports at top-of-file

Move this import to the top with the other imports to satisfy Ruff E402.

+from _torch.helpers import create_mock_cuda_graph_runner
@@
-from _torch.helpers import create_mock_cuda_graph_runner
tests/unittest/_torch/modeling/test_modeling_nemotron.py (2)

320-321: Minor: avoid magic number for batch size in tests.

Consider a local graph_bs = 1 to avoid repeating the literal and ease future edits.


343-344: Optional: assert non-None replay output.

replay returns an Optional (weak-ref). Add a quick assert logits is not None for clearer failures.

tests/unittest/_torch/modeling/test_modeling_mistral.py (2)

402-402: Minor: avoid magic number for batch size.

Define a local graph_bs = 1 and reuse it in capture/replay calls.


422-423: Optional: guard against None from replay.

Add assert logits is not None before comparisons to make weak-ref issues obvious.

tests/unittest/_torch/modeling/test_modeling_llama.py (2)

328-329: Minor: avoid magic number for batch size.

Use a local graph_bs = 1 and reuse it.


350-351: Optional: assert non-None replay output.

Add assert logits is not None to surface weak-ref invalidation early.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

412-413: Use instance attribute for clarity.

Prefer self.spec_config.max_draft_len over the shadowed spec_config.

-            self.max_draft_len = spec_config.max_draft_len
+            self.max_draft_len = self.spec_config.max_draft_len
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 25389c9 and c3143c4.

📒 Files selected for processing (13)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (10 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
  • tests/unittest/_torch/helpers.py (2 hunks)
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_llama.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_mistral.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_mllama.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_phi3.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_qwen.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py
  • tests/unittest/_torch/helpers.py
  • tests/unittest/_torch/modeling/test_modeling_llama.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_mistral.py
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py
  • tests/unittest/_torch/modeling/test_modeling_phi3.py
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py
  • tests/unittest/_torch/modeling/test_modeling_qwen.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py
  • tests/unittest/_torch/helpers.py
  • tests/unittest/_torch/modeling/test_modeling_llama.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_mistral.py
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py
  • tests/unittest/_torch/modeling/test_modeling_phi3.py
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py
  • tests/unittest/_torch/modeling/test_modeling_qwen.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py
  • tests/unittest/_torch/helpers.py
  • tests/unittest/_torch/modeling/test_modeling_llama.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_mistral.py
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py
  • tests/unittest/_torch/modeling/test_modeling_phi3.py
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py
  • tests/unittest/_torch/modeling/test_modeling_qwen.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
🧬 Code graph analysis (13)
tests/unittest/_torch/modeling/test_modeling_mllama.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_exaone4.py (3)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2211-2313)
tests/unittest/_torch/helpers.py (3)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
  • CUDAGraphRunner (22-316)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • ResourceManagerType (47-52)
tensorrt_llm/mapping.py (1)
  • Mapping (32-513)
tests/unittest/_torch/modeling/test_modeling_llama.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
  • CUDAGraphRunner (22-316)
  • needs_capture (139-141)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_mistral.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_mixtral.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_phi3.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_nemotron.py (3)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2211-2313)
tests/unittest/_torch/modeling/test_modeling_qwen.py (3)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2211-2313)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (8)
tensorrt_llm/mapping.py (1)
  • Mapping (32-513)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • AttentionMetadata (39-347)
  • create_cuda_graph_metadata (288-328)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • MPIDist (98-145)
  • tp_size (46-47)
  • tp_allgather (138-139)
tensorrt_llm/_torch/expert_statistic.py (2)
  • ExpertStatistic (10-98)
  • set_iter (32-36)
tensorrt_llm/_torch/modules/multi_stream_utils.py (1)
  • with_multi_stream (26-32)
tensorrt_llm/_torch/speculative/interface.py (2)
  • SpecMetadata (122-217)
  • create_cuda_graph_metadata (181-192)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • ResourceManagerType (47-52)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • can_run_cuda_graph (31-32)
🪛 Ruff (0.12.2)
tests/unittest/_torch/modeling/test_modeling_exaone4.py

25-25: Module level import not at top of file

(E402)

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py

43-43: Undefined name DecodingBaseConfig

(F821)


72-72: Undefined name Request

(F821)

🔇 Additional comments (21)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)

7-7: Good switch to factory helper.

Decoupling tests from CUDAGraphRunner construction via create_mock_cuda_graph_runner improves isolation and avoids circular deps.

tests/unittest/_torch/modeling/test_modeling_phi3.py (1)

7-7: Factory-based runner import looks good.

Keeps tests engine-agnostic and matches the new API surface.

tests/unittest/_torch/modeling/test_modeling_mllama.py (1)

6-6: Nice: helper import unifies CUDA-graph setup across tests.

tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (1)

6-6: Good move to create_mock_cuda_graph_runner.

tests/unittest/_torch/modeling/test_modeling_mixtral.py (1)

6-6: Helper import LGTM.

tests/unittest/_torch/modeling/test_modeling_exaone4.py (1)

355-357: CUDA graph capture/replay usage matches new API

Capture with the added boolean flag and subsequent replay look correct.

Also applies to: 364-364

tests/unittest/_torch/modeling/test_modeling_qwen.py (1)

20-20: Import of helper factory looks correct

Aligned with new factory-based setup.

tests/unittest/_torch/modeling/test_modeling_nemotron.py (2)

7-7: LGTM: switched to factory helper.

Importing create_mock_cuda_graph_runner decouples tests from the runner class.


335-337: LGTM: updated capture signature correctly.

is_spec_decode=False here is appropriate for pure decoding.

tests/unittest/_torch/modeling/test_modeling_mistral.py (2)

10-10: LGTM: switched to factory helper.

Keeps tests aligned with the new API surface.


416-416: LGTM: capture API use matches new signature.

tests/unittest/_torch/modeling/test_modeling_llama.py (2)

7-7: LGTM: switched to factory helper.


343-345: LGTM: capture API updated correctly.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

460-474: LGTM: runner decoupled and fully parameterized.

Constructor wiring looks correct and removes the prior dependency.


2258-2264: Confirmed safe: draft_tokens_cuda=None is tolerated
maybe_get_cuda_graph only references draft_tokens_cuda when spec_metadata is non-null, so passing None for non-speculative paths is ignored and causes no errors.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (6)

32-49: Well-designed dependency injection refactor

Great work converting from engine-centric initialization to explicit dependency injection! This change successfully breaks the circular dependency by accepting individual parameters rather than the full engine object. The use of keyword-only arguments ensures clarity at call sites.


77-85: Enhanced API with speculative decoding context

The updated maybe_get_cuda_graph signature properly extends support for speculative decoding by accepting is_spec_decode, spec_metadata, and draft_tokens_cuda parameters. This provides the necessary context for graph eligibility decisions.


101-111: Distributed batch size synchronization logic looks correct

The multi-GPU synchronization properly gathers batch information across TP ranks and validates consistency before allowing CUDA graph execution. This ensures all ranks are in sync.

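As a side note for readers unfamiliar with the pattern, the sketch below shows the general shape of such a cross-rank check. It only assumes that dist.tp_allgather gathers one Python object per TP rank (the MPIDist method referenced in the code graph above); the field names and exact policy are illustrative, not the runner's actual implementation.

def all_ranks_agree_on_graph(dist, can_run_cuda_graph, batch_size):
    # Single-process / no TP communicator: the local decision is final.
    if dist is None:
        return can_run_cuda_graph
    # Gather (eligibility, batch_size) from every TP rank.
    gathered = dist.tp_allgather([can_run_cuda_graph, batch_size])
    all_eligible = all(flag for flag, _ in gathered)
    same_batch_size = len({bs for _, bs in gathered}) == 1
    # Replay a graph only if every rank can run one and all ranks see the
    # same batch size; otherwise fall back to eager forward on all ranks.
    return all_eligible and same_batch_size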

127-135: Proper metadata creation for CUDA graphs

The code correctly creates graph-specific metadata for both attention and speculative decoding, properly handling the draft tokens buffer assignment.


168-171: Conditional mrope position deltas handling

Good use of the use_mrope flag to conditionally include position deltas in the static tensors.


266-271: Comprehensive dummy request configuration

The padding dummy request creation properly includes all necessary parameters for speculative decoding support including max_num_draft_tokens, use_mrope, and max_beam_width.

@tensorrt-cicd (Collaborator)
PR_Github #17812 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13336 completed with status: 'FAILURE'

@QiJune (Collaborator, Author) commented on Sep 19, 2025

/bot run

@tensorrt-cicd (Collaborator)
PR_Github #19305 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)
PR_Github #19305 [ run ] completed with state FAILURE

@QiJune (Collaborator, Author) commented on Sep 19, 2025

/bot run

@tensorrt-cicd (Collaborator)
PR_Github #19310 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)
PR_Github #19310 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14499 completed with status: 'FAILURE'

@QiJune (Collaborator, Author) commented on Sep 25, 2025

/bot run

@tensorrt-cicd (Collaborator)
PR_Github #19948 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)
PR_Github #19948 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15018 completed with status: 'FAILURE'

@QiJune (Collaborator, Author) commented on Sep 25, 2025

/bot run

@tensorrt-cicd (Collaborator)
PR_Github #23615 [ run ] triggered by Bot. Commit: ad5be14

@tensorrt-cicd (Collaborator)
PR_Github #23615 [ run ] completed with state SUCCESS. Commit: ad5be14
/LLM/main/L0_MergeRequest_PR pipeline #17769 completed with status: 'FAILURE'

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
@QiJune (Collaborator, Author) commented on Nov 10, 2025

/bot run

@hypdeb (Collaborator) left a comment:
LGTM

@tensorrt-cicd (Collaborator)
PR_Github #23993 [ run ] triggered by Bot. Commit: cb59256

@QiJune enabled auto-merge (squash) on November 10, 2025 at 11:15
@QiJune (Collaborator, Author) commented on Nov 10, 2025

/bot run

@tensorrt-cicd (Collaborator)
PR_Github #24023 [ run ] triggered by Bot. Commit: cb59256

@tensorrt-cicd (Collaborator)
PR_Github #23993 [ run ] completed with state ABORTED. Commit: cb59256
/LLM/main/L0_MergeRequest_PR pipeline #18070 completed with status: 'FAILURE'

@tensorrt-cicd (Collaborator)
PR_Github #24023 [ run ] completed with state SUCCESS. Commit: cb59256
/LLM/main/L0_MergeRequest_PR pipeline #18098 completed with status: 'FAILURE'

@QiJune (Collaborator, Author) commented on Nov 11, 2025

/bot run

@tensorrt-cicd (Collaborator)
PR_Github #24080 [ run ] triggered by Bot. Commit: cb59256

@tensorrt-cicd (Collaborator)
PR_Github #24080 [ run ] completed with state SUCCESS. Commit: cb59256
/LLM/main/L0_MergeRequest_PR pipeline #18147 completed with status: 'FAILURE'

@QiJune (Collaborator, Author) commented on Nov 11, 2025

/bot run

@tensorrt-cicd (Collaborator)
PR_Github #24132 [ run ] triggered by Bot. Commit: cb59256

@tensorrt-cicd (Collaborator)
PR_Github #24132 [ run ] completed with state SUCCESS. Commit: cb59256
/LLM/main/L0_MergeRequest_PR pipeline #18193 completed with status: 'FAILURE'

@QiJune (Collaborator, Author) commented on Nov 11, 2025

/bot run

@tensorrt-cicd (Collaborator)
PR_Github #24184 [ run ] triggered by Bot. Commit: cb59256

@tensorrt-cicd (Collaborator)
PR_Github #24184 [ run ] completed with state SUCCESS. Commit: cb59256
/LLM/main/L0_MergeRequest_PR pipeline #18235 completed with status: 'SUCCESS'

@QiJune merged commit 524754b into NVIDIA:main on Nov 11, 2025 (5 checks passed)
suyoggupta pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request on Nov 12, 2025:
…and cuda graph runner (NVIDIA#7572)

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>