
Conversation

@kaiyux (Member) commented Aug 14, 2025

Summary by CodeRabbit

  • New Features

    • Profiling now records per-iteration device timings in addition to host timings, reporting the previous device step time (ms) alongside host_step_time; the first iteration shows “N/A”.
    • Device timing is integrated without changing existing profiling exports or workflows.
  • Refactor

    • Profiling range labels now use 1-based iteration indices for clearer profiling output.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail-fast behavior on build/test/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
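
Example: a typical invocation combining the flags above is /bot run --disable-fail-fast --stage-list "A10-PyTorch-1".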

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

@kaiyux requested a review from a team as a code owner August 14, 2025 15:41
@kaiyux requested a review from HuiGao-NV August 14, 2025 15:41
@coderabbitai bot (Contributor) commented Aug 14, 2025

📝 Walkthrough

Adds device-side per-iteration CUDA event timing to the executor’s profiling, alternating two event pairs to compute the previous iteration’s device time. Logging now includes host_step_time and prev_device_step_time. Also updates NVTX range labeling to 1-based iteration indexing. Existing profiling/trace mechanisms remain unchanged.
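
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the ping-pong event scheme described above (assuming a CUDA device; the loop body and variable names are illustrative stand-ins, not the executor's actual code):

import torch

# Two (start, end) event pairs alternate by iteration parity, so iteration N
# can be timed while iteration N-1's pair is synchronized and read back.
events = [
    (torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)),
    (torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)),
]
prev_device_step_time = None  # stays None until the second iteration

for it in range(4):
    start, end = events[it % 2]
    start.record()

    # Stand-in for the real device work (_forward_step in the executor).
    a = torch.randn(1024, 1024, device="cuda")
    torch.matmul(a, a)

    end.record()

    # Synchronize the *opposite* pair: it was recorded last iteration, so
    # waiting on it does not stall the work just enqueued above.
    if it > 0:
        prev_start, prev_end = events[(it + 1) % 2]
        prev_end.synchronize()
        prev_device_step_time = prev_start.elapsed_time(prev_end)  # milliseconds

    label = "N/A" if prev_device_step_time is None else f"{prev_device_step_time:.3f}ms"
    print(f"iter {it + 1}: prev_device_step_time = {label}")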

Changes

Cohort / File(s): Profiling and device timing augmentation (tensorrt_llm/_torch/pyexecutor/py_executor.py)
Change Summary: Adds two CUDA event pairs and logic to record start/end events per iteration, alternating by parity; records start events at iteration start and end events at iteration end; synchronizes the opposite pair to compute prev_device_step_time (ms) for the previous iteration; the first iteration reports device time as N/A; replaces the prior elapsed_time log field with host_step_time and prev_device_step_time; preserves chrome trace and cudaProfilerStart/Stop flows; updates the NVTX range label to use a 1-based iteration index for _forward_step.

Sequence Diagram(s)

sequenceDiagram
    participant Executor as PyExecutor
    participant Prof as _profiler ctx
    participant CUDA as CUDA Events
    participant Log as Logger

    loop Each iteration (it)
        Executor->>Prof: Enter profiling iteration
        alt it % 2 == 0
            Executor->>CUDA: Record start_event_1 (create if needed)
        else
            Executor->>CUDA: Record start_event_2 (create if needed)
        end

        Executor->>Executor: _forward_step (NVTX range with it+1)

        alt it % 2 == 0
            Executor->>CUDA: Record end_event_1
            Executor->>CUDA: Sync start_event_2/end_event_2 → compute prev_device_step_time
        else
            Executor->>CUDA: Record end_event_2
            Executor->>CUDA: Sync start_event_1/end_event_1 → compute prev_device_step_time
        end

        Executor->>Executor: Compute host_step_time
        Executor->>Log: Log host_step_time and prev_device_step_time (N/A on first)
        Prof-->>Executor: Continue/exit profiling
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

444-444: Use perf_counter and unify host/device units with consistent formatting

  • perf_counter provides higher-precision monotonic timing than time.time for durations.
  • Keep units consistent with device timings (ms) and avoid mutating prev_device_step_time’s type between float and str.
@@
-                end_time = time.time()
+                end_time = time.perf_counter()
@@
-                if prev_device_step_time is None:
-                    prev_device_step_time = "N/A" # Handle first iteration
-                else:
-                    prev_device_step_time = f"{prev_device_step_time}ms"
+                if prev_device_step_time is None:
+                    prev_device_step_time_str = "N/A"  # Handle first iteration
+                else:
+                    prev_device_step_time_str = f"{prev_device_step_time:.3f}ms"
@@
-                    f"host_step_time = {end_time - start_time}s, "
-                    f"prev_device_step_time = {prev_device_step_time}, "
+                    f"host_step_time = {((end_time - start_time) * 1e3):.3f}ms, "
+                    f"prev_device_step_time = {prev_device_step_time_str}, "
@@
-            start_time = time.time()
+            start_time = time.perf_counter()

Also applies to: 456-460, 468-470, 483-483
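
As a standalone illustration of this suggestion (variable names are hypothetical), a minimal sketch of perf_counter-based host timing with consistent millisecond formatting:

import time

# time.perf_counter() is monotonic and high-resolution, unlike time.time(),
# so it is the appropriate clock for measuring durations.
start_time = time.perf_counter()
sum(i * i for i in range(1_000_000))  # stand-in for the host-side step work
end_time = time.perf_counter()

# Keep host and device timings in the same unit (ms) at the same precision,
# and keep the display string separate from the numeric value.
host_step_time_ms = (end_time - start_time) * 1e3
print(f"host_step_time = {host_step_time_ms:.3f}ms")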


1428-1430: Fix long NVTX label line (Ruff E501) by splitting the f-string

This preserves the new 1-based labeling while satisfying the linter.

-        @nvtx_range(
-            f"[Executor] _forward_step {self.model_engine.iter_counter + 1}: {len(scheduled_requests.context_requests)} ctx reqs, {len(scheduled_requests.generation_requests)} gen reqs"
-        )
+        @nvtx_range(
+            f"[Executor] _forward_step {self.model_engine.iter_counter + 1}: "
+            f"{len(scheduled_requests.context_requests)} ctx reqs, "
+            f"{len(scheduled_requests.generation_requests)} gen reqs"
+        )
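
For context, a sketch of the same 1-based labeling using PyTorch's built-in NVTX bindings (torch.cuda.nvtx) instead of the repository's nvtx_range decorator; the function and request counts are placeholders:

import torch

def forward_step(it: int, n_ctx: int, n_gen: int) -> None:
    # Push an NVTX range whose label mirrors the split f-string above,
    # using 1-based iteration indexing for readability in the profiler.
    torch.cuda.nvtx.range_push(
        f"[Executor] _forward_step {it + 1}: "
        f"{n_ctx} ctx reqs, {n_gen} gen reqs")
    try:
        pass  # the model forward would run here
    finally:
        torch.cuda.nvtx.range_pop()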

1-1: Add NVIDIA copyright header (current year) per guidelines

The file is missing the required NVIDIA header.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

If the repository standard also includes an SPDX identifier (e.g., Apache-2.0), please append it accordingly.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between afb116f and f64ffcf.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

432-432: Line too long (123 > 120)

(E501)


1428-1428: Line too long (185 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

444-455: Alternating two event pairs to report the previous iteration’s device time is sound

The parity-based event-pair scheme correctly avoids a device-wide sync while still yielding the prior iteration's duration. Once the events are recorded on the right stream, this will be both accurate and low-overhead.

@kaiyux (Member, Author) commented Aug 14, 2025

/bot run

@kaiyux requested a review from QiJune August 14, 2025 15:58
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

406-414: Record CUDA events on the actual compute stream to ensure accurate device timings (and fix comment grammar).

Recording without specifying a stream risks skewed device timings when compute runs on non-default streams (sampler/forward can switch streams). Track the stream at iteration start and record both start/end events on it. Also tweak the comment grammar.

Apply this diff:

-        # These events are used to record the time of the previous batch.
-        # We need two set of the start-end events to record the time through
-        # a ping-pong way so that it works with overlap scheduler.
+        # These events are used to record the time of the previous batch.
+        # We need two sets of start/end events to record the time in a ping-pong
+        # fashion so that it works with the overlap scheduler.
         start_event_1 = None
         end_event_1 = torch.cuda.Event(enable_timing=True)
         start_event_2 = None
         end_event_2 = torch.cuda.Event(enable_timing=True)
         prev_device_step_time = None
+        # Track the streams used to record timing events to avoid default-stream skew.
+        stream_1 = None
+        stream_2 = None

450-462: Record end events on the same stream as their corresponding start events.

Without passing the stream handle, Event.record() uses the current/default stream, which can under-report device time if compute ran on a different stream.

Apply this diff:

-                if it % 2 == 0:
-                    end_event_1.record()
+                if it % 2 == 0:
+                    end_event_1.record(stream_1)
                     if start_event_2 is not None:
                         end_event_2.synchronize()
                         prev_device_step_time = start_event_2.elapsed_time(
                             end_event_2)
                 else:
-                    end_event_2.record()
+                    end_event_2.record(stream_2)
                     if start_event_1 is not None:
                         end_event_1.synchronize()
                         prev_device_step_time = start_event_1.elapsed_time(
                             end_event_1)

492-499: Record start events on the compute stream captured at iteration start.

Capture the current compute stream and record the start event on it; this pairs with the end-event change above.

Apply this diff:

             if it % 2 == 0:
                 if start_event_1 is None:
                     start_event_1 = torch.cuda.Event(enable_timing=True)
-                start_event_1.record()
+                stream_1 = torch.cuda.current_stream(self.device_id)
+                start_event_1.record(stream_1)
             else:
                 if start_event_2 is None:
                     start_event_2 = torch.cuda.Event(enable_timing=True)
-                start_event_2.record()
+                stream_2 = torch.cuda.current_stream(self.device_id)
+                start_event_2.record(stream_2)
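
Taken together, the three diffs amount to the following pattern. A self-contained sketch, assuming compute runs on a non-default stream (device index and workload are placeholders):

import torch

# Record timing events explicitly on the stream that runs the compute, so the
# measurement stays correct even if a different stream is current at record time.
device = torch.device("cuda:0")
compute_stream = torch.cuda.Stream(device=device)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.cuda.stream(compute_stream):
    # Capture the stream actually running the work and record events on it.
    stream = torch.cuda.current_stream(device)
    start.record(stream)
    a = torch.randn(512, 512, device=device)
    torch.matmul(a, a)
    end.record(stream)

end.synchronize()
print(f"device step time = {start.elapsed_time(end):.3f}ms")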
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

437-437: Split long nonlocal (Ruff E501) and add stream handles to scope.

This line exceeds 120 chars and is flagged by Ruff. Split it and add the newly introduced stream variables.

Apply this diff:

-            nonlocal it, enabled, start_time, start_event_1, end_event_1, start_event_2, end_event_2, prev_device_step_time
+            nonlocal it, enabled, start_time
+            nonlocal start_event_1, end_event_1, start_event_2, end_event_2
+            nonlocal prev_device_step_time, stream_1, stream_2

463-467: Format timing values consistently and avoid mixing types.

Right now prev_device_step_time gets turned into a string (or “N/A”), while host_step_time remains a float. Round consistently and keep explicit “_ms” variable names for clarity.

Apply this diff:

-                if prev_device_step_time is None:
-                    prev_device_step_time = "N/A"  # Handle first iteration
-                else:
-                    prev_device_step_time = f"{prev_device_step_time}ms"
-                host_step_time = (end_time - start_time) * 1000  # milliseconds
+                if prev_device_step_time is None:
+                    prev_device_step_time_str = "N/A"  # Handle first iteration
+                else:
+                    prev_device_step_time_str = f"{prev_device_step_time:.3f}ms"
+                host_step_time_ms = (end_time - start_time) * 1000.0  # milliseconds

476-477: Align log field names with units and use the formatted variables.

Follow-up to the previous change: use host_step_time_ms and prev_device_step_time_str, and format host time to 3 decimals.

Apply this diff:

-                    f"host_step_time = {host_step_time}ms, "
-                    f"prev_device_step_time = {prev_device_step_time}, "
+                    f"host_step_time = {host_step_time_ms:.3f}ms, "
+                    f"prev_device_step_time = {prev_device_step_time_str}, "

1436-1436: Fix long NVTX label (Ruff E501) and keep logging consistent with 1-based indexing.

Split the string to satisfy E501. Also, since the NVTX label moved to 1-based indexing, consider aligning the logged iteration field to avoid confusion.

Apply this diff:

-        @nvtx_range(
-            f"[Executor] _forward_step {self.model_engine.iter_counter + 1}: {len(scheduled_requests.context_requests)} ctx reqs, {len(scheduled_requests.generation_requests)} gen reqs"
-        )
+        @nvtx_range(
+            f"[Executor] _forward_step {self.model_engine.iter_counter + 1}: "
+            f"{len(scheduled_requests.context_requests)} ctx reqs, "
+            f"{len(scheduled_requests.generation_requests)} gen reqs"
+        )

Additionally (outside this hunk), update the logged iteration to match 1-based indexing:

# Suggested change near the logger.info(...) call:
# f"iter = {self.model_engine.iter_counter}, "
f"iter = {self.model_engine.iter_counter + 1}, "
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between f64ffcf and b7f1acb.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

437-437: Line too long (123 > 120)

(E501)


1436-1436: Line too long (185 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd (Collaborator): PR_Github #15317 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator): PR_Github #15317 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11561 completed with status: 'FAILURE'

@QiJune (Collaborator) left a comment

LGTM

kaiyux added 4 commits August 15, 2025 11:18
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
@kaiyux force-pushed the user/kaiyu/support_acc_iter_time branch from b7f1acb to 67c7540 on August 15, 2025 03:18
@kaiyux (Member, Author) commented Aug 15, 2025

/bot run --disable-fail-fast

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

451-463: Ensure CUDA events are recorded on the correct stream

The current implementation records CUDA events without specifying a stream, which defaults to the current stream. However, if forward/sampler operations use different streams, the timing measurements may be inaccurate. Based on code inspection, sampler.py obtains the current stream and other components may create/use non-default streams.

Capture the actual compute stream at iteration start and record events on that stream:

+        # Track the streams used to record timing events
+        stream_1 = None
+        stream_2 = None
 
         def profile_step():
-            nonlocal it, enabled, start_time, start_event_1, end_event_1, start_event_2, end_event_2, prev_device_step_time
+            nonlocal it, enabled, start_time
+            nonlocal start_event_1, end_event_1, start_event_2, end_event_2
+            nonlocal prev_device_step_time, stream_1, stream_2
 
             # ... existing code ...
             
             if start_time is not None and self.print_log and self.dist.rank == 0:
                 end_time = time.time()
                 if it % 2 == 0:
-                    end_event_1.record()
+                    # Record the end event on the same stream as its start
+                    end_event_1.record(stream_1)
                     if start_event_2 is not None:
                         end_event_2.synchronize()
                         prev_device_step_time = start_event_2.elapsed_time(
                             end_event_2)
                 else:
-                    end_event_2.record()
+                    end_event_2.record(stream_2)
                     if start_event_1 is not None:
                         end_event_1.synchronize()
                         prev_device_step_time = start_event_1.elapsed_time(
                             end_event_1)

493-500: Record start events on the correct stream

Similar to the end event recording, start events should be recorded on the actual compute stream to ensure accurate timing measurements.

             if it % 2 == 0:
                 if start_event_1 is None:
                     start_event_1 = torch.cuda.Event(enable_timing=True)
+                stream_1 = torch.cuda.current_stream(self.device_id)
-                start_event_1.record()
+                start_event_1.record(stream_1)
             else:
                 if start_event_2 is None:
                     start_event_2 = torch.cuda.Event(enable_timing=True)
+                stream_2 = torch.cuda.current_stream(self.device_id)
-                start_event_2.record()
+                start_event_2.record(stream_2)
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

438-438: Consider splitting the long line for better readability

Line 438 exceeds the 120-character limit. Consider splitting the nonlocal declaration across multiple lines.

-            nonlocal it, enabled, start_time, start_event_1, end_event_1, start_event_2, end_event_2, prev_device_step_time
+            nonlocal it, enabled, start_time
+            nonlocal start_event_1, end_event_1, start_event_2, end_event_2
+            nonlocal prev_device_step_time
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between b7f1acb and 67c7540.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

438-438: Line too long (123 > 120)

(E501)


1437-1437: Line too long (185 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

406-415: Add device-side timing measurement using CUDA events

The implementation correctly introduces CUDA event-based timing to measure device-side execution time, complementing the existing host-side timing. The ping-pong approach with two event pairs allows proper measurement even with overlapping operations.


464-468: Good handling of the first iteration case

The code correctly handles the first iteration where no previous timing is available by setting prev_device_step_time to "N/A", preventing any display issues in the logs.


469-481: Clear and informative profiling output format

The updated log format clearly distinguishes between host and device timing, making it easier to identify performance bottlenecks. The inclusion of both host_step_time and prev_device_step_time provides valuable insights for performance analysis.


1437-1437: Update NVTX range label to use 1-based iteration indexing

The change from self.model_engine.iter_counter to self.model_engine.iter_counter + 1 provides clearer 1-based iteration numbering in profiling output, which is more intuitive for users. The line exceeds 120 characters but is acceptable as it maintains readability of the complex string formatting.

@tensorrt-cicd (Collaborator): PR_Github #15385 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator): PR_Github #15385 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11597 completed with status: 'FAILURE'

@kaiyux (Member, Author) commented Aug 17, 2025

/bot run

@tensorrt-cicd (Collaborator): PR_Github #15524 [ run ] triggered by Bot

@kaiyux enabled auto-merge (squash) August 17, 2025 03:52
@tensorrt-cicd (Collaborator): PR_Github #15524 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11691 completed with status: 'FAILURE'

@kaiyux (Member, Author) commented Aug 18, 2025

/bot run

@tensorrt-cicd (Collaborator): PR_Github #15559 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator): PR_Github #15559 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11719 completed with status: 'SUCCESS'

@kaiyux merged commit e88cb92 into NVIDIA:main Aug 18, 2025
5 checks passed
@kaiyux deleted the user/kaiyu/support_acc_iter_time branch August 18, 2025 05:47