
Conversation

@h-guo18
Collaborator

@h-guo18 h-guo18 commented Oct 13, 2025

Summary by CodeRabbit

  • New Features

    • Unified attention export pathway that routes to the optimized custom attention op.
    • Automatic remapping of attention keyword arguments for smoother compatibility.
  • Refactor

    • Centralized attention-argument mapping reused across components for consistency.
    • Removed legacy runtime attention patching in favor of an export-time mechanism, reducing intrusive overrides.

Description

Added an export patch that replaces hf.attention_interface directly with the unified autodeploy::torch_attention op, mapping keywords such as sliding_window, scaling, and sink during patching.

This allows us to support the attention keywords above generically across different models.
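
For illustration, the remapping idea can be sketched as below. This is a minimal, hypothetical sketch rather than the actual implementation: the names HF_ATTN_KWARGS_MAPPING, torch_attention_hf_wrapper, and torch.ops.auto_deploy.torch_attention appear in this PR, but the exact signatures, mapping keys, and layout handling shown here are simplifying assumptions.

import torch

# Shared HF -> custom-op kwarg name mapping (the real constant lives in
# custom_ops/torch_attention.py; the keys below are illustrative only).
HF_ATTN_KWARGS_MAPPING = {
    "scaling": "scale",
    "sliding_window": "sliding_window",
    "s_aux": "sinks",  # assumption: HF-side kwarg carrying attention sinks
    "dropout": "dropout_p",
}

def torch_attention_hf_wrapper(module, query, key, value, attention_mask, **hf_kwargs):
    """Simplified stand-in for the wrapper registered in ALL_ATTENTION_FUNCTIONS."""
    # HF hands over (batch, num_heads, seq, head_dim); the custom op is called with a
    # "bsnd" layout here, so swap the head and sequence dimensions first.
    q, k, v = (t.transpose(1, 2).contiguous() for t in (query, key, value))
    # Translate only the HF kwargs the custom op understands, renaming as needed.
    op_kwargs = {
        HF_ATTN_KWARGS_MAPPING[name]: val
        for name, val in hf_kwargs.items()
        if name in HF_ATTN_KWARGS_MAPPING and val is not None
    }
    out = torch.ops.auto_deploy.torch_attention(
        q, k, v, attention_mask, layout="bsnd", **op_kwargs
    )
    # HF attention functions return (attn_output, attn_weights); weights are unused here.
    return out, None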

Test Coverage

Build_and_run

Tested on unsloth/gpt-oss-20b-BF16 and meta-llama/Llama-3.1-8B-Instruct; output is identical to upstream/main, and num_matches=0 in the match_eager_attention and match_grouped_attention transforms indicates that attention pattern matching is bypassed:

  • python examples/auto_deploy/build_and_run_ad.py --model unsloth/gpt-oss-20b-BF16 --args.world_size 0 --args.runtime "demollm" --args.attn_backend torch --args.compile_backend torch-simple
...
[10/13/2025-06:06:38] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [stage=pattern_matcher, transform=match_eager_attention] enabled=True, num_matches=0, is_clean=True, has_valid_shapes=False
...
[10/13/2025-06:06:39] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [stage=pattern_matcher, transform=match_grouped_attention] enabled=True, num_matches=0, is_clean=True, has_valid_shapes=False
...
[10/13/2025-06:00:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 0] How big is the universe? : 10^10 ly? 10^12 ly? 10^13 ly? 10^14 ly? 10^15 ly? 10^16 ly? 10^17 ly? 10^18 ly? 10^19 ly? 10^20 ly? 10^21 ly? 10^22 ly? 10^23 ly? 10^24 ly? 10^25 ly? 10^26 ly? 10^27 ly
[10/13/2025-06:00:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: : 1) The 3rd law of physics?

Gravity is a force that attracts objects with mass towards each other, following the principle that for every action, there is an equal and opposite reaction.

The 3rd law of

The third law of physics states that for every action, there is an equal and opposite reaction.

The 3

The third law of physics states that for every action, there is an equal and opposite reaction.

The third law of physics states that for
[10/13/2025-06:00:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Destroying process group

NOTE: The output is strange, but it is identical to the result on upstream/main.

  • python examples/auto_deploy/build_and_run_ad.py --model meta-llama/Llama-3.1-8B-Instruct --args.world_size 0 --args.runtime "demollm"
...
[10/13/2025-06:08:46] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [stage=pattern_matcher, transform=match_eager_attention] enabled=True, num_matches=0, is_clean=True, has_valid_shapes=False
...
[10/13/2025-06:08:47] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [stage=pattern_matcher, transform=match_grouped_attention] enabled=True, num_matches=0, is_clean=True, has_valid_shapes=False
...
[10/13/2025-06:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 0] How big is the universe? :  The short answer is, we don’t know.  The universe is so vast that it is difficult to comprehend its size.  However, we can try to put it into perspective by looking at some of the numbers that have been estimated by scientists.
The observable universe is estimated to be around 93 billion light-years in diameter.  A light-year is the distance light travels in one year, which is about 6 trillion miles (9.7 trillion kilometers).  So,
[10/13/2025-06:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is a force that pulls objects towards each other, with the strength of the force depending on the mass of the objects and the distance between them.
What is the concept of gravity?
Gravity is a fundamental force of nature that causes objects with mass to attract each other. The strength of the gravitational force depends on the mass of the objects and the distance between them. The more massive the objects and the closer they are to each other, the stronger the gravitational force between them.
What is the definition
[10/13/2025-06:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Destroying process group

Unit test

The only failing cases are the same ones that fail on upstream/main:

FAILED tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3.py::test_get_extra_inputs_includes_image_sizes - AssertionError: assert mappingproxy(...doc__': None}) == mappingproxy(...doc__': None})
FAILED tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_patches.py::test_build_run_mistral3_vlm - torch._dynamo.exc.UncapturedHigherOrderOpError: Cond doesn't work unless it is captured completely with torch.compile. Scroll up to fi...
FAILED tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py::test_build_ad[mistralai/Mistral-Small-3.1-24B-Instruct-2503-graph] - torch._dynamo.exc.UncapturedHigherOrderOpError: Cond doesn't work unless it is captured completely with torch.compile. Scroll up to fi...

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
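
For example, typical invocations (posted as PR comments) built from the options documented above could look like the following; the stage and GPU names are just the examples used in the help text:

/bot run
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
/bot run --gpu-type "H100_PCIe" --post-merge
/bot skip --comment "Docs-only change; no functional code touched"
/bot reuse-pipeline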

@h-guo18 h-guo18 self-assigned this Oct 13, 2025
@h-guo18 h-guo18 requested a review from lucaslie October 13, 2025 06:21
@h-guo18 h-guo18 force-pushed the haoguo/attn_export_patch branch from cec0648 to 9bb892d Compare October 13, 2025 06:39
@h-guo18 h-guo18 marked this pull request as ready for review October 13, 2025 06:39
@h-guo18 h-guo18 requested a review from a team as a code owner October 13, 2025 06:39
@coderabbitai
Contributor

coderabbitai bot commented Oct 13, 2025

📝 Walkthrough

Walkthrough

Introduces a unified attention export path by registering a Torch-op-backed wrapper and patching te.export to select it. Centralizes HF-to-operator kwargs mapping in a shared constant. Removes legacy GPT-OSS attention patching. Updates a KV-cache transformer to use the shared mapping. Adjusts a unit test skip condition grouping.

Changes

  • Unified attention export patch (tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py): Added torch_attention_hf_wrapper that adapts HF tensors/kwargs and calls auto_deploy.torch_attention. Added UnifiedAttnPatch to register wrapper in ALL_ATTENTION_FUNCTIONS, patch te.export to inject _attn_implementation, and provide apply/revert lifecycle.
  • Shared HF→op kwargs mapping (tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py, tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py): Introduced HF_ATTN_KWARGS_MAPPING dict for attention kwargs translation. Replaced hard-coded mapping logic in KV-cache transformer with iteration over the shared mapping. No functional changes to the custom op.
  • Remove GPT-OSS attention patching (tensorrt_llm/_torch/auto_deploy/models/patches/gptoss.py): Deleted custom GPT-OSS attention forward, model construction patching, and override of AutoModelForCausalLM.from_config. Removed related helpers and registries.
  • Test condition restructuring (tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py): Grouped the skip predicate so the mode == "transformers" condition uniformly gates all model checks, altering test skip control flow.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant te as torch.export (te.export)
  participant Patch as UnifiedAttnPatch
  participant Reg as ALL_ATTENTION_FUNCTIONS
  participant Model as HF Model
  participant Wrap as torch_attention_hf_wrapper
  participant Op as auto_deploy.torch_attention

  rect rgba(220,235,255,0.35)
    note over Patch,Reg: Patch application (new)
    User->>Patch: apply()
    Patch->>Reg: Reg["ad_unified_attn"] = Wrap
    Patch->>te: patch te.export -> _export_with_unified_attn (sets config._attn_implementation)
  end

  User->>te: export(Model, ...)
  te->>Model: run with _attn_implementation="ad_unified_attn"
  Model->>Wrap: attention(q,k,v, attention_mask, **hf_kwargs)
  note over Wrap: Reorder dims to "bsnd"<br/>Map kwargs via HF_ATTN_KWARGS_MAPPING
  Wrap->>Op: torch_attention(q',k',v', mask, layout="bsnd", **mapped_kwargs)
  Op-->>Wrap: output
  Wrap-->>Model: output (restored layout)
  Model-->>te: exported program

  rect rgba(255,230,230,0.35)
    note over Patch,Reg: Revert (on cleanup)
    User->>Patch: revert()
    Patch->>Reg: remove "ad_unified_attn"
    Patch->>te: restore original te.export
  end
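
Read as code, the apply/revert lifecycle above corresponds roughly to the following sketch. The real UnifiedAttnPatch derives from BaseExportPatch (export/interface.py) and is registered via ExportPatchRegistry; the import paths, method bodies, and the reuse of torch_attention_hf_wrapper (sketched in the Description above) are assumptions for illustration only.

import torch.export as te
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS  # assumed import path

class UnifiedAttnPatch:  # the real class subclasses BaseExportPatch
    """Registers the unified attention wrapper and patches te.export for its lifetime."""

    def apply(self):
        # Make the wrapper selectable through HF's attention-function registry.
        ALL_ATTENTION_FUNCTIONS["ad_unified_attn"] = torch_attention_hf_wrapper
        self._original_export = te.export

        def _export_with_unified_attn(model, *args, **kwargs):
            # Route the model's attention to the registered wrapper during export.
            model.config._attn_implementation = "ad_unified_attn"
            return self._original_export(model, *args, **kwargs)

        te.export = _export_with_unified_attn

    def revert(self):
        # Restore the registry and the original export function on cleanup.
        ALL_ATTENTION_FUNCTIONS.pop("ad_unified_attn", None)
        te.export = self._original_export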

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed: The title succinctly captures the primary change of replacing the unified HF attention mechanism during export and follows the required [ticket][type] Summary format, making it clear and specific enough for a teammate scanning history to understand the core update.
  • Description Check ✅ Passed: The pull request description includes all required sections from the template. The PR title follows the specified format with [#4585][feat], clearly identifying the GitHub issue reference and feature type. The Description section concisely explains the change (adding an export patch to replace hf.attention_interface with autodeploy::torch_attention) and the rationale (supporting attention keywords across different models). The Test Coverage section is comprehensive, providing specific build_and_run test commands for two models with corresponding log outputs demonstrating matching behavior. The PR Checklist is included with the confirmation checkbox marked as completed. While the description could be slightly more detailed in some areas, all required elements are present and substantially filled out.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a90449 and 9bb892d.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/gptoss.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1 hunks)
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/auto_deploy/models/patches/gptoss.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
  • torch_attention (109-225)
tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
  • torch_attention (109-225)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
  • BaseExportPatch (38-130)
  • ExportPatchRegistry (166-213)
🪛 Ruff (0.13.3)
tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py

14-14: Unused function argument: self

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@coderabbitai
Contributor

coderabbitai bot commented Oct 13, 2025

📝 Walkthrough

Walkthrough

Introduces a centralized HF-to-custom-op kwargs mapping, adds a unified export-time attention wrapper and patch that reroutes HuggingFace attention to a custom op, removes the prior GPT OSS patching module, updates a transformer-side KV cache transform to use the new mapping, and adjusts a unit test’s skip condition logic.

Changes

  • Centralized HF→custom-op kwargs mapping (tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py, tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py): Adds module-level HF_ATTN_KWARGS_MAPPING; refactors KV-cache transformer to consume this mapping instead of a local dict when building op kwargs.
  • Unified export-time attention patch (tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py): Adds torch_attention_hf_wrapper that reorders Q/K/V, remaps kwargs, and calls torch.ops.auto_deploy.torch_attention. Introduces UnifiedAttnPatch to register wrapper and monkey-patch te.export based on model config.
  • Removal of GPT OSS patching (tensorrt_llm/_torch/auto_deploy/models/patches/gptoss.py): Removes entire GPT OSS attention patch module, including forward wrapper, registry, and reassignment of AutoModelForCausalLM.from_config.
  • Unit test skip logic adjustment (tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py): Consolidates skip condition to only trigger in "transformers" mode for listed models; removes mode-agnostic skip paths due to operator precedence.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User as User
  participant Export as te.export (patched)
  participant Model as HF Model
  participant AttnSel as ALL_ATTENTION_FUNCTIONS
  participant Wrapper as torch_attention_hf_wrapper
  participant Op as torch.ops.auto_deploy.torch_attention

  User->>Export: export(Model)
  rect rgba(220,235,245,0.4)
    note right of Export: Patched export path when model.config._attn_implementation is set
    Export->>AttnSel: register ad_unified_attn
    Export->>Model: run export with patched attention
    Model->>Wrapper: attention(query, key, value, mask, **hf_kwargs)
    Wrapper->>Wrapper: reorder Q/K/V<br/>remap kwargs via HF_ATTN_KWARGS_MAPPING
    Wrapper->>Op: torch_attention(q,k,v, mask, **mapped_kwargs)
    Op-->>Wrapper: output
    Wrapper-->>Model: output
  end
  Model-->>Export: graph
  Export-->>User: exported artifact

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed: The title “[#4585][feat] Replace unified attention before export” follows the repository’s ticket format and succinctly describes the primary change of swapping out the unified attention interface prior to export.
  • Description Check ✅ Passed: The pull request description includes the required “## Description”, “## Test Coverage”, and “## PR Checklist” sections and clearly explains the change, outlines relevant test results, and lists the failing tests matching upstream. It adheres to the repository template and provides sufficient context for reviewers.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1)

1-1: Add NVIDIA Apache-2.0 header at file top.

Per repo guidelines, prepend the copyright header.

As per coding guidelines

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py (1)

1-1: Add NVIDIA Apache-2.0 header at file top.

Please prepend the required header.

As per coding guidelines

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)

1-1: Add NVIDIA Apache-2.0 header at file top.

Please prepend the required header.

As per coding guidelines

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

18-31: Boolean attention mask conversion is incorrect (semantic bug).

In PyTorch SDPA, boolean masks use True to indicate positions to mask out. Current conversion maps True→1.0, False→-inf, biasing allowed positions and inverting semantics.

Apply:

-def _convert_boolean_mask_to_float(attn_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
-    """Convert boolean attention mask to floating point mask.
-    Args:
-        attn_mask: Boolean tensor where True allows attention, False blocks it
-        dtype: Target dtype for the output mask
-    Returns:
-        Floating point mask where True -> 1.0, False -> -inf
-    """
-    if attn_mask.dtype == torch.bool:
-        float_mask = torch.zeros_like(attn_mask, dtype=dtype)
-        float_mask = float_mask.masked_fill(attn_mask, 1.0)  # True -> 1.0
-        float_mask = float_mask.masked_fill(~attn_mask, float("-inf"))  # False -> -inf
-        return float_mask
-    return attn_mask
+def _convert_boolean_mask_to_float(attn_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
+    """Convert boolean attention mask (True = masked) to additive float mask.
+    Returns 0.0 for allowed positions and -inf for masked positions."""
+    if attn_mask.dtype == torch.bool:
+        float_mask = torch.zeros_like(attn_mask, dtype=dtype)  # allowed -> 0.0
+        float_mask = float_mask.masked_fill(attn_mask, float("-inf"))  # True (masked) -> -inf
+        return float_mask
+    return attn_mask

This matches F.scaled_dot_product_attention semantics.
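
For reference, a float attn_mask passed to F.scaled_dot_product_attention is added to the attention scores, so 0.0 keeps a position and -inf masks it; a tiny self-contained check (shapes are arbitrary):

import torch
import torch.nn.functional as F

q = torch.randn(1, 2, 4, 8)  # (batch, heads, seq_q, head_dim)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)

# Additive float mask: 0.0 = attend, -inf = masked out.
float_mask = torch.zeros(1, 1, 4, 4)
float_mask[..., 2:] = float("-inf")  # hide the last two key positions from every query

out = F.scaled_dot_product_attention(q, k, v, attn_mask=float_mask)
print(out.shape)  # torch.Size([1, 2, 4, 8])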

🧹 Nitpick comments (3)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1)

106-111: Skip logic looks correct for transformers mode.

Condition now only triggers for transformers and specified models; good.

Optionally, tighten readability:

-    if mode == "transformers" and (
-        "DeepSeek-V3" in experiment_config["args"]["model"]
-        or "Phi-3-mini-4k-instruct" in experiment_config["args"]["model"]
-        or "NVIDIA-Nemotron-Nano-12B-v2" in experiment_config["args"]["model"]
-    ):
+    blocked = ("DeepSeek-V3", "Phi-3-mini-4k-instruct", "NVIDIA-Nemotron-Nano-12B-v2")
+    if mode == "transformers" and any(b in experiment_config["args"]["model"] for b in blocked):
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)

94-105: HF kwargs mapping: LGTM; consider making immutable.

Mapping covers common keys (dropout, scaling/scale, sinks, sliding_window, logit_cap, is_causal).

Optionally wrap with MappingProxyType for immutability:

+from types import MappingProxyType
-HF_ATTN_KWARGS_MAPPING = {
+HF_ATTN_KWARGS_MAPPING = MappingProxyType({
     ...
-}
+})
tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py (1)

13-21: Prefix unused module arg to satisfy linter and clarify intent.

Rename self to _module to match HF signature and silence ARG001.

-def torch_attention_hf_wrapper(
-    self: torch.nn.Module,
+def torch_attention_hf_wrapper(
+    _module: torch.nn.Module,
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a90449 and 9bb892d.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/gptoss.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1 hunks)
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/auto_deploy/models/patches/gptoss.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
  • torch_attention (109-225)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
  • BaseExportPatch (38-130)
  • ExportPatchRegistry (166-213)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
  • torch_attention (109-225)
🪛 Ruff (0.13.3)
tensorrt_llm/_torch/auto_deploy/export/library/unified_attn.py

14-14: Unused function argument: self

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Check PR Checklist Resolution
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache_transformers.py (1)

14-15: Centralized kwargs mapping: LGTM.

Importing and iterating HF_ATTN_KWARGS_MAPPING removes duplication and reduces drift; good alignment with export wrapper.

Ensure HF_ATTN_KWARGS_MAPPING stays in sync with any future HF attention_interface keys.

Also applies to: 43-46

Member

@lucaslie lucaslie left a comment


looks good. Just a few minor comments. Please slack me when those are resolved and I can kick off a dashboard run

@h-guo18 h-guo18 force-pushed the haoguo/attn_export_patch branch from 3d2a36c to a280014 Compare October 23, 2025 05:05
@h-guo18
Collaborator Author

h-guo18 commented Oct 23, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #22243 [ run ] triggered by Bot. Commit: a280014

@tensorrt-cicd
Collaborator

PR_Github #22243 [ run ] completed with state SUCCESS. Commit: a280014
/LLM/main/L0_MergeRequest_PR pipeline #16770 completed with status: 'FAILURE'

@h-guo18 h-guo18 requested a review from lucaslie October 23, 2025 18:41
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
@h-guo18 h-guo18 force-pushed the haoguo/attn_export_patch branch from a280014 to d42f58b Compare October 23, 2025 19:00
@lucaslie lucaslie enabled auto-merge (squash) October 23, 2025 19:10
@h-guo18
Collaborator Author

h-guo18 commented Oct 23, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #22317 [ run ] triggered by Bot. Commit: d42f58b

@tensorrt-cicd
Collaborator

PR_Github #22317 [ run ] completed with state SUCCESS. Commit: d42f58b
/LLM/main/L0_MergeRequest_PR pipeline #16825 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@lucaslie lucaslie merged commit 2392022 into NVIDIA:main Oct 23, 2025
5 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Oct 24, 2025

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>