[None][fix] Revert "Revert "[None][feat] support attention dp for qwen3 dense model"" #7780
Conversation
|
/bot run |
|
PR_Github #18860 [ run ] triggered by Bot |
📝 Walkthrough

Adds DP-aware all-reduce gating to Qwen3 attention and MLP using AllReduceParams, driven by mapping and enable_attention_dp. Passes token distribution metadata into the MLP. Updates Qwen3Model to propagate mapping to Embedding. Introduces constructor changes for GatedMLP (overridden_tp_size) and Embedding (mapping).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant App as Caller
participant Model as Qwen3Model
participant Layer as Qwen3DecoderLayer
participant Attn as Attention
participant MLP as GatedMLP
App->>Model: forward(input_ids, mapping, enable_attention_dp)
Model->>Layer: forward(hidden_states, mapping, enable_attention_dp)
Note over Layer: Compute disable_allreduce from mapping.tp_size and enable_attention_dp
Layer->>Attn: forward(..., all_reduce_params.enable= !disable_allreduce)
Attn-->>Layer: attn_output, attn_metadata(all_rank_num_tokens)
Note over Layer: If DP enabled, override TP in MLP (overridden_tp_size=1)
Layer->>MLP: forward(..., all_rank_num_tokens, all_reduce_params.enable= !disable_allreduce, cutlass_min_latency_mode=False)
MLP-->>Layer: mlp_output
Layer-->>Model: layer_output
    Model-->>App: logits/outputs
```
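For orientation, a minimal sketch of the gating logic the diagram describes. AllReduceParams(enable_allreduce=...) follows the review diffs below; the import path is taken from the code graph in this review and may differ in the torch-flow modules, and the layer/metadata objects are simplified stand-ins rather than the PR's exact code:

```python
# Hedged sketch of the DP-aware all-reduce gating; not the PR's exact code.
from tensorrt_llm import functional


def build_all_reduce_params(mapping, enable_attention_dp):
    """Disable the TP all-reduce when tp_size == 1 or attention DP is enabled."""
    disable_allreduce = mapping.tp_size == 1 or enable_attention_dp
    return functional.AllReduceParams(enable_allreduce=not disable_allreduce)


def decoder_layer_forward(layer, hidden_states, attn_metadata, mapping,
                          enable_attention_dp):
    """Mirrors the sequence diagram: attention and MLP share one gating decision."""
    ar_params = build_all_reduce_params(mapping, enable_attention_dp)
    attn_out = layer.self_attn(hidden_states, attn_metadata,
                               all_reduce_params=ar_params)
    # Per-rank token counts let the MLP handle uneven batches under attention DP.
    return layer.mlp(attn_out,
                     all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
                     final_all_reduce_params=ar_params,
                     cutlass_min_latency_mode=False)
```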
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_qwen3.py (1)
1-2: Add the NVIDIA Apache-2.0 header (2025). Repository guidelines require the NVIDIA Apache-2.0 header on all .py files.
Apply at file top:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
```
🧹 Nitpick comments (3)
tensorrt_llm/_torch/models/modeling_qwen3.py (3)
86-88: Harden access to mapping.enable_attention_dp. If mapping comes from external config variants, the attribute may be missing; use a guarded lookup.
Apply this diff:
```diff
 self.mapping = model_config.mapping
-self.enable_attention_dp = self.mapping.enable_attention_dp
+self.enable_attention_dp = bool(getattr(self.mapping, "enable_attention_dp", False))
```
109-110: Use positive logic for all-reduce gating for clarity; semantics unchanged. This avoids double negation at call sites.
Apply this diff:
```diff
-        self.disable_allreduce = (self.mapping.tp_size == 1
-                                  or self.enable_attention_dp)
+        self.enable_allreduce = (self.mapping.tp_size > 1
+                                 and not self.enable_attention_dp)
@@
-            all_reduce_params=AllReduceParams(
-                enable_allreduce=not self.disable_allreduce),
+            all_reduce_params=AllReduceParams(
+                enable_allreduce=self.enable_allreduce),
@@
-            final_all_reduce_params=AllReduceParams(
-                enable_allreduce=not self.disable_allreduce),
+            final_all_reduce_params=AllReduceParams(
+                enable_allreduce=self.enable_allreduce),
```

Also applies to: 135-136, 147-147
135-136: Avoid per-forward AllReduceParams allocations. AllReduce enablement is a per-layer constant; pre-create the params once and reuse them to reduce Python object churn.
Apply in __init__ (near line 109):

```diff
+        self._ar_params = AllReduceParams(enable_allreduce=self.enable_allreduce)
```

Then update the calls:
```diff
-            all_reduce_params=AllReduceParams(
-                enable_allreduce=self.enable_allreduce),
+            all_reduce_params=self._ar_params,
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/models/modeling_qwen3.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/models/modeling_qwen3.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/models/modeling_qwen3.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/models/modeling_qwen3.py
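As a quick illustration of the Python guidelines quoted above (purely illustrative names, not code from this PR):

```python
"""Illustrative snippet for the naming and docstring conventions quoted above."""

MY_CONSTANT = 42        # Constants: upper SNAKE_CASE.
G_MY_GLOBAL = "cuda"    # Globals: upper SNAKE_CASE prefixed with 'G'.


class TokenCounter:      # Classes: PascalCase.
    """Tracks a running token count (Google-style docstring).

    Attributes:
        num_tokens: Total number of tokens seen so far.
    """

    def __init__(self):
        # Initialize all externally visible members in the constructor.
        self.num_tokens = 0

    def add_tokens(self, num_new_tokens):
        """Adds num_new_tokens to the running total."""
        self.num_tokens += num_new_tokens
```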
🧠 Learnings (2)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/models/modeling_qwen3.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/models/modeling_qwen3.py (1)
tensorrt_llm/functional.py (1)
AllReduceParams(3900-3939)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_qwen3.py (2)
143-149: Validate MLP call surface (params existence and types).
- Ensure AttentionMetadata exposes all_rank_num_tokens on all code paths (check tensorrt_llm/_torch/attention_backend/interface.py and where pyexecutor sets attn_metadata.all_rank_num_tokens).
- Confirm the MLP implementation invoked by modeling_qwen3 (self.mlp) accepts the named args final_all_reduce_params and cutlass_min_latency_mode (or accepts **kwargs). If not, either rename the call to the MLP’s expected parameter (e.g., all_reduce_params) or update GatedMLP.forward to accept those params. Location: tensorrt_llm/_torch/models/modeling_qwen3.py:143-149; candidate implementation: tensorrt_llm/_torch/modules/gated_mlp.py.
- Benchmark impact of cutlass_min_latency_mode=False on small-batch inference before changing the default.
98-98: Confirmed: modules.GatedMLP accepts overridden_tp_size; the modeling_qwen3 call is updated. tensorrt_llm/_torch/modules/gated_mlp.py's __init__ includes overridden_tp_size, and tensorrt_llm/_torch/models/modeling_qwen3.py passes overridden_tp_size=1 if self.enable_attention_dp else None. Note: a separate GatedMLP exists at tensorrt_llm/layers/mlp.py that does NOT have overridden_tp_size; ensure callers importing that module are not using this kwarg.
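To make the overridden_tp_size decision above concrete, a hedged sketch; GatedMLP's full constructor signature is not shown in this review, so only the relevant keyword is illustrated and the surrounding call is marked as hypothetical:

```python
# Hedged sketch of the TP override discussed above; other GatedMLP constructor
# arguments are omitted because they are not shown in this review.
from typing import Optional


def mlp_tp_override(enable_attention_dp: bool) -> Optional[int]:
    """Collapse the MLP's TP group to a single rank when attention DP is enabled."""
    return 1 if enable_attention_dp else None


# Illustrative use inside the decoder-layer constructor (arguments elided):
#     self.mlp = GatedMLP(..., overridden_tp_size=mlp_tp_override(self.enable_attention_dp))
```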
|
PR_Github #18860 [ run ] completed with state |
|
/bot run |
|
PR_Github #18893 [ run ] triggered by Bot |
|
PR_Github #18893 [ run ] completed with state |
f85ccd1 to c2ea528 (Compare)
|
/bot run |
|
PR_Github #18954 [ run ] triggered by Bot |
|
PR_Github #18954 [ run ] completed with state |
|
/bot run |
|
PR_Github #19023 [ run ] triggered by Bot |
|
PR_Github #19023 [ run ] completed with state |
…el" (NVIDIA#7765)" This reverts commit 8226ef2. Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
c2ea528 to b7b9904 (Compare)
|
/bot run |
|
PR_Github #19090 [ run ] triggered by Bot |
|
PR_Github #19090 [ run ] completed with state |
QiJune
left a comment
LGTM
…n3 dense model"" (NVIDIA#7780) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Reverts #7765
Summary by CodeRabbit
New Features
Refactor