[TRTLLM-8957][feat] create communication related classes #8968
Conversation
/bot run
PR_Github #23724 [ run ] triggered by Bot. Commit:
📝 Walkthrough
Introduces a new MoE communication strategy module with a base interface, five concrete implementations (AllGatherReduceScatter, MnnvlLatency, MNNVLThroughput, DeepEP, DeepEPLowLatency), a factory for runtime strategy selection, and a package initializer exposing the public API.
Changes
Sequence Diagram
sequenceDiagram
    participant Factory
    participant Strategy as Communication Strategy
    participant Dispatch as Dispatch Phase
    participant Combine as Combine Phase
    Factory->>Factory: Evaluate environment & mapping<br/>(topology, dtype, flags)
    Factory->>Strategy: Create strategy instance<br/>(AllGather/MNNVL/DeepEP)
    Strategy->>Dispatch: prepare_dispatch()<br/>(optional metadata prep)
    Dispatch->>Dispatch: Store alltoall_info or<br/>dispatch state
    Dispatch-->>Strategy: Return local statistics
    Strategy->>Dispatch: dispatch()<br/>(hidden_states, tokens, scales)
    Dispatch->>Dispatch: Execute AllGather/<br/>MNNVL/DeepEP<br/>operation
    Dispatch->>Dispatch: Store dispatch_state<br/>(sizes, handles, etc.)
    Dispatch-->>Strategy: Return dispatched tensors
    Strategy->>Combine: combine()<br/>(final_hidden_states)
    Combine->>Combine: Read dispatch_state<br/>Reverse communication<br/>(ReduceScatter/Combine)
    Combine-->>Strategy: Return final output
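To make the diagram concrete, here is a minimal sketch of the base strategy interface it implies. The class and method names (Communication, is_supported, can_use, supports_post_quant_dispatch, prepare_dispatch, dispatch, combine) are taken from the review comments below; the signatures, defaults, and docstrings are assumptions for illustration, not the PR's actual code.

```python
from abc import ABC, abstractmethod


class Communication(ABC):
    """Sketch of the MoE communication strategy interface (signatures are guesses)."""

    @staticmethod
    def is_supported(mapping) -> bool:
        """Static check: can this strategy run at all on the current topology/build?"""
        return True

    def can_use(self, all_rank_num_tokens, num_chunks) -> bool:
        """Per-iteration check, e.g. token limits or chunking constraints."""
        return True

    def supports_post_quant_dispatch(self) -> bool:
        """Whether already-quantized tensors (e.g. NVFP4/FP8) can be dispatched directly."""
        return False

    def prepare_dispatch(self, token_selected_slots, all_rank_num_tokens,
                         local_statistic_tensor):
        """Optional metadata preparation (e.g. alltoall_info); returns local statistics."""
        return None

    @abstractmethod
    def dispatch(self, hidden_states, **kwargs):
        """Send tokens (and scales) to the ranks owning their experts; store dispatch state."""

    @abstractmethod
    def combine(self, final_hidden_states, **kwargs):
        """Reverse the dispatch using the stored state and reduce partial expert outputs."""
```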
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Areas requiring extra attention:
Pre-merge checks and finishing touches
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 11
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (1)
129-170: Post-quant dispatch logic is correct; consider marking unused variable.
The post-quantization dispatch correctly handles NVFP4 with dtype adapters for hidden_states_sf, applies padding, and stores state. The TODO comment on line 134 appropriately notes the adapter is temporary. Consider marking the unused unpacked variable with an underscore to indicate it's intentionally ignored:

  (
      (hidden_states, hidden_states_sf),
      recv_topk_idx,
      token_final_scales,
-     num_recv_tokens_per_expert_list,
+     _,  # num_recv_tokens_per_expert_list (unused)
      deep_ep_handle,
  ) = self.deep_ep_buffer.dispatch(
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/base.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
tensorrt_llm/_torch/modules/fused_moe/communication/base.py
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
tensorrt_llm/_torch/modules/fused_moe/communication/base.py
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
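Purely as an illustration of several of the Python conventions listed above (Google-style docstring, snake_case naming, an upper SNAKE_CASE constant, and a narrow try/except with the main logic in else); the function and names are made up for this example.

```python
MAX_RETRIES = 3  # Constant: upper SNAKE_CASE.


def parse_token_count(raw_value: str) -> int:
    """Parse a token count from a string.

    Args:
        raw_value: The raw string value, e.g. "128".

    Returns:
        The parsed, non-negative token count (0 if the input is not an integer).
    """
    try:
        token_count = int(raw_value)  # Keep the try body minimal.
    except ValueError:
        return 0
    else:
        return max(token_count, 0)  # Main logic lives in the else branch.
```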
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
tensorrt_llm/_torch/modules/fused_moe/communication/base.py
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
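A minimal sketch of the disjoint-tag idea captured in this learning, written directly against mpi4py; the tag values and payloads are illustrative only.

```python
from mpi4py import MPI

# Disjoint tag namespaces: tokens, logits, and termination syncs exchanged between
# the same source/destination pair can never be routed to the wrong receive call.
TOKENS_TAG = 100
LOGITS_TAG = 200
TERMINATION_TAG = 300

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    comm.send([1, 2, 3], dest=1, tag=TOKENS_TAG)
    comm.send(None, dest=1, tag=TERMINATION_TAG)
elif comm.Get_rank() == 1:
    tokens = comm.recv(source=0, tag=TOKENS_TAG)   # Matches only the tokens message.
    comm.recv(source=0, tag=TERMINATION_TAG)       # Matches only the termination sync.
```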
🧬 Code graph analysis (8)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (5)
    tensorrt_llm/mapping.py (2)
        Mapping (336-493)
        moe_ep_rank (205-206)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (5)
        Communication (20-167)
        supports_post_quant_dispatch (63-73)
        can_use (55-61)
        dispatch (103-142)
        combine (145-167)
    tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (5)
        get_low_latency_buffer (234-244)
        low_latency_dispatch (142-158)
        low_latency_dispatch_fp4 (172-183)
        low_latency_combine_low_precision (185-200)
        low_latency_combine (160-170)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (5)
        is_supported (60-68)
        supports_post_quant_dispatch (70-76)
        can_use (78-82)
        dispatch (84-170)
        combine (172-188)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (9)
    tensorrt_llm/_utils.py (1)
        local_mpi_size (557-558)
    tensorrt_llm/mapping.py (1)
        Mapping (336-493)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (1)
        AllGatherReduceScatter (21-92)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (1)
        Communication (20-167)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (2)
        DeepEP (20-226)
        is_supported (60-68)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (2)
        DeepEPLowLatency (20-344)
        is_supported (69-79)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (2)
        MnnvlLatency (23-187)
        is_supported (65-69)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (2)
        MNNVLThroughput (24-370)
        is_supported (156-160)
tensorrt_llm/_torch/modules/fused_moe/communication/base.py (7)
    tensorrt_llm/mapping.py (1)
        moe_ep_rank (205-206)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (3)
        can_use (47-49)
        dispatch (51-77)
        combine (79-92)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (4)
        can_use (78-82)
        supports_post_quant_dispatch (70-76)
        dispatch (84-170)
        combine (172-188)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (4)
        can_use (90-97)
        supports_post_quant_dispatch (81-88)
        dispatch (99-236)
        combine (238-286)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (5)
        can_use (77-79)
        supports_post_quant_dispatch (71-75)
        prepare_dispatch (81-111)
        dispatch (113-160)
        combine (162-187)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (4)
        can_use (168-170)
        supports_post_quant_dispatch (162-166)
        dispatch (172-266)
        combine (268-336)
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py (7)
    tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (1)
        AllGatherReduceScatter (21-92)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (1)
        Communication (20-167)
    tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (1)
        CommunicationFactory (56-315)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (1)
        DeepEP (20-226)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (1)
        DeepEPLowLatency (20-344)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (1)
        MnnvlLatency (23-187)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (1)
        MNNVLThroughput (24-370)
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (3)
    tensorrt_llm/mapping.py (1)
        Mapping (336-493)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (4)
        Communication (20-167)
        can_use (55-61)
        dispatch (103-142)
        combine (145-167)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (4)
    tensorrt_llm/_mnnvl_utils.py (3)
        MnnvlMemory (53-338)
        as_torch_strided_tensor (84-88)
        supports_mnnvl (332-338)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (5)
        Communication (20-167)
        supports_post_quant_dispatch (63-73)
        can_use (55-61)
        dispatch (103-142)
        combine (145-167)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (5)
        is_supported (65-69)
        supports_post_quant_dispatch (71-75)
        can_use (77-79)
        dispatch (113-160)
        combine (162-187)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (2)
    tensorrt_llm/_mnnvl_utils.py (8)
        MnnvlMemory (53-338)
        MnnvlMoe (352-624)
        get_moe_workspaces (360-376)
        get_moe_prepare_workspace (379-390)
        supports_mnnvl (332-338)
        mnnvl_moe_alltoallv_prepare_without_allgather (402-446)
        mnnvl_moe_alltoallv (531-592)
        mnnvl_moe_alltoallv_combine (595-624)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (4)
        Communication (20-167)
        supports_post_quant_dispatch (63-73)
        dispatch (103-142)
        combine (145-167)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (4)
    tensorrt_llm/mapping.py (1)
        Mapping (336-493)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (4)
        supports_post_quant_dispatch (63-73)
        can_use (55-61)
        dispatch (103-142)
        combine (145-167)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (5)
        is_supported (69-79)
        supports_post_quant_dispatch (81-88)
        can_use (90-97)
        dispatch (99-236)
        combine (238-286)
🪛 Ruff (0.14.3)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
69-69: Unused static method argument: mapping
(ARG004)
106-106: Unused method argument: use_dp_padding
(ARG002)
109-109: Unused method argument: kwargs
(ARG002)
220-220: Avoid specifying long messages outside the exception class
(TRY003)
241-241: Unused method argument: payload_in_workspace
(ARG002)
242-242: Unused method argument: alltoall_result_do_sum
(ARG002)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
315-315: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/modules/fused_moe/communication/base.py
77-77: Unused method argument: token_selected_slots
(ARG002)
78-78: Unused method argument: all_rank_num_tokens
(ARG002)
79-79: Unused method argument: local_statistic_tensor
(ARG002)
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
25-37: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
47-47: Unused method argument: all_rank_num_tokens
(ARG002)
47-47: Unused method argument: num_chunks
(ARG002)
59-59: Unused method argument: post_quant_dispatch
(ARG002)
60-60: Unused method argument: kwargs
(ARG002)
82-82: Unused method argument: payload_in_workspace
(ARG002)
83-83: Unused method argument: alltoall_result_do_sum
(ARG002)
84-84: Unused method argument: kwargs
(ARG002)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
168-168: Unused method argument: all_rank_num_tokens
(ARG002)
168-168: Unused method argument: num_chunks
(ARG002)
178-178: Unused method argument: all_rank_num_tokens
(ARG002)
179-179: Unused method argument: use_dp_padding
(ARG002)
180-180: Unused method argument: post_quant_dispatch
(ARG002)
181-181: Unused method argument: kwargs
(ARG002)
201-201: Avoid specifying long messages outside the exception class
(TRY003)
272-272: Unused method argument: alltoall_result_do_sum
(ARG002)
273-273: Unused method argument: kwargs
(ARG002)
290-290: Avoid specifying long messages outside the exception class
(TRY003)
299-299: Avoid specifying long messages outside the exception class
(TRY003)
313-315: Avoid specifying long messages outside the exception class
(TRY003)
354-356: Avoid specifying long messages outside the exception class
(TRY003)
360-360: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
77-77: Unused method argument: all_rank_num_tokens
(ARG002)
77-77: Unused method argument: num_chunks
(ARG002)
120-120: Unused method argument: use_dp_padding
(ARG002)
121-121: Unused method argument: post_quant_dispatch
(ARG002)
122-122: Unused method argument: kwargs
(ARG002)
130-130: Avoid specifying long messages outside the exception class
(TRY003)
165-165: Unused method argument: payload_in_workspace
(ARG002)
167-167: Unused method argument: kwargs
(ARG002)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
60-60: Unused static method argument: mapping
(ARG004)
78-78: Unused method argument: all_rank_num_tokens
(ARG002)
91-91: Unused method argument: use_dp_padding
(ARG002)
93-93: Unused method argument: kwargs
(ARG002)
142-142: Unpacked variable num_recv_tokens_per_expert_list is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
175-175: Unused method argument: payload_in_workspace
(ARG002)
176-176: Unused method argument: alltoall_result_do_sum
(ARG002)
177-177: Unused method argument: kwargs
(ARG002)
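Most of these findings reduce to two recurring patterns: interface-mandated arguments a particular strategy does not use (ARG002/ARG004/RUF059) and long messages built at the raise site (TRY003). A hedged sketch of the usual remedies follows; the exception class and names below are invented for illustration, not taken from the PR.

```python
class DispatchStateError(RuntimeError):
    """Raised when combine() is called before dispatch()."""

    def __init__(self) -> None:
        # TRY003: keep the long message inside the exception class so raise sites stay short.
        super().__init__("dispatch() must be called before combine()")


class ExampleStrategy:
    def __init__(self) -> None:
        self._dispatch_state = None

    def combine(self, final_hidden_states, payload_in_workspace=None, **kwargs):  # noqa: ARG002
        # ARG002: arguments required by the shared interface but unused here can be
        # suppressed per line or renamed with a leading underscore.
        if self._dispatch_state is None:
            raise DispatchStateError()
        return final_hidden_states


# RUF059: prefix unused unpacked variables with an underscore.
hidden_states, _num_recv_tokens, handle = "x", [0], object()
```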
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Check PR Checklist Resolution
- GitHub Check: Pre-commit Check
🔇 Additional comments (22)
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py (1)
18-23: Fix the missing AllToAllMethodType export.
AllToAllMethodType isn't defined in communication.base, so this import raises at module import time. Either define and export the type in base.py or drop it from the public API so the package initializer loads correctly.
⛔ Skipped due to learnings
Learnt from: 2ez4bz
Repo: NVIDIA/TensorRT-LLM PR: 7013
File: tests/unittest/_torch/models/checkpoints/hf/test_weight_loader.py:5-5
Timestamp: 2025-08-18T23:20:34.467Z
Learning: When analyzing import paths in Python packages, always verify that __init__.py files actually re-export the classes before claiming an import will fail. Check both the import statement and the __all__ list.
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (7)
11-21: LGTM! The imports are well-organized and follow the project's namespace conventions.
23-63: LGTM! The class initialization is well-structured, follows the base Communication interface, and properly initializes MNNVL workspaces. The environment variable pattern for post-quant configuration is consistent with other communication strategies.
64-69: LGTM! The hardware capability check correctly delegates to the MNNVL utility function.
71-79: LGTM! The post-quant support check correctly returns the environment-driven flag. The can_use method appropriately defers to hardware capability checks. The unused parameters are part of the abstract interface signature and can be safely ignored.
81-111: LGTM! The prepare phase correctly gathers statistics via MNNVL utilities and stores the alltoall_info for the dispatch phase. The state management pattern is appropriate.
113-160: LGTM! The dispatch logic correctly implements the MNNVL AllToAll pattern, properly validates that prepare_dispatch was called first, and manages state appropriately. The unused parameters are part of the abstract interface signature. The explicit error message on line 130, while flagged by static analysis, provides valuable debugging context.
162-187: LGTM! The combine phase correctly uses the stored dispatch state to reconstruct final hidden states via MNNVL utilities. The defensive list handling is appropriate, and unused parameters are part of the abstract interface.
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (8)
8-18: LGTM! The imports are properly organized and follow project conventions.
68-88: LGTM! The capability checks correctly validate environment configuration, library installation, and dtype requirements. The post-quant dispatch support appropriately checks for supported quantization modes (NVFP4, FP8 QDQ, W4AFP8). The unused mapping parameter in is_supported maintains signature consistency with other strategies.
90-97: LGTM! The runtime validation correctly enforces the token limit and single-chunk constraint for low-latency DeepEP.
99-141: LGTM! The pre-quantization dispatch path correctly uses the low-latency buffer, adapts outputs for the fused MoE interface, and stores necessary state for the combine phase. The assertion provides an early failure check. Unused parameters are part of the interface signature.
145-158: LGTM! The FP8 QDQ post-quant dispatch correctly handles dtype conversions (viewing as bfloat16 for dispatch, then back to float8_e4m3fn). The assertions provide good validation.
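As a standalone illustration of that dtype round-trip (assuming a PyTorch build with float8 support; this is not the PR's code), viewing float8_e4m3fn data as bfloat16 halves the last dimension and viewing back restores it without touching the underlying bytes:

```python
import torch

x_fp8 = torch.randn(4, 128).to(torch.float8_e4m3fn)  # 1 byte per element
x_as_bf16 = x_fp8.view(torch.bfloat16)                # reinterpret bytes: last dim 128 -> 64
assert x_as_bf16.shape == (4, 64)

x_back = x_as_bf16.view(torch.float8_e4m3fn)          # reinterpret back: last dim 64 -> 128
assert x_back.shape == (4, 128)
assert torch.equal(x_back.view(torch.uint8), x_fp8.view(torch.uint8))  # bytes unchanged
```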
159-201: LGTM! The NVFP4 post-quant dispatch thoroughly validates tensor shapes and dtypes before and after dispatch. The shape calculations for quantized data (hidden_size // 2) and scale factors (hidden_size // 16) are correct for the NVFP4 format.
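For readers unfamiliar with the NVFP4 layout, the shape arithmetic referenced above follows from packing two 4-bit values per byte and keeping one scale factor per 16-element block; a small illustrative sketch (not the PR's code):

```python
num_tokens = 8
hidden_size = 4096

packed_hidden = hidden_size // 2        # two FP4 values per uint8 byte -> 2048 bytes per token
num_scale_factors = hidden_size // 16   # one scale factor per 16-element block -> 256 per token

expected_hidden_states_shape = (num_tokens, packed_hidden)         # (8, 2048)
expected_hidden_states_sf_shape = (num_tokens, num_scale_factors)  # (8, 256)
print(expected_hidden_states_shape, expected_hidden_states_sf_shape)
```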
202-236: LGTM! The W4AFP8 post-quant dispatch correctly applies pre-quantization scaling and handles dtype conversions. The defensive ValueError for unsupported quantization modes is appropriate. The explicit error message, while flagged by static analysis, provides valuable debugging context.
238-343: LGTM! The combine phase correctly handles both standard and low-precision paths, properly extracting required data from dispatch state and kwargs. The helper methods for quantization mode detection are clean and reusable. The _modify_output_to_adapt_fused_moe adapter appropriately reshapes DeepEP outputs for the fused MoE interface, and the TODO comment on line 316 acknowledges this is temporary. Unused parameters are part of the interface signature.
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (6)
8-18: LGTM! The imports are properly organized and follow project conventions.
20-58: LGTM! The class initialization properly extends the Communication base interface and initializes the DeepEP buffer. The initialization is cleaner than DeepEPLowLatency, avoiding runtime environment variable modification.
59-82: LGTM! The capability checks correctly validate environment requirements. The post-quant dispatch support is appropriately restricted to NVFP4-only, which differs from DeepEPLowLatency's broader support (FP8 QDQ, NVFP4, W4AFP8). Unused parameters maintain interface consistency.
84-127: LGTM! The pre-quantization dispatch correctly uses the standard DeepEP buffer dispatch, applies padding for empty receives, and stores necessary state for the combine phase. Unused parameters are part of the interface signature.
172-188: LGTM! The combine phase correctly uses stored dispatch state to unpad and combine final hidden states. Unused parameters are part of the interface signature.
190-226: LGTM! The padding helpers correctly handle the edge case of empty receives, which can cause issues with zero-size tensors. The padding values (using num_slots for expert IDs and ones for scales) are appropriate dummy values, and the unpadding logic correctly reverses the operation.
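A minimal sketch of the pad-then-unpad idea described here; the helper names and exact padding values are illustrative rather than the PR's implementation:

```python
import torch


def pad_empty_recv(hidden_states, topk_idx, scales, num_slots):
    """Insert one dummy token when nothing was received so later ops see non-empty tensors."""
    if hidden_states.shape[0] > 0:
        return hidden_states, topk_idx, scales, 0
    pad_hidden = torch.zeros(1, hidden_states.shape[1],
                             dtype=hidden_states.dtype, device=hidden_states.device)
    pad_idx = torch.full((1, topk_idx.shape[1]), num_slots,
                         dtype=topk_idx.dtype, device=topk_idx.device)  # out-of-range expert id
    pad_scales = torch.ones(1, scales.shape[1],
                            dtype=scales.dtype, device=scales.device)
    return pad_hidden, pad_idx, pad_scales, 1


def unpad(final_hidden_states, num_padded):
    """Drop the dummy rows added by pad_empty_recv before returning results."""
    return final_hidden_states if num_padded == 0 else final_hidden_states[:-num_padded]
```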
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (resolved comment)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (resolved comment)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (resolved comment)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (resolved comment)
Force-pushed 29d4541 to 49fa9dd (Compare)
PR_Github #23724 [ run ] completed with state
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (outdated, resolved comment)
QiJune left a comment
LGTM
Force-pushed 49fa9dd to 70f5e76 (Compare)
/bot run
PR_Github #23965 [ run ] triggered by Bot. Commit:
Force-pushed 70f5e76 to 352ad99 (Compare)
/bot run
PR_Github #23972 [ run ] triggered by Bot. Commit:
PR_Github #23965 [ run ] completed with state
PR_Github #23972 [ run ] completed with state
/bot run
/bot run
PR_Github #23984 [ run ] triggered by Bot. Commit:
PR_Github #23987 [ run ] triggered by Bot. Commit:
PR_Github #23984 [ run ] completed with state
PR_Github #24757 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #24807 [ run ] triggered by Bot. Commit:
PR_Github #24757 [ run ] completed with state
PR_Github #24807 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #24950 [ run ] triggered by Bot. Commit:
PR_Github #24950 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #24990 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #25033 [ run ] triggered by Bot. Commit:
PR_Github #25033 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25106 [ run ] triggered by Bot. Commit:
PR_Github #25106 [ run ] completed with state
/bot run
PR_Github #25144 [ run ] triggered by Bot. Commit:
PR_Github #25144 [ run ] completed with state
/bot run
PR_Github #25187 [ run ] triggered by Bot. Commit:
PR_Github #25187 [ run ] completed with state
/bot run
PR_Github #25251 [ run ] triggered by Bot. Commit:
PR_Github #25251 [ run ] completed with state
/bot skip --comment "the pipeline has passed"
PR_Github #25306 [ skip ] triggered by Bot. Commit:
PR_Github #25306 [ skip ] completed with state
Description
This is a sub-task of refactoring the MoE module.
The first step is to create the communication classes; we can then replace what is used in the different MoE backends so that the implementation details are hidden.
The corresponding test case is tracked in TRTLLM-9110.
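As a rough illustration of where this refactor is heading (not code in this PR), a MoE backend could eventually hold a single Communication object selected by the factory and stop branching on the concrete all-to-all method; the CommunicationFactory.create call and forward wiring below are hypothetical:

```python
from tensorrt_llm._torch.modules.fused_moe.communication import (
    Communication, CommunicationFactory)


class FusedMoEBackendSketch:
    """Hypothetical backend wiring; only the package and class names come from this PR."""

    def __init__(self, mapping, quant_config):
        # Hypothetical factory entry point; the real selection API may differ.
        self.comm: Communication = CommunicationFactory.create(mapping, quant_config)

    def forward(self, hidden_states, token_selected_slots, token_final_scales):
        dispatched = self.comm.dispatch(
            hidden_states,
            token_selected_slots=token_selected_slots,
            token_final_scales=token_final_scales,
        )
        expert_outputs = self.run_experts(dispatched)  # backend-specific expert compute
        return self.comm.combine(expert_outputs)

    def run_experts(self, dispatched):
        raise NotImplementedError
```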
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.