[TRTLLM-8957][feat] create communication related classes #8968
Conversation
/bot run
PR_Github #23724 [ run ] triggered by Bot. Commit:
📝 Walkthrough
Introduces a new MoE communication strategy module with a base interface, five concrete implementations (AllGatherReduceScatter, MnnvlLatency, MNNVLThroughput, DeepEP, DeepEPLowLatency), a factory for runtime strategy selection, and a package initializer exposing the public API.
Changes
Sequence Diagram
sequenceDiagram
    participant Factory
    participant Strategy as Communication Strategy
    participant Dispatch as Dispatch Phase
    participant Combine as Combine Phase
    Factory->>Factory: Evaluate environment & mapping<br/>(topology, dtype, flags)
    Factory->>Strategy: Create strategy instance<br/>(AllGather/MNNVL/DeepEP)
    Strategy->>Dispatch: prepare_dispatch()<br/>(optional metadata prep)
    Dispatch->>Dispatch: Store alltoall_info or<br/>dispatch state
    Dispatch-->>Strategy: Return local statistics
    Strategy->>Dispatch: dispatch()<br/>(hidden_states, tokens, scales)
    Dispatch->>Dispatch: Execute AllGather/<br/>MNNVL/DeepEP<br/>operation
    Dispatch->>Dispatch: Store dispatch_state<br/>(sizes, handles, etc.)
    Dispatch-->>Strategy: Return dispatched tensors
    Strategy->>Combine: combine()<br/>(final_hidden_states)
    Combine->>Combine: Read dispatch_state<br/>Reverse communication<br/>(ReduceScatter/Combine)
    Combine-->>Strategy: Return final output
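To make the diagram concrete, here is a minimal sketch of the base strategy interface it implies. The class and method names (Communication, is_supported, can_use, supports_post_quant_dispatch, prepare_dispatch, dispatch, combine) are taken from the review comments below; the signatures, defaults, and docstrings are assumptions for illustration, not the PR's actual code.

```python
from abc import ABC, abstractmethod


class Communication(ABC):
    """Sketch of the MoE communication strategy interface (signatures are guesses)."""

    @staticmethod
    def is_supported(mapping) -> bool:
        """Static check: can this strategy run at all on the current topology/build?"""
        return True

    def can_use(self, all_rank_num_tokens, num_chunks) -> bool:
        """Per-iteration check, e.g. token limits or chunking constraints."""
        return True

    def supports_post_quant_dispatch(self) -> bool:
        """Whether already-quantized tensors (e.g. NVFP4/FP8) can be dispatched directly."""
        return False

    def prepare_dispatch(self, token_selected_slots, all_rank_num_tokens,
                         local_statistic_tensor):
        """Optional metadata preparation (e.g. alltoall_info); returns local statistics."""
        return None

    @abstractmethod
    def dispatch(self, hidden_states, **kwargs):
        """Send tokens (and scales) to the ranks owning their experts; store dispatch state."""

    @abstractmethod
    def combine(self, final_hidden_states, **kwargs):
        """Reverse the dispatch using the stored state and reduce partial expert outputs."""
```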
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Areas requiring extra attention:
Pre-merge checks and finishing touches
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 11
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (1)
129-170: Post-quant dispatch logic is correct; consider marking unused variable.
The post-quantization dispatch correctly handles NVFP4 with dtype adapters for hidden_states_sf, applies padding, and stores state. The TODO comment on line 134 appropriately notes the adapter is temporary. Consider marking the unused unpacked variable with an underscore to indicate it's intentionally ignored:

  (
      (hidden_states, hidden_states_sf),
      recv_topk_idx,
      token_final_scales,
-     num_recv_tokens_per_expert_list,
+     _,  # num_recv_tokens_per_expert_list (unused)
      deep_ep_handle,
  ) = self.deep_ep_buffer.dispatch(
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/base.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
tensorrt_llm/_torch/modules/fused_moe/communication/base.py
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
tensorrt_llm/_torch/modules/fused_moe/communication/base.py
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
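Purely as an illustration of several of the Python conventions listed above (Google-style docstring, snake_case naming, an upper SNAKE_CASE constant, and a narrow try/except with the main logic in else); the function and names are made up for this example.

```python
MAX_RETRIES = 3  # Constant: upper SNAKE_CASE.


def parse_token_count(raw_value: str) -> int:
    """Parse a token count from a string.

    Args:
        raw_value: The raw string value, e.g. "128".

    Returns:
        The parsed, non-negative token count (0 if the input is not an integer).
    """
    try:
        token_count = int(raw_value)  # Keep the try body minimal.
    except ValueError:
        return 0
    else:
        return max(token_count, 0)  # Main logic lives in the else branch.
```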
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
tensorrt_llm/_torch/modules/fused_moe/communication/base.py
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
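A minimal sketch of the disjoint-tag idea captured in this learning, written directly against mpi4py; the tag values and payloads are illustrative only.

```python
from mpi4py import MPI

# Disjoint tag namespaces: tokens, logits, and termination syncs exchanged between
# the same source/destination pair can never be routed to the wrong receive call.
TOKENS_TAG = 100
LOGITS_TAG = 200
TERMINATION_TAG = 300

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    comm.send([1, 2, 3], dest=1, tag=TOKENS_TAG)
    comm.send(None, dest=1, tag=TERMINATION_TAG)
elif comm.Get_rank() == 1:
    tokens = comm.recv(source=0, tag=TOKENS_TAG)   # Matches only the tokens message.
    comm.recv(source=0, tag=TERMINATION_TAG)       # Matches only the termination sync.
```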
🧬 Code graph analysis (8)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (5)
    tensorrt_llm/mapping.py (2)
        Mapping (336-493)
        moe_ep_rank (205-206)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (5)
        Communication (20-167)
        supports_post_quant_dispatch (63-73)
        can_use (55-61)
        dispatch (103-142)
        combine (145-167)
    tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (5)
        get_low_latency_buffer (234-244)
        low_latency_dispatch (142-158)
        low_latency_dispatch_fp4 (172-183)
        low_latency_combine_low_precision (185-200)
        low_latency_combine (160-170)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (5)
        is_supported (60-68)
        supports_post_quant_dispatch (70-76)
        can_use (78-82)
        dispatch (84-170)
        combine (172-188)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (9)
    tensorrt_llm/_utils.py (1)
        local_mpi_size (557-558)
    tensorrt_llm/mapping.py (1)
        Mapping (336-493)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (1)
        AllGatherReduceScatter (21-92)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (1)
        Communication (20-167)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (2)
        DeepEP (20-226)
        is_supported (60-68)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (2)
        DeepEPLowLatency (20-344)
        is_supported (69-79)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (2)
        MnnvlLatency (23-187)
        is_supported (65-69)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (2)
        MNNVLThroughput (24-370)
        is_supported (156-160)
tensorrt_llm/_torch/modules/fused_moe/communication/base.py (7)
    tensorrt_llm/mapping.py (1)
        moe_ep_rank (205-206)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (3)
        can_use (47-49)
        dispatch (51-77)
        combine (79-92)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (4)
        can_use (78-82)
        supports_post_quant_dispatch (70-76)
        dispatch (84-170)
        combine (172-188)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (4)
        can_use (90-97)
        supports_post_quant_dispatch (81-88)
        dispatch (99-236)
        combine (238-286)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (5)
        can_use (77-79)
        supports_post_quant_dispatch (71-75)
        prepare_dispatch (81-111)
        dispatch (113-160)
        combine (162-187)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (4)
        can_use (168-170)
        supports_post_quant_dispatch (162-166)
        dispatch (172-266)
        combine (268-336)
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py (7)
    tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (1)
        AllGatherReduceScatter (21-92)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (1)
        Communication (20-167)
    tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (1)
        CommunicationFactory (56-315)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (1)
        DeepEP (20-226)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (1)
        DeepEPLowLatency (20-344)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (1)
        MnnvlLatency (23-187)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (1)
        MNNVLThroughput (24-370)
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (3)
    tensorrt_llm/mapping.py (1)
        Mapping (336-493)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (4)
        Communication (20-167)
        can_use (55-61)
        dispatch (103-142)
        combine (145-167)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py (4)
    tensorrt_llm/_mnnvl_utils.py (3)
        MnnvlMemory (53-338)
        as_torch_strided_tensor (84-88)
        supports_mnnvl (332-338)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (5)
        Communication (20-167)
        supports_post_quant_dispatch (63-73)
        can_use (55-61)
        dispatch (103-142)
        combine (145-167)
    tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (5)
        is_supported (65-69)
        supports_post_quant_dispatch (71-75)
        can_use (77-79)
        dispatch (113-160)
        combine (162-187)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (2)
    tensorrt_llm/_mnnvl_utils.py (8)
        MnnvlMemory (53-338)
        MnnvlMoe (352-624)
        get_moe_workspaces (360-376)
        get_moe_prepare_workspace (379-390)
        supports_mnnvl (332-338)
        mnnvl_moe_alltoallv_prepare_without_allgather (402-446)
        mnnvl_moe_alltoallv (531-592)
        mnnvl_moe_alltoallv_combine (595-624)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (4)
        Communication (20-167)
        supports_post_quant_dispatch (63-73)
        dispatch (103-142)
        combine (145-167)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (4)
    tensorrt_llm/mapping.py (1)
        Mapping (336-493)
    tensorrt_llm/models/modeling_utils.py (1)
        QuantConfig (131-271)
    tensorrt_llm/_torch/modules/fused_moe/communication/base.py (4)
        supports_post_quant_dispatch (63-73)
        can_use (55-61)
        dispatch (103-142)
        combine (145-167)
    tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (5)
        is_supported (69-79)
        supports_post_quant_dispatch (81-88)
        can_use (90-97)
        dispatch (99-236)
        combine (238-286)
🪛 Ruff (0.14.3)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
69-69: Unused static method argument: mapping
(ARG004)
106-106: Unused method argument: use_dp_padding
(ARG002)
109-109: Unused method argument: kwargs
(ARG002)
220-220: Avoid specifying long messages outside the exception class
(TRY003)
241-241: Unused method argument: payload_in_workspace
(ARG002)
242-242: Unused method argument: alltoall_result_do_sum
(ARG002)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
315-315: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/modules/fused_moe/communication/base.py
77-77: Unused method argument: token_selected_slots
(ARG002)
78-78: Unused method argument: all_rank_num_tokens
(ARG002)
79-79: Unused method argument: local_statistic_tensor
(ARG002)
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
25-37: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py
47-47: Unused method argument: all_rank_num_tokens
(ARG002)
47-47: Unused method argument: num_chunks
(ARG002)
59-59: Unused method argument: post_quant_dispatch
(ARG002)
60-60: Unused method argument: kwargs
(ARG002)
82-82: Unused method argument: payload_in_workspace
(ARG002)
83-83: Unused method argument: alltoall_result_do_sum
(ARG002)
84-84: Unused method argument: kwargs
(ARG002)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_throughput.py
168-168: Unused method argument: all_rank_num_tokens
(ARG002)
168-168: Unused method argument: num_chunks
(ARG002)
178-178: Unused method argument: all_rank_num_tokens
(ARG002)
179-179: Unused method argument: use_dp_padding
(ARG002)
180-180: Unused method argument: post_quant_dispatch
(ARG002)
181-181: Unused method argument: kwargs
(ARG002)
201-201: Avoid specifying long messages outside the exception class
(TRY003)
272-272: Unused method argument: alltoall_result_do_sum
(ARG002)
273-273: Unused method argument: kwargs
(ARG002)
290-290: Avoid specifying long messages outside the exception class
(TRY003)
299-299: Avoid specifying long messages outside the exception class
(TRY003)
313-315: Avoid specifying long messages outside the exception class
(TRY003)
354-356: Avoid specifying long messages outside the exception class
(TRY003)
360-360: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py
77-77: Unused method argument: all_rank_num_tokens
(ARG002)
77-77: Unused method argument: num_chunks
(ARG002)
120-120: Unused method argument: use_dp_padding
(ARG002)
121-121: Unused method argument: post_quant_dispatch
(ARG002)
122-122: Unused method argument: kwargs
(ARG002)
130-130: Avoid specifying long messages outside the exception class
(TRY003)
165-165: Unused method argument: payload_in_workspace
(ARG002)
167-167: Unused method argument: kwargs
(ARG002)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
60-60: Unused static method argument: mapping
(ARG004)
78-78: Unused method argument: all_rank_num_tokens
(ARG002)
91-91: Unused method argument: use_dp_padding
(ARG002)
93-93: Unused method argument: kwargs
(ARG002)
142-142: Unpacked variable num_recv_tokens_per_expert_list is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
175-175: Unused method argument: payload_in_workspace
(ARG002)
176-176: Unused method argument: alltoall_result_do_sum
(ARG002)
177-177: Unused method argument: kwargs
(ARG002)
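Most of these findings reduce to two recurring patterns: interface-mandated arguments a particular strategy does not use (ARG002/ARG004/RUF059) and long messages built at the raise site (TRY003). A hedged sketch of the usual remedies follows; the exception class and names below are invented for illustration, not taken from the PR.

```python
class DispatchStateError(RuntimeError):
    """Raised when combine() is called before dispatch()."""

    def __init__(self) -> None:
        # TRY003: keep the long message inside the exception class so raise sites stay short.
        super().__init__("dispatch() must be called before combine()")


class ExampleStrategy:
    def __init__(self) -> None:
        self._dispatch_state = None

    def combine(self, final_hidden_states, payload_in_workspace=None, **kwargs):  # noqa: ARG002
        # ARG002: arguments required by the shared interface but unused here can be
        # suppressed per line or renamed with a leading underscore.
        if self._dispatch_state is None:
            raise DispatchStateError()
        return final_hidden_states


# RUF059: prefix unused unpacked variables with an underscore.
hidden_states, _num_recv_tokens, handle = "x", [0], object()
```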
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Check PR Checklist Resolution
- GitHub Check: Pre-commit Check
🔇 Additional comments (22)
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py (1)
18-23: Fix the missing AllToAllMethodType export.
AllToAllMethodType isn't defined in communication.base, so this import raises at module import time. Either define and export the type in base.py or drop it from the public API so the package initializer loads correctly.
⛔ Skipped due to learnings
Learnt from: 2ez4bz
Repo: NVIDIA/TensorRT-LLM PR: 7013
File: tests/unittest/_torch/models/checkpoints/hf/test_weight_loader.py:5-5
Timestamp: 2025-08-18T23:20:34.467Z
Learning: When analyzing import paths in Python packages, always verify that __init__.py files actually re-export the classes before claiming an import will fail. Check both the import statement and the __all__ list.
tensorrt_llm/_torch/modules/fused_moe/communication/mnnvl_latency.py (7)
11-21: LGTM! The imports are well-organized and follow the project's namespace conventions.
23-63: LGTM! The class initialization is well-structured, follows the base Communication interface, and properly initializes MNNVL workspaces. The environment variable pattern for post-quant configuration is consistent with other communication strategies.
64-69: LGTM! The hardware capability check correctly delegates to the MNNVL utility function.
71-79: LGTM! The post-quant support check correctly returns the environment-driven flag. The can_use method appropriately defers to hardware capability checks. The unused parameters are part of the abstract interface signature and can be safely ignored.
81-111: LGTM! The prepare phase correctly gathers statistics via MNNVL utilities and stores the alltoall_info for the dispatch phase. The state management pattern is appropriate.
113-160: LGTM! The dispatch logic correctly implements the MNNVL AllToAll pattern, properly validates that prepare_dispatch was called first, and manages state appropriately. The unused parameters are part of the abstract interface signature. The explicit error message on line 130, while flagged by static analysis, provides valuable debugging context.
162-187: LGTM! The combine phase correctly uses the stored dispatch state to reconstruct final hidden states via MNNVL utilities. The defensive list handling is appropriate, and unused parameters are part of the abstract interface.
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (8)
8-18: LGTM! The imports are properly organized and follow project conventions.
68-88: LGTM! The capability checks correctly validate environment configuration, library installation, and dtype requirements. The post-quant dispatch support appropriately checks for supported quantization modes (NVFP4, FP8 QDQ, W4AFP8). The unused mapping parameter in is_supported maintains signature consistency with other strategies.
90-97: LGTM! The runtime validation correctly enforces the token limit and single-chunk constraint for low-latency DeepEP.
99-141: LGTM! The pre-quantization dispatch path correctly uses the low-latency buffer, adapts outputs for the fused MoE interface, and stores necessary state for the combine phase. The assertion provides an early failure check. Unused parameters are part of the interface signature.
145-158: LGTM! The FP8 QDQ post-quant dispatch correctly handles dtype conversions (viewing as bfloat16 for dispatch, then back to float8_e4m3fn). The assertions provide good validation.
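As a standalone illustration of that dtype round-trip (assuming a PyTorch build with float8 support; this is not the PR's code), viewing float8_e4m3fn data as bfloat16 halves the last dimension and viewing back restores it without touching the underlying bytes:

```python
import torch

x_fp8 = torch.randn(4, 128).to(torch.float8_e4m3fn)  # 1 byte per element
x_as_bf16 = x_fp8.view(torch.bfloat16)                # reinterpret bytes: last dim 128 -> 64
assert x_as_bf16.shape == (4, 64)

x_back = x_as_bf16.view(torch.float8_e4m3fn)          # reinterpret back: last dim 64 -> 128
assert x_back.shape == (4, 128)
assert torch.equal(x_back.view(torch.uint8), x_fp8.view(torch.uint8))  # bytes unchanged
```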
159-201: LGTM! The NVFP4 post-quant dispatch thoroughly validates tensor shapes and dtypes before and after dispatch. The shape calculations for quantized data (hidden_size // 2) and scale factors (hidden_size // 16) are correct for the NVFP4 format.
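For readers unfamiliar with the NVFP4 layout, the shape arithmetic referenced above follows from packing two 4-bit values per byte and keeping one scale factor per 16-element block; a small illustrative sketch (not the PR's code):

```python
num_tokens = 8
hidden_size = 4096

packed_hidden = hidden_size // 2        # two FP4 values per uint8 byte -> 2048 bytes per token
num_scale_factors = hidden_size // 16   # one scale factor per 16-element block -> 256 per token

expected_hidden_states_shape = (num_tokens, packed_hidden)         # (8, 2048)
expected_hidden_states_sf_shape = (num_tokens, num_scale_factors)  # (8, 256)
print(expected_hidden_states_shape, expected_hidden_states_sf_shape)
```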
202-236: LGTM! The W4AFP8 post-quant dispatch correctly applies pre-quantization scaling and handles dtype conversions. The defensive ValueError for unsupported quantization modes is appropriate. The explicit error message, while flagged by static analysis, provides valuable debugging context.
238-343: LGTM! The combine phase correctly handles both standard and low-precision paths, properly extracting required data from dispatch state and kwargs. The helper methods for quantization mode detection are clean and reusable. The _modify_output_to_adapt_fused_moe adapter appropriately reshapes DeepEP outputs for the fused MoE interface, and the TODO comment on line 316 acknowledges this is temporary. Unused parameters are part of the interface signature.
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (6)
8-18: LGTM! The imports are properly organized and follow project conventions.
20-58: LGTM! The class initialization properly extends the Communication base interface and initializes the DeepEP buffer. The initialization is cleaner than DeepEPLowLatency, avoiding runtime environment variable modification.
59-82: LGTM! The capability checks correctly validate environment requirements. The post-quant dispatch support is appropriately restricted to NVFP4-only, which differs from DeepEPLowLatency's broader support (FP8 QDQ, NVFP4, W4AFP8). Unused parameters maintain interface consistency.
84-127: LGTM! The pre-quantization dispatch correctly uses the standard DeepEP buffer dispatch, applies padding for empty receives, and stores necessary state for the combine phase. Unused parameters are part of the interface signature.
172-188: LGTM! The combine phase correctly uses stored dispatch state to unpad and combine final hidden states. Unused parameters are part of the interface signature.
190-226: LGTM! The padding helpers correctly handle the edge case of empty receives, which can cause issues with zero-size tensors. The padding values (using num_slots for expert IDs and ones for scales) are appropriate dummy values, and the unpadding logic correctly reverses the operation.
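A minimal sketch of the pad-then-unpad idea described here; the helper names and exact padding values are illustrative rather than the PR's implementation:

```python
import torch


def pad_empty_recv(hidden_states, topk_idx, scales, num_slots):
    """Insert one dummy token when nothing was received so later ops see non-empty tensors."""
    if hidden_states.shape[0] > 0:
        return hidden_states, topk_idx, scales, 0
    pad_hidden = torch.zeros(1, hidden_states.shape[1],
                             dtype=hidden_states.dtype, device=hidden_states.device)
    pad_idx = torch.full((1, topk_idx.shape[1]), num_slots,
                         dtype=topk_idx.dtype, device=topk_idx.device)  # out-of-range expert id
    pad_scales = torch.ones(1, scales.shape[1],
                            dtype=scales.dtype, device=scales.device)
    return pad_hidden, pad_idx, pad_scales, 1


def unpad(final_hidden_states, num_padded):
    """Drop the dummy rows added by pad_empty_recv before returning results."""
    return final_hidden_states if num_padded == 0 else final_hidden_states[:-num_padded]
```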
tensorrt_llm/_torch/modules/fused_moe/communication/allgather_reducescatter.py (resolved comment)
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (resolved comment)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (resolved comment)
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (resolved comment)
Force-pushed 29d4541 to 49fa9dd (Compare)
PR_Github #23724 [ run ] completed with state
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (outdated, resolved comment)
QiJune left a comment
LGTM
Force-pushed 49fa9dd to 70f5e76 (Compare)
/bot run
PR_Github #23965 [ run ] triggered by Bot. Commit:
Force-pushed 70f5e76 to 352ad99 (Compare)
/bot run
PR_Github #23972 [ run ] triggered by Bot. Commit:
PR_Github #23965 [ run ] completed with state
PR_Github #23972 [ run ] completed with state
/bot run
/bot run
PR_Github #23984 [ run ] triggered by Bot. Commit:
PR_Github #23987 [ run ] triggered by Bot. Commit:
PR_Github #23984 [ run ] completed with state
PR_Github #24757 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #24807 [ run ] triggered by Bot. Commit:
PR_Github #24757 [ run ] completed with state
PR_Github #24807 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #24950 [ run ] triggered by Bot. Commit:
PR_Github #24950 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #24990 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #25033 [ run ] triggered by Bot. Commit:
PR_Github #25033 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25106 [ run ] triggered by Bot. Commit:
PR_Github #25106 [ run ] completed with state
/bot run
PR_Github #25144 [ run ] triggered by Bot. Commit:
PR_Github #25144 [ run ] completed with state
/bot run
PR_Github #25187 [ run ] triggered by Bot. Commit:
PR_Github #25187 [ run ] completed with state
/bot run
PR_Github #25251 [ run ] triggered by Bot. Commit:
PR_Github #25251 [ run ] completed with state
/bot skip --comment "the pipeline has passed"
PR_Github #25306 [ skip ] triggered by Bot. Commit:
PR_Github #25306 [ skip ] completed with state
Description
This is a sub-task of refactoring the MoE module.
The first step is to create the communication classes; we can then replace what is used in the different MoE backends so that the implementation details are hidden.
The corresponding test case is tracked in TRTLLM-9110.
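As a rough illustration of where this refactor is heading (not code in this PR), a MoE backend could eventually hold a single Communication object selected by the factory and stop branching on the concrete all-to-all method; the CommunicationFactory.create call and forward wiring below are hypothetical:

```python
from tensorrt_llm._torch.modules.fused_moe.communication import (
    Communication, CommunicationFactory)


class FusedMoEBackendSketch:
    """Hypothetical backend wiring; only the package and class names come from this PR."""

    def __init__(self, mapping, quant_config):
        # Hypothetical factory entry point; the real selection API may differ.
        self.comm: Communication = CommunicationFactory.create(mapping, quant_config)

    def forward(self, hidden_states, token_selected_slots, token_final_scales):
        dispatched = self.comm.dispatch(
            hidden_states,
            token_selected_slots=token_selected_slots,
            token_final_scales=token_final_scales,
        )
        expert_outputs = self.run_experts(dispatched)  # backend-specific expert compute
        return self.comm.combine(expert_outputs)

    def run_experts(self, dispatched):
        raise NotImplementedError
```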
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.