[None][chore] Mass integration of release/1.0 - 5th #7640
Conversation
f4de070 to 19d6afa
📝 Walkthrough

Documentation updates add FP4 and Blackwell details and troubleshooting fixes. CI/test lists are adjusted for GB200 and RTX PRO 6000. Core changes include buffer zeroing in attention, SHM/env handling in the MoE load balancer, lazy dummy-KV initialization and a warning tweak in resource management, distributed HF download/build in the LLM API, a Gemma normalization params update, and several test additions/skips.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as User Process
    participant C as CachedModelLoader
    participant MPI as mpi_session
    participant W as Worker Nodes
    rect rgb(240,248,255)
    note over C: HF model load (multi-GPU aware)
    U->>C: load_hf_model(model, revision)
    alt parallel_config.is_multi_gpu
        C->>MPI: submit_sync(_node_download_hf_model, model, revision)
        MPI-->>C: List[Path|None] (per-rank results)
        C->>C: select first non-None Path
    else single GPU
        C->>C: _node_download_hf_model(model, revision)
    end
    C-->>U: hf_model_dir Path
    end
    rect rgb(245,255,250)
    note over C: Per-node engine build orchestration
    U->>C: build(model_args,...)
    alt multi-GPU
        C->>MPI: submit_sync(_node_build_task, llm_args,...)
        MPI-->>C: List[build_steps_info]
    else single GPU
        C->>C: _node_build_task(llm_args,...)
    end
    C-->>U: build result(s)
    end
```
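For readers skimming the diagram, here is a rough Python sketch of the dispatch pattern it depicts. The names (`load_hf_model`, `node_download`, `submit_sync`) mirror the diagram rather than the exact `CachedModelLoader` API, so treat it as an illustration under those assumptions, not the implementation.

```python
# Illustrative sketch only: mirrors the flow in the diagram above, assuming a
# submit_sync-style MPI session; not the exact CachedModelLoader implementation.
from pathlib import Path
from typing import Callable, List, Optional


def load_hf_model(model: str,
                  revision: Optional[str],
                  is_multi_gpu: bool,
                  mpi_session,
                  node_download: Callable[..., Optional[Path]]) -> Path:
    if is_multi_gpu:
        # Every rank runs the per-node download task; only one rank per node
        # returns a real path, the others return None.
        results: List[Optional[Path]] = mpi_session.submit_sync(
            node_download, model, revision)
        hf_model_dir = next(p for p in results if p is not None)
    else:
        hf_model_dir = node_download(model, revision)
    return hf_model_dir
```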
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks (1 passed, 2 warnings)

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1)
1-38: Explicitly set `enable_block_reuse` in all `kv_cache_config` blocks

Add `enable_block_reuse: False` under the `generation_servers.kv_cache_config` in tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml, and audit the other disagg test configs to ensure every `kv_cache_config` block includes an explicit `enable_block_reuse` key.

tensorrt_llm/_torch/pyexecutor/_util.py (1)
170-179: Avoid double-counting draft tokens in KV capacity estimation.
`spec_cfg.max_draft_len` is added twice (once under the overlap-scheduler branch and again unconditionally), which inflates KV estimates and may reduce usable capacity.

```diff
-    if not pytorch_backend_config.disable_overlap_scheduler:
-        num_extra_tokens_per_seq = num_extra_tokens_per_seq + 1
-        if spec_cfg is not None:
-            num_extra_tokens_per_seq += spec_cfg.max_draft_len
-
-    if spec_cfg is not None:
-        num_extra_tokens_per_seq += spec_cfg.max_draft_len
-        num_extra_tokens_per_seq += get_num_extra_kv_tokens(spec_cfg)
+    if not pytorch_backend_config.disable_overlap_scheduler:
+        num_extra_tokens_per_seq += 1
+
+    if spec_cfg is not None:
+        num_extra_tokens_per_seq += spec_cfg.max_draft_len
+        num_extra_tokens_per_seq += get_num_extra_kv_tokens(spec_cfg)
```
🧹 Nitpick comments (14)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
555-558: Validate fraction range.

Consider asserting the lower bound too, to catch 0 or negative values: `0.0 < free_mem_fraction < 1.0`.

```diff
-assert free_mem_fraction < 1.0, f"Invalid freeMemFraction, freeMemFraction {free_mem_fraction} must be smaller than 1.0"
+assert 0.0 < free_mem_fraction < 1.0, (
+    f"Invalid freeMemFraction: {free_mem_fraction}. Must be in (0.0, 1.0)."
+)
```

docs/source/reference/support-matrix.md (1)
162-162: SM designations are valid; small terminology tweak optional.

“Blackwell (SM100/SM120)” aligns with the CUDA 12.8 notes (SM_100/SM_120). Consider using “NVFP4” instead of “FP4” for precision naming consistency with code. (docs.nvidia.com)
docs/source/overview.md (1)
27-29: Nit: spacing and terminology.
- Remove the stray space before the comma after the B200 link.
- Prefer “NVFP4” (brand) once, then “FP4” thereafter, to match code/constants.
```diff
-[NVIDIA B200 GPUs](https://www.nvidia.com/en-us/data-center/dgx-b200/) , when used with TensorRT-LLM, enable seamless loading of model weights in the new [FP4 format](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/#what_is_nvfp4), allowing you to automatically leverage optimized FP4 kernels for efficient and accurate low-precision inference.
+[NVIDIA B200 GPUs](https://www.nvidia.com/en-us/data-center/dgx-b200/), when used with TensorRT-LLM, enable seamless loading of model weights in the new [NVFP4 format](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/#what_is_nvfp4), allowing you to automatically leverage optimized FP4 kernels for efficient and accurate low-precision inference.
```

tensorrt_llm/_torch/attention_backend/trtllm.py (1)
846-851: Zero only the written slices to avoid unnecessary work.

A full-buffer `fill_` can be sizable; zero the relevant regions instead and prefer `zero_()` for clarity.

```diff
-        self.kv_block_ids_per_seq.fill_(0)
-        self.kv_block_ids_per_seq[:self.num_seqs, :num_blocks].copy_(
+        self.kv_block_ids_per_seq[:self.num_seqs, :num_blocks].zero_()
+        self.kv_block_ids_per_seq[:self.num_seqs, :num_blocks].copy_(
             block_ids_per_seq, non_blocking=True)
-        self.block_ids_per_seq.fill_(0)
-        self.block_ids_per_seq[:self.num_generations, :num_blocks].copy_(
+        self.block_ids_per_seq[:self.num_generations, :num_blocks].zero_()
+        self.block_ids_per_seq[:self.num_generations, :num_blocks].copy_(
             block_ids_per_seq[self.num_contexts:], non_blocking=True)
```

tests/unittest/_torch/speculative/test_eagle3.py (1)
331-332: Optional: avoid running unittest main when skipping heavy tests.

Since pytest drives CI, dropping the `__main__` block reduces accidental local execution. Keep it if needed for local workflows.

tensorrt_llm/models/gemma/model.py (1)
141-146: Guard tuple-unpacking of hidden_states under reduce_fusion.

The FIXME hints at potential shape mismatches. Add a defensive check to avoid crashes when upstream doesn’t return a `(hidden_states, residual)` tuple for some ranks/configs.

```diff
-        if default_net(
-        ).plugin_config.reduce_fusion and self.local_layer_idx > 0:
-            hidden_states, residual = hidden_states  #FIXME:AN need to check if appropriate residual value is hidden state is pulled out.
+        if default_net().plugin_config.reduce_fusion and self.local_layer_idx > 0:
+            if isinstance(hidden_states, tuple):
+                hidden_states, residual = hidden_states
+            else:
+                # Fallback to non-fused contract
+                residual = hidden_states
+                hidden_states = self.input_layernorm(hidden_states)
         else:
             residual = hidden_states
             hidden_states = self.input_layernorm(hidden_states)
```

tests/unittest/llmapi/test_mpi_session.py (2)
60-66: Resolve path once; reuse it for cwd to avoid recomputing.

Minor cleanup: reuse `cur_dir` for `cwd` instead of recomputing `os.path.dirname(os.path.abspath(__file__))`.

```diff
-    with Popen(command,
+    with Popen(command,
                env=os.environ,
                stdout=PIPE,
                stderr=PIPE,
                bufsize=1,
                start_new_session=True,
                universal_newlines=True,
-               cwd=os.path.dirname(os.path.abspath(__file__))) as process:
+               cwd=cur_dir) as process:
```
62-62: Silence Ruff S101 for the test assertion (intended in tests).

Keep the fast-fail assert but add a local noqa to satisfy Ruff.

```diff
-    assert os.path.exists(test_file), f"Test file {test_file} does not exist"
+    assert os.path.exists(test_file), f"Test file {test_file} does not exist"  # noqa: S101
```

tensorrt_llm/_torch/pyexecutor/_util.py (1)
181-184: Lazy-init of dummy requests here is good; also guard in configure_kv_cache_capacity.

Some callers might jump straight to `configure_kv_cache_capacity()`; add a safety init there to avoid `None` usage.

```diff
@@
-        py_executor.set_gather_responses(True)
+        # Ensure dummy requests exist even if try_prepare_estimation() was skipped.
+        if self._dummy_reqs is None:
+            self._dummy_reqs = self._create_dummy_context_requests(
+                max(1, self._max_seq_len - 1)
+            )
+        py_executor.set_gather_responses(True)
```

docs/source/llm-api/index.md (3)
68-68: Fix markdownlint MD026: remove trailing punctuation from heading

Drop the period at the end of this heading to satisfy MD026.

```diff
-### MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
+### MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1
```
78-79: Grammar/typo fixes in guidance

s/an contextmanager/a context manager/ and s/methed/method/.

```diff
- 2. Use LLM as an contextmanager, with the following code: `with LLM(...) as llm: ...`, the shutdown methed will be invoked automatically once it goes out of the `with`-statement block.
+ 2. Use LLM as a context manager, with the following code: `with LLM(...) as llm: ...`. The shutdown method will be invoked automatically once it exits the `with` block.
```
82-93: Clarify the docker networking workaround scope

The mpi4py workaround suggests avoiding --net=host; note explicitly that --ipc=host is the required flag, and keep --net at its default. This helps users not set both.

```diff
-The root cause may be related to `mpi4py`. There is a [workaround](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) suggesting a change from `--net=host` to `--ipc=host`, or setting the following environment variables:
+The root cause may be related to `mpi4py`. A documented [workaround](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) is to avoid `--net=host` and use `--ipc=host` instead, or set the following environment variables:
```

jenkins/L0_Test.groovy (1)
2105-2107: Avoid magic number 7 for post-merge splits

Make the split count a named constant to keep Jenkins and the test-db YAML in sync and ease future changes.

```diff
-    multiNodesSBSAConfigs += (1..7).collectEntries { i ->
-        ["GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-${i}".toString(), ["gb200-multi-node", "l0_gb200_multi_nodes", i, 7, 8, 2]]
-    }
+    // Use a single source of truth for split count.
+    final int POST_MERGE_SPLITS = 7
+    multiNodesSBSAConfigs += (1..POST_MERGE_SPLITS).collectEntries { i ->
+        ["GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-${i}".toString(),
+         ["gb200-multi-node", "l0_gb200_multi_nodes", i, POST_MERGE_SPLITS, 8, 2]]
+    }
```

If preferred, I can place POST_MERGE_SPLITS next to other @field constants.
tensorrt_llm/llmapi/llm_utils.py (1)
769-771: Don’t delete attributes from llm_args

Deleting the mpi_session attribute mutates the schema and risks surprises for later users. Prefer setting it to None.

```diff
-        if self.llm_args.mpi_session:
-            del self.llm_args.mpi_session
+        if getattr(self.llm_args, "mpi_session", None) is not None:
+            self.llm_args.mpi_session = None
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- docs/source/llm-api/index.md (2 hunks)
- docs/source/overview.md (1 hunks)
- docs/source/reference/support-matrix.md (1 hunks)
- jenkins/L0_Test.groovy (1 hunks)
- tensorrt_llm/_torch/attention_backend/trtllm.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
- tensorrt_llm/llmapi/llm_utils.py (4 hunks)
- tensorrt_llm/models/gemma/model.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml (2 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_genpp4.yaml (2 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2pp2_gentp2pp2.yaml (2 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1 hunks)
- tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml (0 hunks)
- tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (0 hunks)
- tests/unittest/_torch/speculative/test_eagle3.py (1 hunks)
- tests/unittest/llmapi/test_mpi_session.py (1 hunks)
💤 Files with no reviewable changes (2)
- tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py
- tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tests/unittest/_torch/speculative/test_eagle3.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tests/unittest/llmapi/test_mpi_session.py
- tensorrt_llm/models/gemma/model.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/llmapi/llm_utils.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tests/unittest/_torch/speculative/test_eagle3.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tests/unittest/llmapi/test_mpi_session.py
- tensorrt_llm/models/gemma/model.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/llmapi/llm_utils.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tests/unittest/_torch/speculative/test_eagle3.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tests/unittest/llmapi/test_mpi_session.py
- tensorrt_llm/models/gemma/model.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/llmapi/llm_utils.py
🧠 Learnings (6)
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
PR: NVIDIA/TensorRT-LLM#7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
tests/unittest/_torch/speculative/test_eagle3.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
docs/source/overview.md
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-13T11:07:11.772Z
Learnt from: Funatiq
PR: NVIDIA/TensorRT-LLM#6754
File: tests/integration/test_lists/test-db/l0_a30.yml:41-47
Timestamp: 2025-08-13T11:07:11.772Z
Learning: In TensorRT-LLM test configuration files like tests/integration/test_lists/test-db/l0_a30.yml, TIMEOUT values are specified in minutes, not seconds.
Applied to files:
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml
🧬 Code graph analysis (5)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
tensorrt_llm/_torch/attention_backend/interface.py (1)
num_seqs(259-263)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (2)
tensorrt_llm/_torch/expert_statistic.py (1)
create(18-29)
tensorrt_llm/logger.py (1)
warning(132-133)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
max_seq_len(561-571)
max_seq_len(574-578)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
tensorrt_llm/llmapi/build_cache.py (1)
is_cached(256-271)
tensorrt_llm/llmapi/llm_args.py (1)
CudaGraphConfig(108-165)
tensorrt_llm/llmapi/llm.py (1)
LLM(1017-1033)
tests/integration/defs/conftest.py (1)
llm_models_root(77-82)
llm_models_root(77-82)
tensorrt_llm/llmapi/llm_utils.py (3)
tensorrt_llm/_utils.py (6)
global_mpi_rank(497-498)
local_mpi_rank(509-510)
mpi_barrier(526-528)
mpi_broadcast(536-537)
mpi_rank(493-494)
release_gc(678-685)
tensorrt_llm/llmapi/llm_args.py (2)
parallel_config(1410-1411)
is_multi_gpu(279-280)
tensorrt_llm/llmapi/utils.py (2)
print_traceback_on_error(31-42)
download_hf_model(173-182)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_mpi_session.py
62-62: Use of assert detected
(S101)
🪛 markdownlint-cli2 (0.17.2)
docs/source/llm-api/index.md
68-68: Trailing punctuation in heading
Punctuation: '.'
(MD026, no-trailing-punctuation)
🔇 Additional comments (11)
tests/unittest/_torch/speculative/test_eagle3.py (1)
19-19: Skip marker placement looks good; preserves CI stability.

The `@pytest.mark.skip` is correctly above the parametrization and will skip the entire test as intended. Based on prior CI behavior, skip markers remain honored even with a `__main__` block.
Using retrieved learning from #7370.

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1)
17-17: Block reuse explicitly disabled only on the context side; confirm intent.

The generation side lacks `enable_block_reuse`. For parity and predictable behavior across roles, consider setting it explicitly there as well (as done in other configs), or document why it differs.

If parity is desired, add under `generation_servers.kv_cache_config`:

```diff
   kv_cache_config:
     free_gpu_memory_fraction: 0.2
     enable_partial_reuse: False
+    enable_block_reuse: False
```

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml (1)
17-17: Consistency across disagg configs looks good.

Disabling block reuse on both context and generation matches the stated pattern.
Also applies to: 33-33
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_genpp4.yaml (1)
17-17: LGTM — explicit disable on both roles.

Matches the approach in the ctxpp2_genpp2 variant.
Also applies to: 33-33
tensorrt_llm/models/gemma/model.py (1)
160-166: Fused norm params wiring looks consistent; confirm plugin semantics.

Passing the pre-FFN norm weight and the post-layernorm weight via `norm_weight` and `norm_pre_residual_weight` matches the MLP path pattern. Please confirm `RESIDUAL_RMS_PREPOST_NORM` expects these exact roles (pre-next-layer input vs. pre-residual), to avoid silent mis-normalization.

tensorrt_llm/_torch/pyexecutor/_util.py (2)
56-58: Good: defer dummy request creation; store max_seq_len for later sync.

This reduces constructor side effects and helps SWA workflows. LGTM.
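As a side note, the deferred-creation pattern praised here can be sketched roughly as follows; the class and helper names are hypothetical stand-ins, not the actual `_util.py` code.

```python
# Minimal sketch of the deferred dummy-request pattern (hypothetical names; the
# real logic lives in tensorrt_llm/_torch/pyexecutor/_util.py).
from typing import List, Optional, Tuple


class KvCacheEstimatorSketch:

    def __init__(self, max_seq_len: int) -> None:
        # Only record the value in the constructor; no dummy requests yet.
        self._max_seq_len = max_seq_len
        self._dummy_reqs: Optional[List[Tuple[str, int]]] = None

    def _create_dummy_context_requests(self, num_tokens: int) -> List[Tuple[str, int]]:
        # Stand-in for the real request builder.
        return [("dummy_context_request", num_tokens)]

    def try_prepare_estimation(self) -> None:
        # Create lazily, so a later max_seq_len change (e.g. after SWA capping)
        # can simply regenerate the requests instead of carrying stale ones.
        if self._dummy_reqs is None:
            self._dummy_reqs = self._create_dummy_context_requests(
                max(1, self._max_seq_len - 1))
```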
473-479: SWA-aware re-sync of dummy requests looks correct.

Regenerating `_dummy_reqs` when `max_seq_len` shrinks avoids overlong dummy contexts post-SWA. LGTM.

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2pp2_gentp2pp2.yaml (1)
17-17: Disabling KV block reuse: confirm intent and perf impact

Setting enable_block_reuse: False on both context/generation likely reduces memory reuse and could affect throughput. If this is a WAR for kernel gaps, add a brief YAML comment and link to the tracking issue; otherwise consider leaving it enabled in perf-oriented tests.
Apply this minimal note to document the rationale:
```diff
   kv_cache_config:
     free_gpu_memory_fraction: 0.2
     enable_partial_reuse: False
+    # WAR: disable block reuse for this scenario (see <issue/bug id>)
     enable_block_reuse: False
```

Also applies to: 33-33
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1)
36-39: Post-merge MOE additions look good; double-check CI timing

New entries and 180-minute timeouts align with prior convention (timeouts are in minutes). Please verify total wall time against the Jenkins splits introduced in L0_Test.groovy (now 7 stages) so we don’t exceed the pipeline SLA.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
2356-2367: Param adds for multi_gpus_no_cache: path selection logic LGTM; ensure HF credentials

The is_cached switch and model path fallback are correct. For the non-cached 8-GPU case, ensure HF auth tokens are present on all workers to avoid per-rank auth failures. (A rough sketch of the selection pattern follows below.)
Also applies to: 2369-2375
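Here is a hedged sketch of the cached-vs-hub selection pattern being reviewed; `models_root` and `relative_dir` are placeholders, and a plain directory check stands in for the real `is_cached()` helper.

```python
# Sketch of the path-selection idea, assuming a CI model cache mounted under
# models_root; a directory check stands in for build_cache.is_cached().
import os


def pick_model_path(hf_id: str, models_root: str, relative_dir: str) -> str:
    local_path = os.path.join(models_root, relative_dir)
    # Prefer the pre-downloaded copy when present; otherwise fall back to the
    # HF hub ID, which requires valid HF credentials on every worker.
    return local_path if os.path.isdir(local_path) else hf_id
```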
tensorrt_llm/llmapi/llm_utils.py (1)
831-841: Node-scoped HF download helper: nice

Limiting downloads to local_mpi_rank()==0 reduces redundant I/O across ranks cleanly.
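The idea can be sketched like this; the barrier and download callables are injected rather than imported, so nothing here presumes the real `_utils` or `llm_utils` signatures.

```python
# Rough illustration of the "one download per node" pattern (not the actual
# helper in llm_utils.py; barrier/download are passed in to avoid assuming APIs).
from pathlib import Path
from typing import Callable, Optional


def node_download_hf_model(model: str,
                           revision: Optional[str],
                           local_rank: int,
                           barrier: Callable[[], None],
                           download: Callable[..., str]) -> Optional[Path]:
    """Only local rank 0 of each node downloads; other ranks wait and return None."""
    path: Optional[Path] = None
    if local_rank == 0:
        path = Path(download(model, revision=revision))
    # Ensure the download has finished before any rank on this node proceeds.
    barrier()
    return path
```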
19d6afa to 615e784

95bd93c to 4514897
nv-guomingz
left a comment
LGTM
4514897 to 00f91e1
/bot run --disable-fail-fast

PR_Github #18201 [ run ] triggered by Bot

/bot run --disable-fail-fast

PR_Github #18306 [ run ] triggered by Bot

PR_Github #18306 [ run ] completed with state

/bot run --disable-fail-fast
…Qwen3 MoE multi node (NVIDIA#7724) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…VIDIA#7753) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…ywords for multimodal model (NVIDIA#7670) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…#7751) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
NVIDIA#7762) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com> Signed-off-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com> Co-authored-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…on pre-Blackwell arch (NVIDIA#7768) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
/bot kill
…VIDIA#7663) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
bb4e89f to 1fcfeeb
PR_Github #19365 [ kill ] triggered by Bot

PR_Github #19365 [ kill ] completed with state

/bot run --disable-fail-fast

PR_Github #19367 [ run ] triggered by Bot

PR_Github #19367 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #19403 [ run ] triggered by Bot

PR_Github #19403 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #19419 [ run ] triggered by Bot

PR_Github #19419 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #19436 [ run ] triggered by Bot

PR_Github #19436 [ run ] completed with state

/bot skip --comment "Bypass faulty oom test."

PR_Github #19505 [ skip ] triggered by Bot

PR_Github #19505 [ skip ] completed with state
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.
skip
`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.