[TRTLLM-7843][feat] implement disagg cluster auto-scaling #8215
Conversation
Force-pushed from f59ef4c to 79b56a0
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Force-pushed from 79b56a0 to 459cfcf
📝 Walkthrough

Adds cluster-aware disaggregated serving: new cluster config plumbing from the CLI and serve entrypoint into OpenAIServer and the disaggregated server; dynamic router server management; cluster storage/manager lifecycle, readiness checks, and worker discovery. Introduces cluster configuration extraction utilities. Adds integration tests for auto-scaling, minimal instances, and restarts, plus a test list entry.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor Admin as Admin/CLI
participant Serve as serve.py
participant OAIS as OpenAIServer
participant CS as ClusterStorage
participant CW as ClusterWorker
Admin->>Serve: serve --disagg_cluster_uri=...
Serve->>Serve: derive cluster_config
Serve->>OAIS: init(model, server_role, metadata_cfg, cluster_config)
OAIS->>OAIS: store host/port on bind
OAIS->>CS: create storage client
OAIS->>CW: create and register worker (role from server_role)
CW-->>CS: register(worker info)
    note over OAIS,CS: Worker registered for discovery
```

```mermaid
sequenceDiagram
autonumber
actor Admin as Admin/CLI
participant Disagg as OpenAIDisaggServer
participant CS as ClusterStorage
participant CM as ClusterManager
participant CtxR as ContextRouter
participant GenR as GenerationRouter
participant CtxW as Context Workers
participant GenW as Generation Workers
actor Client as Client
Admin->>Disagg: start with cluster_config
Disagg->>CS: create storage client
Disagg->>CM: start manager, watch workers
loop watch events
CM-->>Disagg: worker up/down events
Disagg->>CtxR: add/remove server (ctx role)
Disagg->>GenR: add/remove server (gen role)
end
Client->>Disagg: completion/chat request
Disagg->>Disagg: is_ready? (CM ready and routers non-empty)
alt ready
Disagg->>CtxR: select ctx server
CtxR-->>CtxW: route context
Disagg->>GenR: select gen server
GenR-->>GenW: route generation
GenW-->>Client: response
else not ready
Disagg-->>Client: 503/ready=false
    end
```
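To make the watch loop concrete, here is a minimal sketch of the event-driven router update the diagrams describe. The names (`get_worker_events`, `worker_info` fields, `add_server`/`remove_server`) are taken from symbols referenced later in this review, but the exact signatures and event shapes are assumptions, not the PR's actual implementation:

```python
import asyncio

# Hypothetical watch-driven router maintenance loop: worker-up events add the
# worker's URL to the router matching its role; worker-down events remove it.
async def update_routers(cluster_manager, ctx_router, gen_router):
    while True:
        # Assumed to block until the cluster manager observes worker changes.
        events = await cluster_manager.get_worker_events()
        for event_type, worker in events:
            router = ctx_router if worker.role == "context" else gen_router
            url = f"http://{worker.host}:{worker.port}"
            if event_type == "set":    # worker registered or lease refreshed
                await router.add_server(url)
            else:                      # worker deregistered or lease expired
                await router.remove_server(url)
```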
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
tensorrt_llm/serve/openai_disagg_server.py (1)
146-155: Initialize `_update_worker_task` before use
`self._update_worker_task` is only assigned inside the cluster-manager branch, but the teardown path always evaluates `if self._update_worker_task:`. When cluster support is disabled (the default today), that line raises `AttributeError` during shutdown and the server exits uncleanly. Initialize the attribute up front (e.g. to `None`) and guard the cancel/await block.

```diff
 self.conditional_disagg_config = config.conditional_disagg_config
 self.cluster_config = config.cluster_config
 self.cluster_storage = None
 self.cluster_manager = None
+self._update_worker_task: asyncio.Task | None = None
@@
-        if self._update_worker_task:
-            self._update_worker_task.cancel()
+        if self._update_worker_task:
+            self._update_worker_task.cancel()
+            with contextlib.suppress(asyncio.CancelledError):
+                await self._update_worker_task
+            self._update_worker_task = None
```

tensorrt_llm/llmapi/disagg_utils.py (2)
11-16: `__all__` exports undefined names

'ServerConfig' and 'extract_server_configs' are not defined in this module. This can break users relying on star-imports/public API.
Apply:
```diff
 __all__ = [
-    'ServerConfig',
     'parse_disagg_config_file',
-    'extract_server_configs',
+    'extract_disagg_cfg',
+    'extract_cluster_config',
+    'get_ctx_gen_server_urls',
+    'extract_ctx_gen_cfgs',
+    'extract_router_config',
+    'get_server_configs_dict',
     'split_world_comm',
+    'parse_metadata_server_config_file',
+    'CtxGenServerConfig',
+    'RouterConfig',
+    'ConditionalDisaggConfig',
+    'MinimalInstances',
+    'DisaggClusterConfig',
+    'DisaggServerConfig',
+    'MetadataServerConfig',
 ]
```
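To see why this matters: a stale `__all__` entry fails loudly at import time. A small self-contained repro (simulating the star-import lookup on a throwaway module, not the real one):

```python
import types

# Build a module whose __all__ names a symbol that does not exist,
# mirroring the 'ServerConfig' situation flagged above.
mod = types.ModuleType("demo")
mod.__all__ = ["missing_name"]

try:
    # `from demo import *` performs exactly this lookup for each __all__ entry.
    exported = {name: getattr(mod, name) for name in mod.__all__}
except AttributeError as exc:
    print(f"star-import would fail: {exc}")
```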
83-86: PEP 585 generics break on Python 3.8

The target is Python 3.8+, so use typing.List/Tuple instead of list[...] and tuple[...].
Apply:
```diff
-def get_ctx_gen_server_urls(
-        server_configs: list[CtxGenServerConfig]
-) -> tuple[list[str], list[str]]:
+def get_ctx_gen_server_urls(
+        server_configs: List[CtxGenServerConfig]
+) -> Tuple[List[str], List[str]]:
```

As per coding guidelines.
tensorrt_llm/serve/cluster_storage.py (2)
44-54: Fix Queue.task_done() accounting in drain()

You consume N events but call task_done() only once. This breaks queue accounting (join() can hang, internal counters skew).
Apply:
```diff
 async def drain(self):
     events = []
-    event = await self.events.get()
+    event = await self.events.get()
     logger.debug(f"Draining watch event: {self.events.qsize()}")
     events.append(event)
-    while not self.events.empty():
-        event = self.events.get_nowait()
-        events.append(event)
-    self.events.task_done()
+    self.events.task_done()
+    while not self.events.empty():
+        event = self.events.get_nowait()
+        events.append(event)
+        self.events.task_done()
     logger.debug(f"after draining watch event: {self.events.qsize()}")
     return events
```
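As a standalone illustration of the accounting rule the fix restores (generic `asyncio.Queue` usage, independent of this codebase):

```python
import asyncio

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(3):
        queue.put_nowait(i)

    # Every get()/get_nowait() must be paired with exactly one task_done();
    # otherwise the internal counter never reaches zero and join() hangs.
    drained = [await queue.get()]
    queue.task_done()
    while not queue.empty():
        drained.append(queue.get_nowait())
        queue.task_done()

    await queue.join()  # returns immediately because the counts balance
    print(drained)      # [0, 1, 2]

asyncio.run(main())
```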
1-1: Add NVIDIA Apache-2.0 header

Source files must include the NVIDIA Apache-2.0 header with the current year; add it at the top. As per coding guidelines.
```diff
+# Copyright 2025 NVIDIA CORPORATION & AFFILIATES
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
```

tensorrt_llm/serve/openai_server.py (1)
1-1: Add NVIDIA Apache-2.0 header

Required on all source files. As per coding guidelines.
```diff
+#!/usr/bin/env python
+# Copyright 2025 NVIDIA CORPORATION & AFFILIATES
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
-#!/usr/bin/env python
```
🧹 Nitpick comments (11)
tensorrt_llm/llmapi/disagg_utils.py (2)
35-39: server_role should be Optional

Annotated as ServerRole but the default is None → tighten the typing.
Apply:
```diff
 @dataclass
 class RouterConfig():
     type: str = "round_robin"
     args: dict = field(default_factory=dict)
-    server_role: ServerRole = None
+    server_role: Optional[ServerRole] = None
```

As per coding guidelines.
243-263: extract_cluster_config helper looks good; minor polish optional

Key validation and nested dataclass handling are solid. If you want to appease TRY003, shorten the KeyError message or move the long text into the exception class. Otherwise, LGTM.
tests/integration/defs/disaggregated/test_auto_scaling.py (6)
127-146: Add timeouts and remove redundant sleep; fix f-string without placeholders

Avoid infinite waits and redundant delays.
Apply:
```diff
 async def wait_for_disagg_server_ready(port):
     while True:
-        await asyncio.sleep(3)
-        logger.info(f"Waiting for disagg server to be ready")
+        await asyncio.sleep(3)
+        logger.info("Waiting for disagg server to be ready")
         try:
-            info_resp = requests.get(f"http://localhost:{port}/cluster_info")
+            info_resp = requests.get(f"http://localhost:{port}/cluster_info",
+                                     timeout=REQUEST_TIMEOUT)
             if info_resp.status_code == 200:
                 info = info_resp.json()
                 if info["is_ready"]:
                     break
                 logger.info(
                     f"Waiting for disagg server to be ready: {info_resp.json()}"
                 )
             else:
                 logger.info(
                     f"Failed to get cluster info: {info_resp.status_code}")
-            await asyncio.sleep(3)
         except requests.exceptions.RequestException as e:
             logger.error(f"Failed to get cluster info: {e}")
```
148-158: Add request timeout to worker health checks

Reliability improvement.
Apply:
```diff
 async def wait_for_worker_ready(port):
     while True:
         await asyncio.sleep(3)
         logger.info(f"Waiting for worker {port} to be ready")
         try:
-            info_resp = requests.get(f"http://localhost:{port}/health")
+            info_resp = requests.get(f"http://localhost:{port}/health",
+                                     timeout=REQUEST_TIMEOUT)
             if info_resp.status_code == 200:
                 break
         except requests.exceptions.RequestException as e:
             logger.info(f"Failed to get worker info: {e}")
```
160-174: Add request timeout to cluster_info verification

Apply:
```diff
 def verify_cluster_info(ready,
                         ctx_workers=-1,
                         gen_workers=-1,
                         port=TEST_PORT,
                         expected_code=200):
-    info_resp = requests.get(f"http://localhost:{port}/cluster_info")
+    info_resp = requests.get(f"http://localhost:{port}/cluster_info",
+                             timeout=REQUEST_TIMEOUT)
     assert info_resp.status_code == expected_code
```
116-124: Avoid predictable /tmp path for config file

Use NamedTemporaryFile to reduce collision/injection risk.
Apply:
```diff
 def run_disagg_server(disagg_cluster_config, port=TEST_PORT):
-    disagg_server_config_path = f"/tmp/disagg_server_{port}_config.yaml"
-    disagg_cluster_config["port"] = port
-    with open(disagg_server_config_path, "w+") as f:
-        yaml.dump(disagg_cluster_config, f)
-    cmds = ["trtllm-serve", "disaggregated", "-c", disagg_server_config_path]
+    disagg_cluster_config["port"] = port
+    with tempfile.NamedTemporaryFile(mode="w+", suffix=f"_{port}_config.yaml",
+                                     delete=False) as tf:
+        yaml.dump(disagg_cluster_config, tf)
+        tf.flush()
+        disagg_server_config_path = tf.name
+    cmds = ["trtllm-serve", "disaggregated", "-c", disagg_server_config_path]
     f = open("disagg_server.log", "w+")
     p = subprocess.Popen(cmds, stdout=f, stderr=f)
     return p
```
259-326: Initialize proc vars; use a specific exception type; mark router as used

Improves stability and lint cleanliness.
Apply:
@pytest.mark.parametrize("router", ROUTER_TYPES, indirect=True) @pytest.mark.asyncio(loop_scope="module") @pytest.mark.timeout(600) async def test_worker_restart(model_name, disagg_server_config, worker_config, router): - try: + assert router in ROUTER_TYPES + ctx_worker1 = None + ctx_worker2 = None + gen_worker1 = None + gen_worker2 = None + disagg_server = None + try: @@ - with pytest.raises(Exception): + from openai import APIStatusError + with pytest.raises(APIStatusError): request_completion(model_name, "Hello, my name is", port=TEST_PORT) @@ - with pytest.raises(Exception): + with pytest.raises(APIStatusError): request_completion(model_name, test_prompt, port=TEST_PORT)
331-363: Initialize proc vars; assert router used; tighten the expected failure exception

Use AssertionError for verify_cluster_info(expected_code=500).
Apply:
@pytest.mark.parametrize("router", ["round_robin"], indirect=True) @pytest.mark.asyncio(loop_scope="module") @pytest.mark.timeout(300) async def test_disagg_server_restart(model_name, disagg_server_config, worker_config, router): - try: + assert router in ROUTER_TYPES + ctx_worker1 = None + gen_worker1 = None + disagg_server = None + try: @@ - with pytest.raises(Exception): + with pytest.raises(AssertionError): verify_cluster_info(False, 1, 1, expected_code=500)tensorrt_llm/serve/cluster_storage.py (2)
219-226: Type hint mismatch: get_prefix returns Dict, not List

The return type is Dict[str, str]; fix the annotation for clarity and tooling.
```diff
-    async def get_prefix(self,
-                         key_prefix: str,
-                         keys_only: bool = False) -> List[str]:
+    async def get_prefix(self,
+                         key_prefix: str,
+                         keys_only: bool = False) -> Dict[str, str]:
```
251-260: Reduce log noise in _notify_watch_event

logger.info runs for every watch handle, even when the key doesn’t match. Log only on an actual notify (or downgrade to debug).
```diff
 for watch_key, handle in self._watch_handles.items():
     if key.startswith(watch_key):
         # update queue immediately and wake up the event loop
         handle.events.put_nowait(
             WatchEvent(storage_item, event_type))
-    logger.info(
-        f"Notifying watch event for watch key {watch_key} with type {event_type}"
-    )
+        logger.info(
+            f"Notifying watch event for watch key {watch_key} with type {event_type}"
+        )
```

tensorrt_llm/serve/openai_server.py (1)
81-83: Type hint: make metadata_server_cfg Optional

It can be None (create_metadata_server handles that), so reflect it in the signature.
```diff
-                 metadata_server_cfg: MetadataServerConfig,
+                 metadata_server_cfg: Optional[MetadataServerConfig],
                  cluster_config: Optional[DisaggClusterConfig] = None):
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- tensorrt_llm/commands/serve.py (6 hunks)
- tensorrt_llm/llmapi/disagg_utils.py (5 hunks)
- tensorrt_llm/serve/cluster_storage.py (1 hunks)
- tensorrt_llm/serve/openai_disagg_server.py (10 hunks)
- tensorrt_llm/serve/openai_server.py (5 hunks)
- tensorrt_llm/serve/router.py (8 hunks)
- tests/integration/defs/disaggregated/test_auto_scaling.py (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_disagg_server.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/router.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_disagg_server.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/router.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_disagg_server.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/router.py
🧬 Code graph analysis (7)

tensorrt_llm/serve/cluster_storage.py (1)
- tensorrt_llm/logger.py (1): debug (144-145)

tensorrt_llm/serve/openai_server.py (4)
- tensorrt_llm/llmapi/disagg_utils.py (3): DisaggClusterConfig (53-58), MetadataServerConfig (75-80), ServerRole (19-22)
- tensorrt_llm/commands/serve.py (1): serve (340-405)
- tensorrt_llm/serve/cluster_storage.py (1): create_cluster_storage_client (112-115)
- tensorrt_llm/serve/disagg_auto_scaling.py (2): DisaggClusterWorker (189-301), register_worker (229-264)

tensorrt_llm/llmapi/disagg_utils.py (1)
- tests/unittest/disaggregated/test_disagg_cluster_manager_worker.py (1): config (34-41)

tensorrt_llm/serve/openai_disagg_server.py (5)
- tensorrt_llm/llmapi/disagg_utils.py (2): MetadataServerConfig (75-80), ServerRole (19-22)
- tensorrt_llm/commands/serve.py (1): serve (340-405)
- tensorrt_llm/serve/cluster_storage.py (4): WatchEventType (26-28), create_cluster_storage (106-109), start (63-64), start (158-161)
- tensorrt_llm/serve/disagg_auto_scaling.py (5): DisaggClusterManager (32-186), watch_workers (90-104), is_ready_with_router (184-186), worker_info (218-222), get_worker_events (110-125)
- tensorrt_llm/serve/router.py (10): KvCacheAwareRouter (580-707), servers (168-169), add_server (172-176), add_server (444-454), add_server (510-520), add_server (628-638), remove_server (179-183), remove_server (456-467), remove_server (522-533), remove_server (642-650)

tests/integration/defs/disaggregated/test_auto_scaling.py (1)
- tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (2): env (54-61), disagg_server (162-179)

tensorrt_llm/commands/serve.py (1)
- tensorrt_llm/llmapi/disagg_utils.py (4): DisaggClusterConfig (53-58), MetadataServerConfig (75-80), ServerRole (19-22), extract_cluster_config (243-262)

tensorrt_llm/serve/router.py (2)
- tests/unittest/disaggregated/test_router.py (1): servers (50-51)
- tensorrt_llm/logger.py (2): warning (132-133), debug (144-145)
🪛 Ruff (0.13.3)
tensorrt_llm/llmapi/disagg_utils.py
250-251: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/serve/openai_disagg_server.py
95-95: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Avoid specifying long messages outside the exception class
(TRY003)
102-102: Avoid specifying long messages outside the exception class
(TRY003)
105-105: Avoid specifying long messages outside the exception class
(TRY003)
621-621: Do not assign a lambda expression, use a def
Rewrite worker_repr as a def
(E731)
636-636: Do not catch blind exception: Exception
(BLE001)
tests/integration/defs/disaggregated/test_auto_scaling.py
99-99: subprocess call: check for execution of untrusted input
(S603)
117-117: Probable insecure usage of temporary file or directory: "/tmp/disagg_server_"
(S108)
123-123: subprocess call: check for execution of untrusted input
(S603)
130-130: f-string without any placeholders
Remove extraneous f prefix
(F541)
132-132: Probable use of requests call without timeout
(S113)
153-153: Probable use of requests call without timeout
(S113)
165-165: Probable use of requests call without timeout
(S113)
181-182: try-except-pass detected, consider logging the exception
(S110)
181-181: Do not catch blind exception: Exception
(BLE001)
198-198: Unused function argument: router
(ARG001)
220-220: Unused function argument: router
(ARG001)
237-237: Do not assert blind exception: Exception
(B017)
260-260: Unused function argument: router
(ARG001)
276-276: Do not assert blind exception: Exception
(B017)
296-296: Do not assert blind exception: Exception
(B017)
332-332: Unused function argument: router
(ARG001)
349-349: Do not assert blind exception: Exception
(B017)
tensorrt_llm/commands/serve.py
402-403: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
402-403: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/serve/router.py
450-450: Consider [*old_servers.copy(), server] instead of concatenation
Replace with [*old_servers.copy(), server]
(RUF005)
517-517: Consider [*old_servers.copy(), server] instead of concatenation
Replace with [*old_servers.copy(), server]
(RUF005)
740-741: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/test_lists/test-db/l0_h100.yml (1)
40-40: Auto-scaling test inclusion looks good

Nice to see the new disaggregated auto-scaling coverage wired into the H100 pre-merge suite. This aligns with the surrounding disaggregated test set.
tensorrt_llm/serve/cluster_storage.py (1)
280-283: Good: reduce noisy logs in idle cycles

Guarding the debug log avoids spam when nothing expires. LGTM.
tensorrt_llm/serve/openai_server.py (5)
31-33: Imports look fine
38-40: Imports look fine
86-92: Store binding fields eagerly

Storing binding_addr/host/port for later registration is reasonable.
140-143: Init cluster fields

Good to initialize to None to avoid attribute errors.
966-968: Persist host/port before serving

Required for worker registration. Looks good.
📝 Walkthrough

Adds disaggregated cluster support across the serve CLI, server initialization, and routers. Introduces cluster_config plumbing, cluster manager/worker/storage lifecycle, readiness gating, a /cluster_info endpoint, and dynamic router add/remove. Extends disaggregation utilities to construct cluster configs. Adjusts logging in cluster storage. Adds integration tests for auto-scaling and discovery.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant CLI as serve CLI
participant Serve as launch_server
participant OAI as OpenAIServer
participant OAI-D as OpenAIDisaggServer
participant CS as ClusterStorage
participant CM as ClusterManager
participant CW as ClusterWorker
rect rgba(230,240,255,0.4)
note over CLI: Startup with disaggregated cluster
CLI->>Serve: parse args (disagg_cluster_uri/server_role)
Serve->>Serve: extract_cluster_config(...)
Serve->>OAI: init(..., cluster_config)
Serve->>OAI-D: init(..., cluster_config)
end
rect rgba(240,255,230,0.4)
note over OAI,OAI-D: Lifespan startup
OAI->>CS: create client (if cluster_config)
OAI->>CW: init worker(host, port)
OAI-->>CS: register worker (deferred in lifespan)
OAI-D->>CS: create client (if cluster_config)
OAI-D->>CM: start manager (watch workers)
CM-->>OAI-D: watch events (add/remove)
OAI-D->>OAI-D: _update_router_by_watch_events
    end
```

```mermaid
sequenceDiagram
autonumber
participant Client as HTTP Client
participant OAI-D as OpenAIDisaggServer
participant RouterC as Context Router
participant RouterG as Generation Router
Client->>OAI-D: GET /cluster_info
OAI-D->>OAI-D: is_ready()
OAI-D-->>Client: status, counts
Client->>OAI-D: POST /v1/completions
OAI-D->>OAI-D: is_ready()? (gate)
alt ready
OAI-D->>RouterC: get_next_server()
OAI-D->>RouterG: get_next_server()
OAI-D-->>Client: completion
else not ready
OAI-D-->>Client: 503/health not ready
end
note over OAI-D,RouterC: On CM events
OAI-D->>RouterC: add_server/remove_server
OAI-D->>RouterG: add_server/remove_server
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
tensorrt_llm/serve/openai_server.py (1)
77-83: Make metadata_server_cfg Optional to match call sites and factory

Type it as Optional and default to None to avoid a surprising type mismatch and align with create_metadata_server usage.
```diff
-    def __init__(self,
-                 llm: Union[LLM, MultimodalEncoder],
-                 model: str,
-                 server_role: Optional[ServerRole],
-                 metadata_server_cfg: MetadataServerConfig,
-                 cluster_config: Optional[DisaggClusterConfig] = None):
+    def __init__(self,
+                 llm: Union[LLM, MultimodalEncoder],
+                 model: str,
+                 server_role: Optional[ServerRole],
+                 metadata_server_cfg: Optional[MetadataServerConfig] = None,
+                 cluster_config: Optional[DisaggClusterConfig] = None):
```

tensorrt_llm/serve/router.py (4)
248-275: Don’t assert in the monitoring loop; handle empty/failed states and don’t re-raise

Asserting on runtime state and re-raising inside a long-lived task will kill monitoring and the server’s resilience. Log and continue instead.
```diff
-            assert final_servers, f"No {self._server_role} servers available"
+            if not final_servers:
+                logger.warning(f"No {self._server_role} servers available")
+                await asyncio.sleep(poll_interval)
+                continue
@@
-        except Exception as e:
-            logger.error(f"Error in server monitoring: {e}")
-            raise
+        except Exception as e:
+            logger.error(f"Error in server monitoring: {e}")
+            await asyncio.sleep(poll_interval)
+            continue
```
381-399: Health-check should not assert on timeout; provide a safe default

When metadata_server_cfg is None or missing a timeout, the assert will raise. Use a default timeout.
```diff
-        assert self._health_check_timeout is not None, "health_check_timeout is not set"
+        timeout = self._health_check_timeout or 5.0
         try:
             async with self._session.get(
-                    f"{server_url}/health",
-                    timeout=self._health_check_timeout) as response:
+                    f"{server_url}/health",
+                    timeout=timeout) as response:
```
68-74: ServerState.is_healthy uses undefined self._session

This method will fail if used. Either pass a session in or create one locally; a minimal fix is below.
```diff
 async def is_healthy(self) -> bool:
-    try:
-        async with self._session.get(self._server + "/health") as response:
-            return response.status == 200
-    except Exception:
-        return False
+    try:
+        async with aiohttp.ClientSession() as session:
+            async with session.get(self._server + "/health") as response:
+                return response.status == 200
+    except Exception:
+        return False
```
1-1: Missing NVIDIA Apache-2.0 header

Add the NVIDIA Apache-2.0 copyright header with the current year at the top.
tensorrt_llm/commands/serve.py (1)
1-1: Missing NVIDIA Apache-2.0 header

Add the NVIDIA Apache-2.0 copyright header with the current year at the top.
🧹 Nitpick comments (6)
tensorrt_llm/serve/openai_disagg_server.py (1)
620-638: Avoid assigning a lambda; define a small helper instead

Define worker_repr with def for clarity and to satisfy linters.
```diff
-        worker_repr = lambda worker_info: f"http://{worker_info.host}:{worker_info.port}"
+        def worker_repr(worker_info):
+            return f"http://{worker_info.host}:{worker_info.port}"
```

Additionally, consider narrowing the broad except to expected exceptions in this loop.
tensorrt_llm/serve/router.py (1)
167-170: Avoid exposing mutable internal state in the servers property

Return a copy to prevent external mutation and lock bypass.
```diff
 @property
 def servers(self) -> List[str]:
-    return self._servers
+    return list(self._servers)
```

tests/integration/defs/disaggregated/test_auto_scaling.py (4)
130-146: Add timeouts to requests and remove an unnecessary f-string

Avoid indefinite blocking in tests and fix F541.
- logger.info(f"Waiting for disagg server to be ready") + logger.info("Waiting for disagg server to be ready") try: - info_resp = requests.get(f"http://localhost:{port}/cluster_info") + info_resp = requests.get(f"http://localhost:{port}/cluster_info", + timeout=5) if info_resp.status_code == 200: info = info_resp.json() if info["is_ready"]: break logger.info( f"Waiting for disagg server to be ready: {info_resp.json()}" ) else: logger.info( f"Failed to get cluster info: {info_resp.status_code}") await asyncio.sleep(3) except requests.exceptions.RequestException as e: logger.error(f"Failed to get cluster info: {e}")
149-158: Add a timeout to worker health checks

Prevents hangs.
- info_resp = requests.get(f"http://localhost:{port}/health") + info_resp = requests.get(f"http://localhost:{port}/health", + timeout=5)
165-174: Add a timeout to cluster_info verification requests

Prevents hangs.
- info_resp = requests.get(f"http://localhost:{port}/cluster_info") + info_resp = requests.get(f"http://localhost:{port}/cluster_info", + timeout=5)
176-183: Log exceptions on terminate for triage

Don’t silently swallow errors in the integration harness.
```diff
 def terminate(*args):
     try:
         for arg in args:
             arg.terminate()
             arg.wait()
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"terminate() encountered an error: {e}")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- tensorrt_llm/commands/serve.py (6 hunks)
- tensorrt_llm/llmapi/disagg_utils.py (5 hunks)
- tensorrt_llm/serve/cluster_storage.py (1 hunks)
- tensorrt_llm/serve/openai_disagg_server.py (10 hunks)
- tensorrt_llm/serve/openai_server.py (5 hunks)
- tensorrt_llm/serve/router.py (8 hunks)
- tests/integration/defs/disaggregated/test_auto_scaling.py (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/openai_disagg_server.py
- tensorrt_llm/serve/router.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/openai_disagg_server.py
- tensorrt_llm/serve/router.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/openai_disagg_server.py
- tensorrt_llm/serve/router.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
🧬 Code graph analysis (7)

tensorrt_llm/serve/cluster_storage.py (1)
- tensorrt_llm/logger.py (1): debug (144-145)

tensorrt_llm/llmapi/disagg_utils.py (1)
- tests/unittest/disaggregated/test_disagg_cluster_manager_worker.py (1): config (34-41)

tensorrt_llm/serve/openai_server.py (4)
- tensorrt_llm/llmapi/disagg_utils.py (2): DisaggClusterConfig (53-58), ServerRole (19-22)
- tensorrt_llm/commands/serve.py (1): serve (340-405)
- tensorrt_llm/serve/cluster_storage.py (1): create_cluster_storage_client (112-115)
- tensorrt_llm/serve/disagg_auto_scaling.py (2): DisaggClusterWorker (189-301), register_worker (229-264)

tensorrt_llm/commands/serve.py (1)
- tensorrt_llm/llmapi/disagg_utils.py (4): DisaggClusterConfig (53-58), MetadataServerConfig (75-80), ServerRole (19-22), extract_cluster_config (243-262)

tensorrt_llm/serve/openai_disagg_server.py (5)
- tensorrt_llm/llmapi/disagg_utils.py (1): ServerRole (19-22)
- tensorrt_llm/commands/serve.py (1): serve (340-405)
- tensorrt_llm/serve/cluster_storage.py (4): WatchEventType (26-28), create_cluster_storage (106-109), start (63-64), start (158-161)
- tensorrt_llm/serve/disagg_auto_scaling.py (6): DisaggClusterManager (32-186), watch_workers (90-104), is_ready_with_router (184-186), worker_info (218-222), get_worker_events (110-125), worker_id (214-215)
- tensorrt_llm/serve/router.py (10): KvCacheAwareRouter (580-707), servers (168-169), add_server (172-176), add_server (444-454), add_server (510-520), add_server (628-638), remove_server (179-183), remove_server (456-467), remove_server (522-533), remove_server (642-650)

tensorrt_llm/serve/router.py (1)
- tensorrt_llm/logger.py (2): warning (132-133), debug (144-145)

tests/integration/defs/disaggregated/test_auto_scaling.py (2)
- tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (2): env (54-61), disagg_server (162-179)
- tensorrt_llm/logger.py (1): error (126-127)
🪛 Ruff (0.13.3)
tensorrt_llm/llmapi/disagg_utils.py
250-251: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/commands/serve.py
402-403: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
402-403: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/serve/openai_disagg_server.py
95-95: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Avoid specifying long messages outside the exception class
(TRY003)
102-102: Avoid specifying long messages outside the exception class
(TRY003)
105-105: Avoid specifying long messages outside the exception class
(TRY003)
621-621: Do not assign a lambda expression, use a def
Rewrite worker_repr as a def
(E731)
636-636: Do not catch blind exception: Exception
(BLE001)
tensorrt_llm/serve/router.py
450-450: Consider [*old_servers.copy(), server] instead of concatenation
Replace with [*old_servers.copy(), server]
(RUF005)
517-517: Consider [*old_servers.copy(), server] instead of concatenation
Replace with [*old_servers.copy(), server]
(RUF005)
740-741: Avoid specifying long messages outside the exception class
(TRY003)
tests/integration/defs/disaggregated/test_auto_scaling.py
99-99: subprocess call: check for execution of untrusted input
(S603)
117-117: Probable insecure usage of temporary file or directory: "/tmp/disagg_server_"
(S108)
123-123: subprocess call: check for execution of untrusted input
(S603)
130-130: f-string without any placeholders
Remove extraneous f prefix
(F541)
132-132: Probable use of requests call without timeout
(S113)
153-153: Probable use of requests call without timeout
(S113)
165-165: Probable use of requests call without timeout
(S113)
181-182: try-except-pass detected, consider logging the exception
(S110)
181-181: Do not catch blind exception: Exception
(BLE001)
198-198: Unused function argument: router
(ARG001)
220-220: Unused function argument: router
(ARG001)
237-237: Do not assert blind exception: Exception
(B017)
260-260: Unused function argument: router
(ARG001)
276-276: Do not assert blind exception: Exception
(B017)
296-296: Do not assert blind exception: Exception
(B017)
332-332: Unused function argument: router
(ARG001)
349-349: Do not assert blind exception: Exception
(B017)
🔇 Additional comments (2)
tensorrt_llm/serve/cluster_storage.py (1)
280-283: LGTM: avoid noisy debug log when nothing expires

Conditionally logging only when deletions occur is appropriate.
tensorrt_llm/llmapi/disagg_utils.py (1)
71-71: LGTM: DisaggServerConfig gains cluster_config

The addition looks consistent with the new server plumbing.
Force-pushed from 688a074 to 2fcd02a
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Force-pushed from 2fcd02a to fe52c68
/bot run

PR_Github #20873 [ run ] triggered by Bot
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
PR_Github #20873 [ run ] completed with state
Force-pushed from e2f71af to 4ec6d1a
/bot run

PR_Github #20956 [ run ] triggered by Bot

PR_Github #20956 [ run ] completed with state
Force-pushed from 4ec6d1a to 2c5730b
/bot run

/bot run --disable-fail-fast

PR_Github #21848 [ run ] triggered by Bot. Commit:

PR_Github #21848 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #21851 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #21857 [ run ] triggered by Bot. Commit:

PR_Github #21851 [ run ] completed with state

PR_Github #21857 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #21885 [ run ] triggered by Bot. Commit:

PR_Github #21885 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #21977 [ run ] triggered by Bot. Commit:

PR_Github #21977 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #22022 [ run ] triggered by Bot. Commit:

PR_Github #22022 [ run ] completed with state
@reasonsolo Hi Lizhi, do we have any doc or example telling users how to use this feature? Thanks
PR for doc and example is ready at #8602 |
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Summary by CodeRabbit
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages, and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the --stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
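For example, combining the flags documented above (the pipeline id and stage name here are illustrative placeholders):

```
/bot run --disable-fail-fast
/bot run --stage-list "A10-PyTorch-1" --gpu-type "A30"
/bot run --reuse-test 21848
```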
kill

Kill all running builds associated with the pull request.
skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.