[TRTLLM-7843][feat] implement disagg cluster auto-scaling #8215
Conversation
Force-pushed from f59ef4c to 79b56a0
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Force-pushed from 79b56a0 to 459cfcf
📝 Walkthrough

Adds cluster-aware disaggregated serving: new cluster config plumbing from the CLI and serve entrypoint into OpenAIServer and the disaggregated server; dynamic router server management; cluster storage/manager lifecycle, readiness checks, and worker discovery. Introduces cluster configuration extraction utilities. Adds integration tests for auto-scaling, minimal instances, and restarts, plus a test list entry.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor Admin as Admin/CLI
participant Serve as serve.py
participant OAIS as OpenAIServer
participant CS as ClusterStorage
participant CW as ClusterWorker
Admin->>Serve: serve --disagg_cluster_uri=...
Serve->>Serve: derive cluster_config
Serve->>OAIS: init(model, server_role, metadata_cfg, cluster_config)
OAIS->>OAIS: store host/port on bind
OAIS->>CS: create storage client
OAIS->>CW: create and register worker (role from server_role)
CW-->>CS: register(worker info)
    note over OAIS,CS: Worker registered for discovery
```

```mermaid
sequenceDiagram
autonumber
actor Admin as Admin/CLI
participant Disagg as OpenAIDisaggServer
participant CS as ClusterStorage
participant CM as ClusterManager
participant CtxR as ContextRouter
participant GenR as GenerationRouter
participant CtxW as Context Workers
participant GenW as Generation Workers
actor Client as Client
Admin->>Disagg: start with cluster_config
Disagg->>CS: create storage client
Disagg->>CM: start manager, watch workers
loop watch events
CM-->>Disagg: worker up/down events
Disagg->>CtxR: add/remove server (ctx role)
Disagg->>GenR: add/remove server (gen role)
end
Client->>Disagg: completion/chat request
Disagg->>Disagg: is_ready? (CM ready and routers non-empty)
alt ready
Disagg->>CtxR: select ctx server
CtxR-->>CtxW: route context
Disagg->>GenR: select gen server
GenR-->>GenW: route generation
GenW-->>Client: response
else not ready
Disagg-->>Client: 503/ready=false
    end
```
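To make the watch loop concrete, here is a minimal sketch of the event-driven router update the diagrams describe. The names (`get_worker_events`, `worker_info` fields, `add_server`/`remove_server`) are taken from symbols referenced later in this review, but the exact signatures and event shapes are assumptions, not the PR's actual implementation:

```python
import asyncio

# Hypothetical watch-driven router maintenance loop: worker-up events add the
# worker's URL to the router matching its role; worker-down events remove it.
async def update_routers(cluster_manager, ctx_router, gen_router):
    while True:
        # Assumed to block until the cluster manager observes worker changes.
        events = await cluster_manager.get_worker_events()
        for event_type, worker in events:
            router = ctx_router if worker.role == "context" else gen_router
            url = f"http://{worker.host}:{worker.port}"
            if event_type == "set":    # worker registered or lease refreshed
                await router.add_server(url)
            else:                      # worker deregistered or lease expired
                await router.remove_server(url)
```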
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
tensorrt_llm/serve/openai_disagg_server.py (1)
146-155: Initialize `_update_worker_task` before use
`self._update_worker_task` is only assigned inside the cluster-manager branch, but the teardown path always evaluates `if self._update_worker_task:`. When cluster support is disabled (the default today), that line raises `AttributeError` during shutdown and the server exits uncleanly. Initialize the attribute up front (e.g. to `None`) and guard the cancel/await block.

```diff
 self.conditional_disagg_config = config.conditional_disagg_config
 self.cluster_config = config.cluster_config
 self.cluster_storage = None
 self.cluster_manager = None
+self._update_worker_task: asyncio.Task | None = None
@@
-        if self._update_worker_task:
-            self._update_worker_task.cancel()
+        if self._update_worker_task:
+            self._update_worker_task.cancel()
+            with contextlib.suppress(asyncio.CancelledError):
+                await self._update_worker_task
+            self._update_worker_task = None
```

tensorrt_llm/llmapi/disagg_utils.py (2)
11-16: `__all__` exports undefined names

'ServerConfig' and 'extract_server_configs' are not defined in this module. This can break users relying on star-imports/public API.
Apply:
```diff
 __all__ = [
-    'ServerConfig',
     'parse_disagg_config_file',
-    'extract_server_configs',
+    'extract_disagg_cfg',
+    'extract_cluster_config',
+    'get_ctx_gen_server_urls',
+    'extract_ctx_gen_cfgs',
+    'extract_router_config',
+    'get_server_configs_dict',
     'split_world_comm',
+    'parse_metadata_server_config_file',
+    'CtxGenServerConfig',
+    'RouterConfig',
+    'ConditionalDisaggConfig',
+    'MinimalInstances',
+    'DisaggClusterConfig',
+    'DisaggServerConfig',
+    'MetadataServerConfig',
 ]
```
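To see why this matters: a stale `__all__` entry fails loudly at import time. A small self-contained repro (simulating the star-import lookup on a throwaway module, not the real one):

```python
import types

# Build a module whose __all__ names a symbol that does not exist,
# mirroring the 'ServerConfig' situation flagged above.
mod = types.ModuleType("demo")
mod.__all__ = ["missing_name"]

try:
    # `from demo import *` performs exactly this lookup for each __all__ entry.
    exported = {name: getattr(mod, name) for name in mod.__all__}
except AttributeError as exc:
    print(f"star-import would fail: {exc}")
```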
83-86: PEP 585 generics break on Python 3.8

The target is Python 3.8+, so use typing.List/Tuple instead of list[...] and tuple[...].
Apply:
```diff
-def get_ctx_gen_server_urls(
-        server_configs: list[CtxGenServerConfig]
-) -> tuple[list[str], list[str]]:
+def get_ctx_gen_server_urls(
+        server_configs: List[CtxGenServerConfig]
+) -> Tuple[List[str], List[str]]:
```

As per coding guidelines.
tensorrt_llm/serve/cluster_storage.py (2)
44-54: Fix Queue.task_done() accounting in drain()

You consume N events but call task_done() only once. This breaks queue accounting (join() can hang, internal counters skew).
Apply:
```diff
 async def drain(self):
     events = []
-    event = await self.events.get()
+    event = await self.events.get()
     logger.debug(f"Draining watch event: {self.events.qsize()}")
     events.append(event)
-    while not self.events.empty():
-        event = self.events.get_nowait()
-        events.append(event)
-    self.events.task_done()
+    self.events.task_done()
+    while not self.events.empty():
+        event = self.events.get_nowait()
+        events.append(event)
+        self.events.task_done()
     logger.debug(f"after draining watch event: {self.events.qsize()}")
     return events
```
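As a standalone illustration of the accounting rule the fix restores (generic `asyncio.Queue` usage, independent of this codebase):

```python
import asyncio

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(3):
        queue.put_nowait(i)

    # Every get()/get_nowait() must be paired with exactly one task_done();
    # otherwise the internal counter never reaches zero and join() hangs.
    drained = [await queue.get()]
    queue.task_done()
    while not queue.empty():
        drained.append(queue.get_nowait())
        queue.task_done()

    await queue.join()  # returns immediately because the counts balance
    print(drained)      # [0, 1, 2]

asyncio.run(main())
```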
1-1: Add NVIDIA Apache-2.0 header

Source files must include the NVIDIA Apache-2.0 header with the current year; add it at the top. As per coding guidelines.
```diff
+# Copyright 2025 NVIDIA CORPORATION & AFFILIATES
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
```

tensorrt_llm/serve/openai_server.py (1)
1-1: Add NVIDIA Apache-2.0 header

Required on all source files. As per coding guidelines.
```diff
+#!/usr/bin/env python
+# Copyright 2025 NVIDIA CORPORATION & AFFILIATES
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
-#!/usr/bin/env python
```
🧹 Nitpick comments (11)
tensorrt_llm/llmapi/disagg_utils.py (2)
35-39: server_role should be Optional

Annotated as ServerRole but the default is None → tighten the typing.
Apply:
```diff
 @dataclass
 class RouterConfig():
     type: str = "round_robin"
     args: dict = field(default_factory=dict)
-    server_role: ServerRole = None
+    server_role: Optional[ServerRole] = None
```

As per coding guidelines.
243-263: extract_cluster_config helper looks good; minor polish optional

Key validation and nested dataclass handling are solid. If you want to appease TRY003, shorten the KeyError message or move the long text into the exception class. Otherwise, LGTM.
tests/integration/defs/disaggregated/test_auto_scaling.py (6)
127-146: Add timeouts and remove redundant sleep; fix f-string without placeholders

Avoid infinite waits and redundant delays.
Apply:
```diff
 async def wait_for_disagg_server_ready(port):
     while True:
-        await asyncio.sleep(3)
-        logger.info(f"Waiting for disagg server to be ready")
+        await asyncio.sleep(3)
+        logger.info("Waiting for disagg server to be ready")
         try:
-            info_resp = requests.get(f"http://localhost:{port}/cluster_info")
+            info_resp = requests.get(f"http://localhost:{port}/cluster_info",
+                                     timeout=REQUEST_TIMEOUT)
             if info_resp.status_code == 200:
                 info = info_resp.json()
                 if info["is_ready"]:
                     break
                 logger.info(
                     f"Waiting for disagg server to be ready: {info_resp.json()}"
                 )
             else:
                 logger.info(
                     f"Failed to get cluster info: {info_resp.status_code}")
-            await asyncio.sleep(3)
         except requests.exceptions.RequestException as e:
             logger.error(f"Failed to get cluster info: {e}")
```
148-158: Add request timeout to worker health checks

Reliability improvement.
Apply:
```diff
 async def wait_for_worker_ready(port):
     while True:
         await asyncio.sleep(3)
         logger.info(f"Waiting for worker {port} to be ready")
         try:
-            info_resp = requests.get(f"http://localhost:{port}/health")
+            info_resp = requests.get(f"http://localhost:{port}/health",
+                                     timeout=REQUEST_TIMEOUT)
             if info_resp.status_code == 200:
                 break
         except requests.exceptions.RequestException as e:
             logger.info(f"Failed to get worker info: {e}")
```
160-174: Add request timeout to cluster_info verification

Apply:
```diff
 def verify_cluster_info(ready,
                         ctx_workers=-1,
                         gen_workers=-1,
                         port=TEST_PORT,
                         expected_code=200):
-    info_resp = requests.get(f"http://localhost:{port}/cluster_info")
+    info_resp = requests.get(f"http://localhost:{port}/cluster_info",
+                             timeout=REQUEST_TIMEOUT)
     assert info_resp.status_code == expected_code
```
116-124: Avoid predictable /tmp path for config file

Use NamedTemporaryFile to reduce collision/injection risk.
Apply:
```diff
 def run_disagg_server(disagg_cluster_config, port=TEST_PORT):
-    disagg_server_config_path = f"/tmp/disagg_server_{port}_config.yaml"
-    disagg_cluster_config["port"] = port
-    with open(disagg_server_config_path, "w+") as f:
-        yaml.dump(disagg_cluster_config, f)
-    cmds = ["trtllm-serve", "disaggregated", "-c", disagg_server_config_path]
+    disagg_cluster_config["port"] = port
+    with tempfile.NamedTemporaryFile(mode="w+", suffix=f"_{port}_config.yaml",
+                                     delete=False) as tf:
+        yaml.dump(disagg_cluster_config, tf)
+        tf.flush()
+        disagg_server_config_path = tf.name
+    cmds = ["trtllm-serve", "disaggregated", "-c", disagg_server_config_path]
     f = open("disagg_server.log", "w+")
     p = subprocess.Popen(cmds, stdout=f, stderr=f)
     return p
```
259-326: Initialize proc vars; use a specific exception type; mark router as used

Improves stability and lint cleanliness.
Apply:
@pytest.mark.parametrize("router", ROUTER_TYPES, indirect=True) @pytest.mark.asyncio(loop_scope="module") @pytest.mark.timeout(600) async def test_worker_restart(model_name, disagg_server_config, worker_config, router): - try: + assert router in ROUTER_TYPES + ctx_worker1 = None + ctx_worker2 = None + gen_worker1 = None + gen_worker2 = None + disagg_server = None + try: @@ - with pytest.raises(Exception): + from openai import APIStatusError + with pytest.raises(APIStatusError): request_completion(model_name, "Hello, my name is", port=TEST_PORT) @@ - with pytest.raises(Exception): + with pytest.raises(APIStatusError): request_completion(model_name, test_prompt, port=TEST_PORT)
331-363: Initialize proc vars; assert router used; tighten the expected failure exception

Use AssertionError for verify_cluster_info(expected_code=500).
Apply:
@pytest.mark.parametrize("router", ["round_robin"], indirect=True) @pytest.mark.asyncio(loop_scope="module") @pytest.mark.timeout(300) async def test_disagg_server_restart(model_name, disagg_server_config, worker_config, router): - try: + assert router in ROUTER_TYPES + ctx_worker1 = None + gen_worker1 = None + disagg_server = None + try: @@ - with pytest.raises(Exception): + with pytest.raises(AssertionError): verify_cluster_info(False, 1, 1, expected_code=500)tensorrt_llm/serve/cluster_storage.py (2)
219-226: Type hint mismatch: get_prefix returns Dict, not List

The return type is Dict[str, str]; fix the annotation for clarity and tooling.
```diff
-    async def get_prefix(self,
-                         key_prefix: str,
-                         keys_only: bool = False) -> List[str]:
+    async def get_prefix(self,
+                         key_prefix: str,
+                         keys_only: bool = False) -> Dict[str, str]:
```
251-260: Reduce log noise in _notify_watch_event

logger.info runs for every watch handle, even when the key doesn’t match. Log only on an actual notify (or downgrade to debug).
```diff
 for watch_key, handle in self._watch_handles.items():
     if key.startswith(watch_key):
         # update queue immediately and wake up the event loop
         handle.events.put_nowait(
             WatchEvent(storage_item, event_type))
-    logger.info(
-        f"Notifying watch event for watch key {watch_key} with type {event_type}"
-    )
+        logger.info(
+            f"Notifying watch event for watch key {watch_key} with type {event_type}"
+        )
```

tensorrt_llm/serve/openai_server.py (1)
81-83: Type hint: make metadata_server_cfg Optional

It can be None (create_metadata_server handles that), so reflect it in the signature.
```diff
-                 metadata_server_cfg: MetadataServerConfig,
+                 metadata_server_cfg: Optional[MetadataServerConfig],
                  cluster_config: Optional[DisaggClusterConfig] = None):
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- tensorrt_llm/commands/serve.py (6 hunks)
- tensorrt_llm/llmapi/disagg_utils.py (5 hunks)
- tensorrt_llm/serve/cluster_storage.py (1 hunks)
- tensorrt_llm/serve/openai_disagg_server.py (10 hunks)
- tensorrt_llm/serve/openai_server.py (5 hunks)
- tensorrt_llm/serve/router.py (8 hunks)
- tests/integration/defs/disaggregated/test_auto_scaling.py (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_disagg_server.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/router.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_disagg_server.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/router.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_disagg_server.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/router.py
🧬 Code graph analysis (7)

tensorrt_llm/serve/cluster_storage.py (1)
- tensorrt_llm/logger.py (1): debug (144-145)

tensorrt_llm/serve/openai_server.py (4)
- tensorrt_llm/llmapi/disagg_utils.py (3): DisaggClusterConfig (53-58), MetadataServerConfig (75-80), ServerRole (19-22)
- tensorrt_llm/commands/serve.py (1): serve (340-405)
- tensorrt_llm/serve/cluster_storage.py (1): create_cluster_storage_client (112-115)
- tensorrt_llm/serve/disagg_auto_scaling.py (2): DisaggClusterWorker (189-301), register_worker (229-264)

tensorrt_llm/llmapi/disagg_utils.py (1)
- tests/unittest/disaggregated/test_disagg_cluster_manager_worker.py (1): config (34-41)

tensorrt_llm/serve/openai_disagg_server.py (5)
- tensorrt_llm/llmapi/disagg_utils.py (2): MetadataServerConfig (75-80), ServerRole (19-22)
- tensorrt_llm/commands/serve.py (1): serve (340-405)
- tensorrt_llm/serve/cluster_storage.py (4): WatchEventType (26-28), create_cluster_storage (106-109), start (63-64), start (158-161)
- tensorrt_llm/serve/disagg_auto_scaling.py (5): DisaggClusterManager (32-186), watch_workers (90-104), is_ready_with_router (184-186), worker_info (218-222), get_worker_events (110-125)
- tensorrt_llm/serve/router.py (10): KvCacheAwareRouter (580-707), servers (168-169), add_server (172-176), add_server (444-454), add_server (510-520), add_server (628-638), remove_server (179-183), remove_server (456-467), remove_server (522-533), remove_server (642-650)

tests/integration/defs/disaggregated/test_auto_scaling.py (1)
- tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (2): env (54-61), disagg_server (162-179)

tensorrt_llm/commands/serve.py (1)
- tensorrt_llm/llmapi/disagg_utils.py (4): DisaggClusterConfig (53-58), MetadataServerConfig (75-80), ServerRole (19-22), extract_cluster_config (243-262)

tensorrt_llm/serve/router.py (2)
- tests/unittest/disaggregated/test_router.py (1): servers (50-51)
- tensorrt_llm/logger.py (2): warning (132-133), debug (144-145)
🪛 Ruff (0.13.3)
tensorrt_llm/llmapi/disagg_utils.py
250-251: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/serve/openai_disagg_server.py
95-95: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Avoid specifying long messages outside the exception class
(TRY003)
102-102: Avoid specifying long messages outside the exception class
(TRY003)
105-105: Avoid specifying long messages outside the exception class
(TRY003)
621-621: Do not assign a lambda expression, use a def
Rewrite worker_repr as a def
(E731)
636-636: Do not catch blind exception: Exception
(BLE001)
tests/integration/defs/disaggregated/test_auto_scaling.py
99-99: subprocess call: check for execution of untrusted input
(S603)
117-117: Probable insecure usage of temporary file or directory: "/tmp/disagg_server_"
(S108)
123-123: subprocess call: check for execution of untrusted input
(S603)
130-130: f-string without any placeholders
Remove extraneous f prefix
(F541)
132-132: Probable use of requests call without timeout
(S113)
153-153: Probable use of requests call without timeout
(S113)
165-165: Probable use of requests call without timeout
(S113)
181-182: try-except-pass detected, consider logging the exception
(S110)
181-181: Do not catch blind exception: Exception
(BLE001)
198-198: Unused function argument: router
(ARG001)
220-220: Unused function argument: router
(ARG001)
237-237: Do not assert blind exception: Exception
(B017)
260-260: Unused function argument: router
(ARG001)
276-276: Do not assert blind exception: Exception
(B017)
296-296: Do not assert blind exception: Exception
(B017)
332-332: Unused function argument: router
(ARG001)
349-349: Do not assert blind exception: Exception
(B017)
tensorrt_llm/commands/serve.py
402-403: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
402-403: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/serve/router.py
450-450: Consider [*old_servers.copy(), server] instead of concatenation
Replace with [*old_servers.copy(), server]
(RUF005)
517-517: Consider [*old_servers.copy(), server] instead of concatenation
Replace with [*old_servers.copy(), server]
(RUF005)
740-741: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/test_lists/test-db/l0_h100.yml (1)
40-40: Auto-scaling test inclusion looks good

Nice to see the new disaggregated auto-scaling coverage wired into the H100 pre-merge suite. This aligns with the surrounding disaggregated test set.
tensorrt_llm/serve/cluster_storage.py (1)
280-283: Good: reduce noisy logs in idle cycles

Guarding the debug log avoids spam when nothing expires. LGTM.
tensorrt_llm/serve/openai_server.py (5)
31-33: Imports look fine
38-40: Imports look fine
86-92: Store binding fields eagerly

Storing binding_addr/host/port for later registration is reasonable.
140-143: Init cluster fields

Good to initialize to None to avoid attribute errors.
966-968: Persist host/port before serving

Required for worker registration. Looks good.
📝 Walkthrough

Adds disaggregated cluster support across the serve CLI, server initialization, and routers. Introduces cluster_config plumbing, cluster manager/worker/storage lifecycle, readiness gating, a /cluster_info endpoint, and dynamic router add/remove. Extends disaggregation utilities to construct cluster configs. Adjusts logging in cluster storage. Adds integration tests for auto-scaling and discovery.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant CLI as serve CLI
participant Serve as launch_server
participant OAI as OpenAIServer
participant OAI-D as OpenAIDisaggServer
participant CS as ClusterStorage
participant CM as ClusterManager
participant CW as ClusterWorker
rect rgba(230,240,255,0.4)
note over CLI: Startup with disaggregated cluster
CLI->>Serve: parse args (disagg_cluster_uri/server_role)
Serve->>Serve: extract_cluster_config(...)
Serve->>OAI: init(..., cluster_config)
Serve->>OAI-D: init(..., cluster_config)
end
rect rgba(240,255,230,0.4)
note over OAI,OAI-D: Lifespan startup
OAI->>CS: create client (if cluster_config)
OAI->>CW: init worker(host, port)
OAI-->>CS: register worker (deferred in lifespan)
OAI-D->>CS: create client (if cluster_config)
OAI-D->>CM: start manager (watch workers)
CM-->>OAI-D: watch events (add/remove)
OAI-D->>OAI-D: _update_router_by_watch_events
    end
```

```mermaid
sequenceDiagram
autonumber
participant Client as HTTP Client
participant OAI-D as OpenAIDisaggServer
participant RouterC as Context Router
participant RouterG as Generation Router
Client->>OAI-D: GET /cluster_info
OAI-D->>OAI-D: is_ready()
OAI-D-->>Client: status, counts
Client->>OAI-D: POST /v1/completions
OAI-D->>OAI-D: is_ready()? (gate)
alt ready
OAI-D->>RouterC: get_next_server()
OAI-D->>RouterG: get_next_server()
OAI-D-->>Client: completion
else not ready
OAI-D-->>Client: 503/health not ready
end
note over OAI-D,RouterC: On CM events
OAI-D->>RouterC: add_server/remove_server
OAI-D->>RouterG: add_server/remove_server
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
tensorrt_llm/serve/openai_server.py (1)
77-83: Make metadata_server_cfg Optional to match call sites and factory

Type it as Optional and default to None to avoid a surprising type mismatch and align with create_metadata_server usage.
```diff
-    def __init__(self,
-                 llm: Union[LLM, MultimodalEncoder],
-                 model: str,
-                 server_role: Optional[ServerRole],
-                 metadata_server_cfg: MetadataServerConfig,
-                 cluster_config: Optional[DisaggClusterConfig] = None):
+    def __init__(self,
+                 llm: Union[LLM, MultimodalEncoder],
+                 model: str,
+                 server_role: Optional[ServerRole],
+                 metadata_server_cfg: Optional[MetadataServerConfig] = None,
+                 cluster_config: Optional[DisaggClusterConfig] = None):
```

tensorrt_llm/serve/router.py (4)
248-275: Don’t assert in the monitoring loop; handle empty/failed states and don’t re-raise

Asserting on runtime state and re-raising inside a long-lived task will kill monitoring and the server’s resilience. Log and continue instead.
```diff
-            assert final_servers, f"No {self._server_role} servers available"
+            if not final_servers:
+                logger.warning(f"No {self._server_role} servers available")
+                await asyncio.sleep(poll_interval)
+                continue
@@
-        except Exception as e:
-            logger.error(f"Error in server monitoring: {e}")
-            raise
+        except Exception as e:
+            logger.error(f"Error in server monitoring: {e}")
+            await asyncio.sleep(poll_interval)
+            continue
```
381-399: Health-check should not assert on timeout; provide a safe default

When metadata_server_cfg is None or missing a timeout, the assert will raise. Use a default timeout.
```diff
-        assert self._health_check_timeout is not None, "health_check_timeout is not set"
+        timeout = self._health_check_timeout or 5.0
         try:
             async with self._session.get(
-                    f"{server_url}/health",
-                    timeout=self._health_check_timeout) as response:
+                    f"{server_url}/health",
+                    timeout=timeout) as response:
```
68-74: ServerState.is_healthy uses undefined self._session

This method will fail if used. Either pass a session in or create one locally; a minimal fix is below.
```diff
 async def is_healthy(self) -> bool:
-    try:
-        async with self._session.get(self._server + "/health") as response:
-            return response.status == 200
-    except Exception:
-        return False
+    try:
+        async with aiohttp.ClientSession() as session:
+            async with session.get(self._server + "/health") as response:
+                return response.status == 200
+    except Exception:
+        return False
```
1-1: Missing NVIDIA Apache-2.0 header

Add the NVIDIA Apache-2.0 copyright header with the current year at the top.
tensorrt_llm/commands/serve.py (1)
1-1: Missing NVIDIA Apache-2.0 header

Add the NVIDIA Apache-2.0 copyright header with the current year at the top.
🧹 Nitpick comments (6)
tensorrt_llm/serve/openai_disagg_server.py (1)
620-638: Avoid assigning a lambda; define a small helper instead

Define worker_repr with def for clarity and to satisfy linters.
```diff
-        worker_repr = lambda worker_info: f"http://{worker_info.host}:{worker_info.port}"
+        def worker_repr(worker_info):
+            return f"http://{worker_info.host}:{worker_info.port}"
```

Additionally, consider narrowing the broad except to expected exceptions in this loop.
tensorrt_llm/serve/router.py (1)
167-170: Avoid exposing mutable internal state in the servers property

Return a copy to prevent external mutation and lock bypass.
```diff
 @property
 def servers(self) -> List[str]:
-    return self._servers
+    return list(self._servers)
```

tests/integration/defs/disaggregated/test_auto_scaling.py (4)
130-146: Add timeouts to requests and remove an unnecessary f-string

Avoid indefinite blocking in tests and fix F541.
- logger.info(f"Waiting for disagg server to be ready") + logger.info("Waiting for disagg server to be ready") try: - info_resp = requests.get(f"http://localhost:{port}/cluster_info") + info_resp = requests.get(f"http://localhost:{port}/cluster_info", + timeout=5) if info_resp.status_code == 200: info = info_resp.json() if info["is_ready"]: break logger.info( f"Waiting for disagg server to be ready: {info_resp.json()}" ) else: logger.info( f"Failed to get cluster info: {info_resp.status_code}") await asyncio.sleep(3) except requests.exceptions.RequestException as e: logger.error(f"Failed to get cluster info: {e}")
149-158: Add a timeout to worker health checks

Prevents hangs.
- info_resp = requests.get(f"http://localhost:{port}/health") + info_resp = requests.get(f"http://localhost:{port}/health", + timeout=5)
165-174: Add a timeout to cluster_info verification requests

Prevents hangs.
- info_resp = requests.get(f"http://localhost:{port}/cluster_info") + info_resp = requests.get(f"http://localhost:{port}/cluster_info", + timeout=5)
176-183: Log exceptions on terminate for triage

Don’t silently swallow errors in the integration harness.
```diff
 def terminate(*args):
     try:
         for arg in args:
             arg.terminate()
             arg.wait()
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"terminate() encountered an error: {e}")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- tensorrt_llm/commands/serve.py (6 hunks)
- tensorrt_llm/llmapi/disagg_utils.py (5 hunks)
- tensorrt_llm/serve/cluster_storage.py (1 hunks)
- tensorrt_llm/serve/openai_disagg_server.py (10 hunks)
- tensorrt_llm/serve/openai_server.py (5 hunks)
- tensorrt_llm/serve/router.py (8 hunks)
- tests/integration/defs/disaggregated/test_auto_scaling.py (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/openai_disagg_server.py
- tensorrt_llm/serve/router.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/openai_disagg_server.py
- tensorrt_llm/serve/router.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tensorrt_llm/serve/cluster_storage.py
- tensorrt_llm/llmapi/disagg_utils.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/openai_disagg_server.py
- tensorrt_llm/serve/router.py
- tests/integration/defs/disaggregated/test_auto_scaling.py
🧬 Code graph analysis (7)

tensorrt_llm/serve/cluster_storage.py (1)
- tensorrt_llm/logger.py (1): debug (144-145)

tensorrt_llm/llmapi/disagg_utils.py (1)
- tests/unittest/disaggregated/test_disagg_cluster_manager_worker.py (1): config (34-41)

tensorrt_llm/serve/openai_server.py (4)
- tensorrt_llm/llmapi/disagg_utils.py (2): DisaggClusterConfig (53-58), ServerRole (19-22)
- tensorrt_llm/commands/serve.py (1): serve (340-405)
- tensorrt_llm/serve/cluster_storage.py (1): create_cluster_storage_client (112-115)
- tensorrt_llm/serve/disagg_auto_scaling.py (2): DisaggClusterWorker (189-301), register_worker (229-264)

tensorrt_llm/commands/serve.py (1)
- tensorrt_llm/llmapi/disagg_utils.py (4): DisaggClusterConfig (53-58), MetadataServerConfig (75-80), ServerRole (19-22), extract_cluster_config (243-262)

tensorrt_llm/serve/openai_disagg_server.py (5)
- tensorrt_llm/llmapi/disagg_utils.py (1): ServerRole (19-22)
- tensorrt_llm/commands/serve.py (1): serve (340-405)
- tensorrt_llm/serve/cluster_storage.py (4): WatchEventType (26-28), create_cluster_storage (106-109), start (63-64), start (158-161)
- tensorrt_llm/serve/disagg_auto_scaling.py (6): DisaggClusterManager (32-186), watch_workers (90-104), is_ready_with_router (184-186), worker_info (218-222), get_worker_events (110-125), worker_id (214-215)
- tensorrt_llm/serve/router.py (10): KvCacheAwareRouter (580-707), servers (168-169), add_server (172-176), add_server (444-454), add_server (510-520), add_server (628-638), remove_server (179-183), remove_server (456-467), remove_server (522-533), remove_server (642-650)

tensorrt_llm/serve/router.py (1)
- tensorrt_llm/logger.py (2): warning (132-133), debug (144-145)

tests/integration/defs/disaggregated/test_auto_scaling.py (2)
- tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (2): env (54-61), disagg_server (162-179)
- tensorrt_llm/logger.py (1): error (126-127)
🪛 Ruff (0.13.3)
tensorrt_llm/llmapi/disagg_utils.py
250-251: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/commands/serve.py
402-403: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
402-403: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/serve/openai_disagg_server.py
95-95: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Avoid specifying long messages outside the exception class
(TRY003)
102-102: Avoid specifying long messages outside the exception class
(TRY003)
105-105: Avoid specifying long messages outside the exception class
(TRY003)
621-621: Do not assign a lambda expression, use a def
Rewrite worker_repr as a def
(E731)
636-636: Do not catch blind exception: Exception
(BLE001)
tensorrt_llm/serve/router.py
450-450: Consider [*old_servers.copy(), server] instead of concatenation
Replace with [*old_servers.copy(), server]
(RUF005)
517-517: Consider [*old_servers.copy(), server] instead of concatenation
Replace with [*old_servers.copy(), server]
(RUF005)
740-741: Avoid specifying long messages outside the exception class
(TRY003)
tests/integration/defs/disaggregated/test_auto_scaling.py
99-99: subprocess call: check for execution of untrusted input
(S603)
117-117: Probable insecure usage of temporary file or directory: "/tmp/disagg_server_"
(S108)
123-123: subprocess call: check for execution of untrusted input
(S603)
130-130: f-string without any placeholders
Remove extraneous f prefix
(F541)
132-132: Probable use of requests call without timeout
(S113)
153-153: Probable use of requests call without timeout
(S113)
165-165: Probable use of requests call without timeout
(S113)
181-182: try-except-pass detected, consider logging the exception
(S110)
181-181: Do not catch blind exception: Exception
(BLE001)
198-198: Unused function argument: router
(ARG001)
220-220: Unused function argument: router
(ARG001)
237-237: Do not assert blind exception: Exception
(B017)
260-260: Unused function argument: router
(ARG001)
276-276: Do not assert blind exception: Exception
(B017)
296-296: Do not assert blind exception: Exception
(B017)
332-332: Unused function argument: router
(ARG001)
349-349: Do not assert blind exception: Exception
(B017)
🔇 Additional comments (2)
tensorrt_llm/serve/cluster_storage.py (1)
280-283: LGTM: avoid noisy debug log when nothing expires

Conditionally logging only when deletions occur is appropriate.
tensorrt_llm/llmapi/disagg_utils.py (1)
71-71: LGTM: DisaggServerConfig gains cluster_config

The addition looks consistent with the new server plumbing.
Force-pushed from 688a074 to 2fcd02a
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Force-pushed from 2fcd02a to fe52c68
/bot run

PR_Github #20873 [ run ] triggered by Bot
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
PR_Github #20873 [ run ] completed with state
Force-pushed from e2f71af to 4ec6d1a
/bot run

PR_Github #20956 [ run ] triggered by Bot

PR_Github #20956 [ run ] completed with state
Force-pushed from 4ec6d1a to 2c5730b
/bot run

/bot run --disable-fail-fast

PR_Github #21848 [ run ] triggered by Bot. Commit:

PR_Github #21848 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #21851 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #21857 [ run ] triggered by Bot. Commit:

PR_Github #21851 [ run ] completed with state

PR_Github #21857 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #21885 [ run ] triggered by Bot. Commit:

PR_Github #21885 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #21977 [ run ] triggered by Bot. Commit:

PR_Github #21977 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #22022 [ run ] triggered by Bot. Commit:

PR_Github #22022 [ run ] completed with state
@reasonsolo Hi Lizhi, do we have any doc or example telling users how to use this feature? Thanks
PR for doc and example is ready at #8602 |
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Summary by CodeRabbit
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages, and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the --stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
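For example, combining the flags documented above (the pipeline id and stage name here are illustrative placeholders):

```
/bot run --disable-fail-fast
/bot run --stage-list "A10-PyTorch-1" --gpu-type "A30"
/bot run --reuse-test 21848
```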
kill

Kill all running builds associated with the pull request.
skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.