Description
This issue tracks the implementation of the proposal to re-introduce model name redirection and traffic splitting functionality at the inference pool level.
Original proposal doc can be found here
Problem
The deprecation of InferenceModel has removed the ability to perform model name aliasing/versioning and granular traffic splitting within an inference pool. This functionality is crucial for use cases like gradual rollouts of new LoRA adapters without requiring client-side changes.
Proposed Solution
The proposal suggests introducing a new Custom Resource Definition (CRD) called InferenceModelRewrite. This CRD will contain the configuration for model redirection and traffic splitting.
The Endpoint Picker (EPP) will be responsible for:
- Watching InferenceModelRewrite resources.
- Parsing request bodies and modifying the model field based on the rewrite rules.
- Handling weight-based traffic splitting.
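To make these responsibilities concrete, here is a minimal Python sketch of the rewrite and weighted-split steps. It is illustrative only and is not the EPP implementation: the `Target` type, the `RewriteRules` mapping, the function names, and the model names in the usage example are all hypothetical, and it assumes the request body is a JSON object with a top-level `model` field.

```python
import json
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One weighted destination model name (hypothetical shape)."""
    model: str
    weight: int


# Hypothetical in-memory form of the rewrite rules:
# requested model name -> weighted list of target model names.
RewriteRules = dict[str, list[Target]]


def pick_target(targets: list[Target], rng: random.Random) -> str:
    """Pick a target model name with probability proportional to its weight."""
    models = [t.model for t in targets]
    weights = [t.weight for t in targets]
    return rng.choices(models, weights=weights, k=1)[0]


def rewrite_body(body: bytes, rules: RewriteRules, rng: random.Random) -> bytes:
    """Rewrite the `model` field of a JSON request body if a rule matches.

    Bodies with no matching rule are returned unchanged.
    """
    payload = json.loads(body)
    if not isinstance(payload, dict):
        return body
    requested = payload.get("model")
    if not isinstance(requested, str):
        return body
    targets = rules.get(requested)
    if not targets:
        return body
    payload["model"] = pick_target(targets, rng)
    return json.dumps(payload).encode("utf-8")


if __name__ == "__main__":
    # Assumed example: send roughly 90% of traffic to v1 and 10% to v2.
    rules = {
        "food-review": [
            Target("food-review-v1", 90),
            Target("food-review-v2", 10),
        ]
    }
    request = json.dumps({"model": "food-review", "prompt": "hello"}).encode()
    print(rewrite_body(request, rules, random.Random()))
```

Because the client keeps sending the same model name, a gradual rollout in this sketch only requires changing the weights in the rules, not the client.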
The implementation will be done in two phases:
- Phase 1: EPP-Driven Intra-Pool Rewrite: EPP will be enhanced to act as a read-only controller for the InferenceModelRewrite CRD, executing request body mutation and traffic splitting within a single InferencePool.
- Phase 2 (Conditional): Promote Rewrite Logic to BBR: If necessary, the core rewrite/splitting logic can be moved into a shared library for both BBR and EPP, allowing BBR to make routing decisions after the model name has been rewritten.