Inference Pool Level Model Name Redirect and Traffic Splitting #1811

@zetxqx

This issue tracks the implementation of the proposal to re-introduce model name redirection and traffic splitting functionality at the inference pool level.

Original proposal doc can be found here

Problem

The deprecation of InferenceModel has removed the ability to perform model name aliasing/versioning and granular traffic splitting within an inference pool. This functionality is crucial for use cases like gradual rollouts of new LoRA adapters without requiring client-side changes.

Proposed Solution

The proposal suggests introducing a new Custom Resource Definition (CRD) called InferenceModelRewrite. This CRD will contain the configuration for model redirection and traffic splitting.
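
To make the shape of such a resource concrete, here is a minimal Go sketch of what the spec could carry. Every type and field name below (`InferenceModelRewriteSpec`, `RewriteRule`, `TargetModel`, `PoolName`, `Match`, `Weight`) is an assumption made for illustration; the actual schema is whatever the proposal doc defines.

```go
package main

import (
	"errors"
	"fmt"
)

// NOTE: all names here are hypothetical. They illustrate the kind of
// configuration the issue describes, not the real InferenceModelRewrite schema.

// TargetModel is one weighted destination for a rewritten model name.
type TargetModel struct {
	Name   string // model name written into the request body
	Weight int    // relative share of traffic, assumed non-negative
}

// RewriteRule maps a client-facing model name to one or more weighted targets.
type RewriteRule struct {
	Match   string
	Targets []TargetModel
}

// InferenceModelRewriteSpec groups the rules that apply to a single pool.
type InferenceModelRewriteSpec struct {
	PoolName string
	Rules    []RewriteRule
}

// Validate rejects specs that rewrite logic could not act on: a missing pool,
// an empty match or target name, a negative weight, or a rule whose weights
// sum to zero.
func (s InferenceModelRewriteSpec) Validate() error {
	if s.PoolName == "" {
		return errors.New("pool name must be set")
	}
	for _, r := range s.Rules {
		if r.Match == "" {
			return errors.New("rule has an empty match")
		}
		total := 0
		for _, t := range r.Targets {
			if t.Name == "" {
				return fmt.Errorf("rule %q has a target with an empty name", r.Match)
			}
			if t.Weight < 0 {
				return fmt.Errorf("rule %q has a negative weight", r.Match)
			}
			total += t.Weight
		}
		if total == 0 {
			return fmt.Errorf("rule %q has no positive weight", r.Match)
		}
	}
	return nil
}
```

Under these assumptions, a gradual LoRA rollout would be a single rule with two targets, for example weights 90 and 10, and the client keeps sending the original model name.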

The Endpoint Picker (EPP) will be responsible for:

  • Watching InferenceModelRewrite resources.
  • Parsing request bodies and modifying the model field based on the rewrite rules.
  • Handling weight-based traffic splitting.
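
The last two responsibilities can be sketched in Go as follows. This is an illustration under stated assumptions, not the EPP implementation: `Target`, `pickTarget`, and `RewriteModel` are hypothetical names, the request body is assumed to be JSON with a top-level string `model` field, and weights are assumed to be non-negative. Watching the CRD is out of scope for the sketch; the rules are passed in as a plain map.

```go
package main

import "encoding/json"

// Target is a hypothetical weighted destination model.
type Target struct {
	Name   string
	Weight int // assumed non-negative
}

// totalWeight sums the weights of all targets.
func totalWeight(targets []Target) int {
	total := 0
	for _, t := range targets {
		total += t.Weight
	}
	return total
}

// pickTarget maps roll, expected to lie in [0, totalWeight), onto a target so
// that each target owns a slice of the range proportional to its weight.
func pickTarget(targets []Target, roll int) string {
	for _, t := range targets {
		if roll < t.Weight {
			return t.Name
		}
		roll -= t.Weight
	}
	return "" // only reached if roll is out of range
}

// RewriteModel replaces the "model" field of a JSON request body according to
// rules. intn must return a value in [0, n); math/rand's rand.Intn fits.
// Bodies with no "model" field or no matching rule are returned unchanged.
func RewriteModel(body []byte, rules map[string][]Target, intn func(n int) int) ([]byte, error) {
	var req map[string]json.RawMessage
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, err
	}
	raw, ok := req["model"]
	if !ok {
		return body, nil
	}
	var model string
	if err := json.Unmarshal(raw, &model); err != nil {
		return nil, err
	}
	targets := rules[model]
	total := totalWeight(targets)
	if total <= 0 {
		return body, nil
	}
	chosen, err := json.Marshal(pickTarget(targets, intn(total)))
	if err != nil {
		return nil, err
	}
	req["model"] = chosen
	// Re-encoding a map emits keys in sorted order; other fields are preserved.
	return json.Marshal(req)
}
```

Passing the random source in as a function keeps the split deterministic under test; in production code the caller would pass something like `rand.Intn`. Note that re-marshalling the body changes key order and whitespace, which a real implementation may or may not find acceptable.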

The implementation will be done in two phases:

  • Phase 1: EPP-Driven Intra-Pool Rewrite: EPP will be enhanced to act as a read-only controller for the InferenceModelRewrite CRD, executing request body mutation and traffic splitting within a single InferencePool.
  • Phase 2 (Conditional): Promote Rewrite Logic to BBR: If necessary, the core rewrite and splitting logic can be moved into a shared library used by both BBR (Body-Based Routing) and EPP, so that BBR can make routing decisions after the model name has been rewritten.

Labels

triage/accepted: Indicates an issue or PR is ready to be actively worked on.