[model-gateway]: Implement hierarchical multi-tenant and model-based rate limiting #15517
+578
−120
Motivation
This PR replaces the single global token bucket with a hierarchical RateLimiter manager. This allows the Gateway to enforce concurrency limits based on the identity of the requester (Tenant ID) and the specific model being requested (Model ID).
Modifications
Hierarchical Fallback: Permits are acquired using a specific-to-general matching strategy:
1. (Tenant, Model)
2. (Tenant, *)
3. (*, Model)
4. (*, *) (Global Default)
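
The specific-to-general matching order above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the gateway's implementation: `resolve_limit`, the `rules` dictionary, and the use of `"*"` as the wildcard key are hypothetical names chosen for this example.

```python
from typing import Dict, Optional, Tuple

WILDCARD = "*"  # assumed wildcard marker for "any tenant" / "any model"
Key = Tuple[str, str]


def resolve_limit(
    rules: Dict[Key, int], tenant: Optional[str], model: Optional[str]
) -> Optional[int]:
    """Return the limit of the most specific matching rule, or None."""
    t = tenant or WILDCARD
    m = model or WILDCARD
    # Candidates ordered from most specific to most general.
    candidates = [
        (t, m),                # 1. (Tenant, Model)
        (t, WILDCARD),         # 2. (Tenant, *)
        (WILDCARD, m),         # 3. (*, Model)
        (WILDCARD, WILDCARD),  # 4. (*, *) global default
    ]
    for key in candidates:
        if key in rules:
            return rules[key]
    return None
```

With rules for `("customer-a", "*")` and `("*", "gpt-4")`, a request from customer-a for gpt-4 resolves to the tenant rule, because step 2 is tried before step 3.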
Tenant ID: Extracted from HTTP headers (configurable, defaults to X-Tenant-ID).
Model ID: Extracted from the X-Model-ID header OR peeked from the JSON body (OpenAI-compatible /v1/chat/completions).
Body Peeking Middleware: Middleware now supports peeking at request bodies to extract the model field while correctly reconstructing the request stream for subsequent handlers.
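
As a rough sketch of the peeking idea, the snippet below reads a request body, extracts the model field, and hands back a fresh stream so later handlers still see the full payload. The function names `peek_body` and `extract_model_id` are hypothetical, and giving the X-Model-ID header precedence over the body is an assumption; the PR only states that either source may be used.

```python
import io
import json
from typing import BinaryIO, Mapping, Optional, Tuple


def peek_body(stream: BinaryIO) -> Tuple[bytes, BinaryIO]:
    """Consume the stream and return its bytes plus a replayable copy."""
    data = stream.read()
    return data, io.BytesIO(data)


def extract_model_id(headers: Mapping[str, str], body: bytes) -> Optional[str]:
    """Find the model ID in the header first (assumed), then the JSON body."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-model-id" and value:
            return value
    try:
        payload = json.loads(body)
    except ValueError:
        # Empty, non-JSON, or badly encoded body: no model can be determined.
        return None
    if isinstance(payload, dict):
        model = payload.get("model")
        if isinstance(model, str) and model:
            return model
    return None
```

A handler would call `peek_body` once, pass the returned bytes to `extract_model_id`, and forward the returned stream downstream in place of the consumed one. Buffering the whole body in memory is a simplification made for this example.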
CLI: New flags and Python class properties allow defining complex rules using the tenant:model:max_concurrent:refill_rate syntax.
CLI Flags:
--rate-limit-rule: Define a specific limit. Can be used multiple times.
- Example: --rate-limit-rule "customer-a::10:10" (Limits customer-a to 10 concurrent requests across all models)
- Example: --rate-limit-rule ":gpt-4:5:5" (Limits the entire system to 5 concurrent gpt-4 streams)
--rate-limit-tenant-header: Specify a custom header for tenant identification.
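
To make the tenant:model:max_concurrent:refill_rate syntax concrete, here is one way such a rule string could be parsed. It is a sketch only: `RateLimitRule` and `parse_rule` are hypothetical names, treating an empty field as the `"*"` wildcard is inferred from the examples above, and integer values for both limits are assumed.

```python
from dataclasses import dataclass

WILDCARD = "*"  # assumed internal marker for an omitted tenant or model


@dataclass(frozen=True)
class RateLimitRule:
    tenant: str
    model: str
    max_concurrent: int
    refill_rate: int


def parse_rule(spec: str) -> RateLimitRule:
    """Parse 'tenant:model:max_concurrent:refill_rate' into a rule."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}: {spec!r}")
    tenant, model, max_concurrent, refill_rate = parts
    return RateLimitRule(
        tenant=tenant or WILDCARD,  # empty tenant means "any tenant"
        model=model or WILDCARD,    # empty model means "any model"
        max_concurrent=int(max_concurrent),
        refill_rate=int(refill_rate),
    )
```

Note that a plain split on `:` cannot handle tenant or model names that themselves contain a colon; a real parser would need an escaping rule or a different delimiter for that case.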
Added py_test/integration_mock/test_multi_tenant_rate_limiting.py verifying isolation between different tenants and models using mock workers.
Added core::rate_limiter tests covering all hierarchy fallback scenarios.
Accuracy Tests
Benchmarking and Profiling
Checklist