[model-gateway]: Implement hierarchical multi-tenant and model-based rate limiting #15517

Ratish1 · 2025-12-20T07:45:03Z

Motivation

This PR replaces the single global token bucket with a hierarchical RateLimiter manager. This allows the Gateway to enforce concurrency limits based on the identity of the requester (Tenant ID) and the specific model being requested (Model ID).

Modifications

Hierarchical Fallback: Permits are acquired using a specific-to-general matching strategy:
1. (Tenant, Model)
2. (Tenant, *)
3. (*, Model)
4. (*, *) (Global Default)
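The fallback order above can be sketched as a lookup over four candidate keys. This is a minimal illustration, not the gateway's Rust implementation: the `resolve_limit` function and the `rules` table are hypothetical names, and it assumes that the first matching rule wins rather than permits being acquired at every level.

```python
from typing import Dict, Optional, Tuple

WILDCARD = "*"

# Hypothetical rule table: (tenant, model) -> max concurrent requests.
Rules = Dict[Tuple[str, str], int]


def resolve_limit(rules: Rules, tenant: Optional[str], model: Optional[str]) -> Optional[int]:
    """Return the limit of the most specific matching rule, or None if no rule matches."""
    t = tenant or WILDCARD
    m = model or WILDCARD
    # Candidates ordered from most specific to most general.
    for key in ((t, m), (t, WILDCARD), (WILDCARD, m), (WILDCARD, WILDCARD)):
        if key in rules:
            return rules[key]
    return None


# Illustrative values only; they are not taken from the PR.
rules = {
    ("customer-a", "gpt-4"): 2,
    ("customer-a", "*"): 10,
    ("*", "gpt-4"): 5,
    ("*", "*"): 100,
}
```

With this table, a request from `customer-a` for an unlisted model falls through to the `("customer-a", "*")` rule, while an unknown tenant requesting `gpt-4` falls through to `("*", "gpt-4")`.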
Tenant ID: Extracted from HTTP headers (configurable, defaults to X-Tenant-ID).
Model ID: Extracted from the X-Model-ID header OR peeked from the JSON body (OpenAI-compatible /v1/chat/completions).
Body Peeking Middleware: Middleware now supports peeking at request bodies to extract the model field while correctly reconstructing the request stream for subsequent handlers.
CLI: New flags and Python class properties allow defining complex rules using the tenant:model:max_concurrent:refill_rate syntax.
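The model lookup described above (header or JSON body) can be sketched as follows. This is an illustration under stated assumptions: `extract_model_id` is a hypothetical helper, the body is assumed to be already buffered as bytes, and giving the header precedence over the body is an assumption the description does not confirm.

```python
import json
from typing import Mapping, Optional


def extract_model_id(headers: Mapping[str, str], body: bytes) -> Optional[str]:
    """Return the model ID from the X-Model-ID header, else from the JSON body's "model" field."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    header_value = lowered.get("x-model-id")
    if header_value:
        return header_value  # assumption: the header wins over the body
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON, an empty body, and undecodable bytes.
        return None
    if isinstance(payload, dict):
        model = payload.get("model")
        if isinstance(model, str) and model:
            return model
    return None
```

Because the function only reads the buffered bytes and never consumes a stream, the same bytes can be handed to the next handler unchanged, which is the property the body-peeking middleware needs to preserve.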

CLI Flags:
--rate-limit-rule: Define a specific limit. Can be used multiple times.
- Example: --rate-limit-rule "customer-a::10:10" (limits customer-a to 10 concurrent requests in total)
- Example: --rate-limit-rule ":gpt-4:5:5" (limits the entire system to 5 concurrent gpt-4 streams)
--rate-limit-tenant-header: Specify a custom header for tenant identification.

Tests:
Added py_test/integration_mock/test_multi_tenant_rate_limiting.py, which verifies isolation between different tenants and models using mock workers.
Added core::rate_limiter tests covering all hierarchy fallback scenarios.
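To make the tenant:model:max_concurrent:refill_rate syntax concrete, here is a minimal parsing sketch. `RateLimitRule` and `parse_rule` are hypothetical names, not the gateway's actual types. Treating an empty tenant or model field as a wildcard is an assumption drawn from the two --rate-limit-rule examples, and integer values are assumed for both numeric fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimitRule:
    tenant: str          # "*" means any tenant
    model: str           # "*" means any model
    max_concurrent: int
    refill_rate: int


def parse_rule(spec: str) -> RateLimitRule:
    """Parse a 'tenant:model:max_concurrent:refill_rate' string into a rule."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(
            f"expected tenant:model:max_concurrent:refill_rate, got {spec!r}"
        )
    tenant, model, max_concurrent, refill_rate = parts
    return RateLimitRule(
        tenant=tenant or "*",  # assumption: empty field is a wildcard
        model=model or "*",    # assumption: empty field is a wildcard
        max_concurrent=int(max_concurrent),
        refill_rate=int(refill_rate),
    )
```

One caveat of a plain colon split is that a tenant or model name containing a colon would be rejected, so a real parser may need escaping or a different delimiter.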

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process.


gemini-code-assist · 2025-12-20T07:45:07Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Ratish1 added 3 commits December 19, 2025 20:53

initial commit for rate-limiting

b82c447

[model-gateway]: Implement hierarchical multi-tenant and model-based rate limiting

e8941ac

more

5a27b1d

Ratish1 requested review from ByronHsu, CatherineSue, key4ng and slin1237 as code owners December 20, 2025 07:45

github-actions bot added the model-gateway label Dec 20, 2025

Merge remote-tracking branch 'upstream/main' into rate-limiting

cf17e27
