
Conversation

@Ratish1 (Contributor) commented Dec 20, 2025

Motivation

This PR replaces the single global token bucket with a hierarchical RateLimiter manager. This allows the Gateway to enforce concurrency limits based on the identity of the requester (Tenant ID) and the specific model being requested (Model ID).

Modifications

  • Hierarchical Fallback: Permits are acquired using a specific-to-general matching strategy (see the sketch after this list):
    1. (Tenant, Model)
    2. (Tenant, *)
    3. (*, Model)
    4. (*, *) (Global Default)
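A minimal sketch of the fallback lookup, assuming rules are stored in a map keyed by `(tenant, model)` with `"*"` as the wildcard; `RateLimiterManager`, `Rule`, and `resolve` are illustrative names, not the actual types in this PR:

```rust
use std::collections::HashMap;

/// Illustrative per-key limit; the real rule also drives a token-bucket refill.
#[derive(Debug, Clone, Copy)]
struct Rule {
    max_concurrent: usize,
    refill_rate: usize,
}

/// Hypothetical manager: rules keyed by (tenant, model), with "*" as wildcard.
struct RateLimiterManager {
    rules: HashMap<(String, String), Rule>,
}

impl RateLimiterManager {
    /// Specific-to-general lookup: (tenant, model) -> (tenant, *) -> (*, model) -> (*, *).
    fn resolve(&self, tenant: &str, model: &str) -> Option<&Rule> {
        let candidates = [(tenant, model), (tenant, "*"), ("*", model), ("*", "*")];
        candidates
            .iter()
            .find_map(|(t, m)| self.rules.get(&(t.to_string(), m.to_string())))
    }
}

fn main() {
    let mut rules = HashMap::new();
    rules.insert(("customer-a".into(), "*".into()), Rule { max_concurrent: 10, refill_rate: 10 });
    rules.insert(("*".into(), "gpt-4".into()), Rule { max_concurrent: 5, refill_rate: 5 });
    rules.insert(("*".into(), "*".into()), Rule { max_concurrent: 64, refill_rate: 64 });
    let mgr = RateLimiterManager { rules };

    // customer-a hits its tenant-wide rule before the global gpt-4 rule.
    println!("{:?}", mgr.resolve("customer-a", "gpt-4"));
    // Unknown tenants requesting gpt-4 fall through to the (*, gpt-4) rule.
    println!("{:?}", mgr.resolve("customer-b", "gpt-4"));
}
```

The first matching key wins, so a tenant-specific rule always overrides a model-wide or global one.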

  • Tenant ID: Extracted from HTTP headers (configurable, defaults to X-Tenant-ID).

  • Model ID: Extracted from the X-Model-ID header OR peeked from the JSON body (OpenAI-compatible /v1/chat/completions).

  • Body Peeking Middleware: The middleware now peeks at request bodies to extract the model field while correctly reconstructing the request stream for subsequent handlers (see the sketch below).
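A sketch of the extraction order described above, assuming serde_json for the JSON peek; `extract_model_id` is a hypothetical helper that takes an already-buffered byte slice, whereas the real middleware re-attaches the same bytes to the request so downstream handlers see an untouched body:

```rust
use serde_json::Value;

/// Illustrative extraction order: explicit header first, then a peek at the
/// buffered JSON body for the OpenAI-style "model" field.
fn extract_model_id(header_model: Option<&str>, body: &[u8]) -> Option<String> {
    if let Some(m) = header_model {
        return Some(m.to_string());
    }
    // Peek without consuming: `body` is the already-buffered byte slice.
    serde_json::from_slice::<Value>(body)
        .ok()
        .and_then(|v| v.get("model").and_then(Value::as_str).map(|s| s.to_string()))
}

fn main() {
    let body = br#"{"model": "gpt-4", "messages": []}"#;
    assert_eq!(extract_model_id(None, body), Some("gpt-4".to_string()));
    assert_eq!(extract_model_id(Some("llama-3"), body), Some("llama-3".to_string()));
    println!("model extraction ok");
}
```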

  • CLI: New flags and Python class properties allow defining complex rules using the tenant:model:max_concurrent:refill_rate syntax.

    CLI Flags:

  • --rate-limit-rule: Define a specific limit. Can be used multiple times (a parsing sketch follows this list).
    - Example: --rate-limit-rule "customer-a:*:10:10" (limits customer-a to 10 concurrent requests in total)
    - Example: --rate-limit-rule "*:gpt-4:5:5" (limits the entire system to 5 concurrent gpt-4 streams)

  • --rate-limit-tenant-header: Specify a custom header for tenant identification.
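A sketch of how a `--rate-limit-rule` string could be parsed, assuming `*` (or an empty segment) denotes the wildcard; `RateLimitRule` and `parse_rule` are illustrative names, not the actual parser in this PR:

```rust
/// Hypothetical parsed form of `tenant:model:max_concurrent:refill_rate`.
#[derive(Debug, PartialEq)]
struct RateLimitRule {
    tenant: String,
    model: String,
    max_concurrent: usize,
    refill_rate: usize,
}

fn parse_rule(raw: &str) -> Result<RateLimitRule, String> {
    let parts: Vec<&str> = raw.split(':').collect();
    if parts.len() != 4 {
        return Err(format!("expected 4 ':'-separated fields, got {}", parts.len()));
    }
    // Assumption: an empty segment is treated the same as the explicit "*" wildcard.
    let wildcard = |s: &str| if s.is_empty() { "*".to_string() } else { s.to_string() };
    Ok(RateLimitRule {
        tenant: wildcard(parts[0]),
        model: wildcard(parts[1]),
        max_concurrent: parts[2].parse().map_err(|e| format!("bad max_concurrent: {e}"))?,
        refill_rate: parts[3].parse().map_err(|e| format!("bad refill_rate: {e}"))?,
    })
}

fn main() {
    // Tenant-wide limit for customer-a, as in the first CLI example above.
    println!("{:?}", parse_rule("customer-a:*:10:10").unwrap());
    // Global per-model cap for gpt-4.
    println!("{:?}", parse_rule("*:gpt-4:5:5").unwrap());
}
```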

  • Added py_test/integration_mock/test_multi_tenant_rate_limiting.py verifying isolation between different tenants and models using mock workers.

  • Added core::rate_limiter tests covering all hierarchy fallback scenarios.
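For context on what `max_concurrent` enforcement means at runtime, here is a standalone sketch using a `tokio::sync::Semaphore`: a permit is held for the lifetime of a request and released when the guard drops. This assumes the tokio runtime and hard-codes a single limit of 2; the actual manager keeps one limiter per resolved rule and is not shown here.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Hypothetical limit of 2 concurrent in-flight requests for one rule.
    let limiter = Arc::new(Semaphore::new(2));

    let mut handles = Vec::new();
    for i in 0..4 {
        let limiter = Arc::clone(&limiter);
        handles.push(tokio::spawn(async move {
            // Waits if both permits are already held by in-flight requests.
            let _permit = limiter.acquire_owned().await.expect("semaphore closed");
            println!("request {i} holds a permit");
            tokio::time::sleep(std::time::Duration::from_millis(50)).await;
            // Permit released here when `_permit` is dropped.
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
}
```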

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented:
Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
