Skip to content

feat(proxy): nested model access groups with cycle-safe resolution (#28032)#28075

Open
wtfashwin wants to merge 15 commits into
BerriAI:litellm_internal_stagingfrom
wtfashwin:feat/nested-access-groups
Open

feat(proxy): nested model access groups with cycle-safe resolution (#28032)#28075
wtfashwin wants to merge 15 commits into
BerriAI:litellm_internal_stagingfrom
wtfashwin:feat/nested-access-groups

Conversation

@wtfashwin
Copy link
Copy Markdown

Closes #28032.

What

Adds hierarchical model access groups so a group can include other groups as members. Permission changes on a child group propagate to every parent that includes it, with no manual fan-out.

Example:

restricted-image     -> [dall-e-3, stable-diffusion-xl]
restricted-reasoning -> [o1, o3-mini]
project-x            -> [restricted-image, restricted-reasoning]

A key/team assigned project-x now resolves to all four real models.

Design

Storage

New LiteLLM_AccessGroupMembership(parent_group, child_group) table holds parent->child edges. Existing tag-based groups in model_info.access_groups are untouched - additive, fully backwards compatible.

model LiteLLM_AccessGroupMembership {
  id           String   @id @default(uuid())
  parent_group String
  child_group  String
  created_at   DateTime @default(now())

  @@unique([parent_group, child_group])
  @@index([parent_group])
}

Migration: litellm-proxy-extras/litellm_proxy_extras/migrations/20260516210000_add_access_group_membership/migration.sql with IF NOT EXISTS everywhere. Synced across all three schema.prisma files per the CLAUDE.md rule.

Resolution

resolve_nested_groups() does DFS expansion with on-path cycle detection. Cyclic edges are logged and skipped, not raised - this runs on the auth path and a malformed row must not 500 the proxy. On-path tracking (vs. ever-visited) means legitimate DAG-shared subtrees do NOT produce spurious cycle warnings.

_get_models_from_access_groups() gets a new optional group_memberships kwarg, defaulted to None. When absent/empty, behavior is bit-for-bit identical to today.

Write path

  • validate_models_exist() now accepts optional known_access_groups so a group's members can themselves be groups
  • _classify_member_names() splits the caller's list into real models / child groups / unknown
  • Real models -> existing update_deployments_with_access_group() path (tags deployments)
  • Child groups -> new upsert_group_memberships() path (writes edges, create_many(skip_duplicates=True))
  • Self-references (parent == child) are rejected at write time with 400
  • Multi-hop cycles are caught by the read-path guard

Both create_model_group and update_access_group go through a shared _dual_write_group_membership() helper. update clears existing parent-edges before re-inserting. delete cleans up edges where this group appears on either side via delete_group_membership_edges().

Request-path activation

get_available_models_for_user() (utils.py) and model_info_v1() (proxy_server.py) now fetch the membership map once via get_group_memberships_from_db() (single query, no N+1) and pass it through to get_key_models / get_team_models. When prisma_client is None the map stays empty and behavior is identical to today.

AccessGroupInfo API

Two new fields, always present (empty lists when no edges), additive only:

class AccessGroupInfo(BaseModel):
    access_group: str
    model_names: List[str]
    deployment_count: int
    parent_groups: List[str] = Field(default_factory=list)
    child_groups: List[str] = Field(default_factory=list)

get_all_access_groups_from_db() expands model_names transitively, surfaces pure-composition groups (groups that exist only in the membership table, no deployment tags), and populates parent/child fields.

Tests

35 tests in tests/test_litellm/proxy/auth/test_nested_access_groups.py, all pure functions, no DB / no fake Prisma client needed:

  • resolution: flat compat, single/multi-hop nesting, propagation, pure-composition, diamond
  • cycles: self-ref, indirect, 3-hop, self-ref nested deep in chain (each fires the warning exactly once at the correct node)
  • DAG correctness: shared subtree produces no false warnings, per-top-level visited isolation
  • scale: 50-level linear chain (recursion depth proof), 100-child fanout (no quadratic blowup)
  • classify: router-model precedence over group name, mixed input, input-order preservation
  • validate: accepts known groups, reports unknown, backwards compatible without groups, fail-closed when router is None
  • defensive: empty inputs, unknown seed, missing child reference, empty child list, unicode/special-char names
  • equivalence: group_memberships=None vs {} behave identically

Local checks

$ uv run --no-sync black --check <changed files>
All done!  6 files would be left unchanged.

$ uv run --no-sync ruff check .
All checks passed!

$ uv run pytest tests/test_litellm/proxy/auth/test_nested_access_groups.py -x -vv
============== 35 passed in 1.52s ==============

Backwards compatibility

Surface Guarantee
_get_models_from_access_groups without group_memberships Bit-for-bit identical
get_key_models, get_team_models, get_complete_model_list without new kwarg Identical
validate_models_exist without known_access_groups Identical
AccessGroupInfo response New fields are additive with [] defaults
model_info["access_groups"] storage Untouched

Out of scope (deliberate)

  • UI: dashboard tree-view / drag-and-drop editor for nested groups (separate PR)
  • New LiteLLM_AccessGroup parent table: would require migrating every existing tag into rows + rewriting every read path. The membership-table approach gets the feature in without that disruption
  • Per-request hot-path activation in user_api_key_auth: current activation is on the listing/info endpoints. Hot-path would need router-level caching of the membership map; can follow up if maintainers want it

Checklist

  • CLA signed
  • Scope isolated to one feature
  • Tests added (tests/test_litellm/proxy/auth/test_nested_access_groups.py)
  • make lint passes for changed files (ruff clean, black clean on touched files)
  • Schema synced across all 3 schema.prisma files + migration added
  • Prisma client methods only (no raw SQL); single query for membership read; create_many(skip_duplicates=True) for batch writes
  • Duplicate-PR check: no overlapping open or closed PR for [Feature]: Nested model access groups #28032 / nested access groups / AccessGroupMembership

shin-berri and others added 9 commits May 13, 2026 22:37
Holds parent/child edges between access groups so a group can include
another group's models. Schema + migration only; resolution next.

Refs BerriAI#28032
_get_models_from_access_groups takes a new optional group_memberships map
(parent -> [child_groups]) and expands recursively via _resolve_nested_groups.
Cycles are logged and skipped, not raised - this runs on the auth path and
a malformed row must not 500 the proxy.

Backwards compatible: callers that don't pass group_memberships get
identical behavior to today. get_key_models, get_team_models thread the
new kwarg through.

Refs BerriAI#28032
validate_models_exist accepts known access group names so a group's
members can themselves be groups. _classify_member_names splits the
caller's list into real models vs child groups, and each is routed to
its own write path.

create_model_group / update_access_group: classify -> dual-write
(deployments tagged for real models, parent->child edges for child groups).
update_access_group also clears prior edges where this group is parent
before re-inserting.

delete_access_group: also deletes membership edges where this group
appears on either side, to avoid dangling references.

AccessGroupInfo gains parent_groups + child_groups (always present,
empty by default). get_all_access_groups_from_db expands model_names
transitively via _resolve_nested_groups, surfaces pure-composition
groups that exist only in the membership table, and reports parent/
child relationships.

Self-references (parent == child) are rejected at write time with 400.
Multi-hop cycles are caught by the read-path DFS guard from the prior
commit.

Refs BerriAI#28032
Makes the feature actually active end-to-end. get_available_models_for_user
(utils.py) and model_info_v1 (proxy_server.py) now fetch group_memberships
once per call and pass them to get_key_models / get_team_models. When
prisma_client is None (SDK-only mode) the map stays empty and behavior is
identical to today.

Also:
- Rename _resolve_nested_groups -> resolve_nested_groups since it's used
  across modules; leading underscore was misleading.
- Update NewModelGroupResponse.models_updated field comment to reflect
  the new semantics (deployment tags + membership edges, not just models).

Refs BerriAI#28032
11 unit tests for the nested-groups feature:
- resolve_nested_groups: flat compat, single/multi-hop nesting, self- and
  indirect cycles (warning logged + safe degradation), propagation,
  pure-composition groups, diamond inheritance
- _classify_member_names: router model takes precedence over group name
  when both match; mixed input splits into the three buckets
- validate_models_exist: accepts known group names; reports truly unknown;
  backwards compatible when no groups passed

All pure-function tests - no DB / no fake Prisma client needed.

Refs BerriAI#28032
Previously the visited set tracked every node ever seen during a top-level
resolution. That meant DAG-shared subtrees (A -> [B, C], B -> D, C -> D)
triggered a spurious 'cycle detected' warning on D's second visit.

Switch to on-path tracking: discard the node from visited when its frame
unwinds. Real cycles still fire the warning; DAG re-traversal is silent.
Caller already dedups the final list.

Adds a regression test (test_dag_shared_group_subtree_no_false_cycle_warning)
that asserts no warning is logged for the legitimate DAG case.

Refs BerriAI#28032
Adds 22 tests covering pathological/edge inputs to lock the algorithm
against silent regressions:

- empty inputs: all_models=[], group_memberships={}, dangling groups
- defensive: unknown seed, membership pointing to missing child,
  null router, empty child list
- cycles: 3-hop (A->B->C->A), self-ref nested deep in chain, message
  identifies the right node
- scale: 50-level linear chain (no stack overflow), 100-child fanout
- DAG / dedup: shared subtrees produce no false warnings, duplicate
  top-level groups expand independently, visited isolation between
  top-level calls
- format: unicode/special-char group names, classify preserves input
  order for error legibility
- equivalence: group_memberships=None vs {} behave identically

Total: 35 tests against pure functions, no DB needed.

Refs BerriAI#28032
ruff was unhappy with update_access_group at 52 statements. extracted
two helpers that create/update/delete all needed anyway:

- _clear_access_group_from_all_deployments (the deployment-tag loop)
- _dual_write_group_membership (classify then route to deployments
  or memberships)

endpoints are shorter now. PLR0915 happy.

refs BerriAI#28032
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ wtfashwin
❌ shin-berri
You have signed the CLA already but the status is still pending? Let us recheck it.

CI's recursive_detector forbids new recursive functions (it has caused
CPU spikes in the past). Switching to an explicit stack with ENTER/EXIT
frame markers preserves on-path cycle tracking: visited grows on ENTER,
shrinks on EXIT. Real cycles still log+skip; DAG-shared subtrees
re-traverse silently as before.

Same algorithm, same O(V + E) complexity, no Python recursion overhead.
All 35 tests pass unchanged.

Refs BerriAI#28032
@codecov
Copy link
Copy Markdown

codecov Bot commented May 16, 2026

wtfashwin added 2 commits May 17, 2026 00:37
15 tests using a hand-rolled fake Prisma client (no Postgres required):

- get_group_memberships_from_db: bucketing, empty table
- upsert_group_memberships: empty-list noop, self-reference rejection,
  batch create_many with skip_duplicates=True
- delete_group_membership_edges: OR-clause covers parent and child
- _clear_access_group_from_all_deployments: only touches deployments
  carrying the tag, no-op when nothing matches
- _dual_write_group_membership: real-models path, child-groups path,
  mixed input, null router
- get_all_access_groups_from_db: flat-only (backwards compat), nested
  expansion of model_names + parent/child fields, pure-composition
  groups (in membership table only, no deployment tags)

Total nested-group test count: 50 (35 pure-function + 15 async DB).

Refs BerriAI#28032
8 existing tests broke because they pass a plain MagicMock as the
prisma_client and the new awaited find_many trips on await.

same can hit prod - the proxy can boot before litellm-proxy-extras
finishes prisma migrate deploy, so the model is briefly missing
from the prisma client.

wrap find_many in a narrow try/except (AttributeError, TypeError),
return {} on miss. auth path falls back to today's flat-group
behavior instead of 500'ing. two regression tests so we don't
forget.

refs BerriAI#28032
@wtfashwin wtfashwin marked this pull request as ready for review May 16, 2026 19:31
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 16, 2026

Greptile Summary

This PR adds hierarchical model access groups to the LiteLLM proxy, allowing a group to include other groups as members with transitive DFS resolution and cycle-safe skipping. All five issues raised in the previous review round have been addressed: validate_models_exist now consults known_access_groups even when llm_router is None; get_group_memberships_from_db catches broad Exception; a second migration adds the missing child_group index; membership lookups are now served from a 60-second in-process TTL cache with explicit write-side invalidation; and update_access_group Step 1b no longer clears parent→child edges when the model_ids path is used.

  • Resolution logic (model_checks.py): Iterative DFS with on-path cycle detection replaces the previous flat lookup; callers already copy lists so in-place mutation is safe; deduplication via dict.fromkeys preserves order.
  • Storage & migration: New LiteLLM_AccessGroupMembership table with (parent_group, child_group) unique constraint, indexes on both columns across all three schema.prisma files, and IF NOT EXISTS guards in both migration files.
  • Test coverage: 35 pure-function tests plus dedicated files for cache semantics, DB helpers, and validate/classify logic — all mock-only, no real network calls.

Confidence Score: 5/5

All five previously flagged defects are remediated; the change is additive, backwards-compatible, and confined to listing endpoints rather than the per-request auth hot path.

The iterative DFS is correct (on-path visited tracking, safe for DAG-shared subtrees), the TTL cache invalidates synchronously on every write within the same process, the broad exception catch in the DB helper ensures the auth path cannot 500, the child_group index is present in all three schema files and both migration files, and the model_ids update path no longer silently destroys nested-group edges. No new defects were found during this review pass.

No files require special attention. The migration and schema changes are consistent across all three Prisma schema copies.

Important Files Changed

Filename Overview
litellm/proxy/auth/model_checks.py Adds iterative DFS resolve_nested_groups (cycle-safe, on-path visited tracking) and wires group_memberships kwarg through _get_models_from_access_groups, get_key_models, and get_team_models; callers already copy lists before passing, so in-place mutation is safe.
litellm/proxy/management_endpoints/model_access_group_management_endpoints.py Adds TTL-cached membership map (60s, explicit invalidation on writes), validate_models_exist now checks known_access_groups even when llm_router is None, get_group_memberships_from_db catches broad Exception, and update_access_group Step 1b only clears edges when not in model_ids path — all five previously flagged issues addressed.
litellm/proxy/utils.py Wires get_cached_group_memberships into get_available_models_for_user, gated on prisma_client is not None; falls back to empty dict preserving flat behavior when no DB is configured.
litellm/proxy/proxy_server.py Adds the same TTL-cached membership fetch to model_info_v1 and passes group_memberships to get_key_models/get_team_models; change is confined to the listing endpoint, not the per-request auth path.
litellm-proxy-extras/litellm_proxy_extras/migrations/20260517100000_index_child_group_on_access_group_membership/migration.sql Adds IF NOT EXISTS index on child_group column, fixing the previously flagged full-table-scan on delete_group_membership_edges; all three schema.prisma files include @@index([child_group]).
tests/test_litellm/proxy/auth/test_nested_access_groups_cache.py Six tests cover cache hit/miss, explicit invalidation via upsert/delete, TTL expiry, within-TTL no-refetch, and error-path empty-dict caching; uses autouse fixture to reset module-level state between tests.
tests/test_litellm/proxy/auth/test_nested_access_groups.py 35 pure-function tests covering flat compat, multi-hop nesting, diamond DAGs, cycle detection (self-ref, indirect, 3-hop), scale (50-level chain, 100-child fanout), and group_memberships=None vs {} equivalence; no real network calls.
litellm/types/proxy/management_endpoints/model_management_endpoints.py Adds parent_groups and child_groups fields to AccessGroupInfo with default_factory=list; additive-only, fully backwards-compatible.

Reviews (3): Last reviewed commit: "test(proxy): split nested-access-groups ..." | Re-trigger Greptile

Comment thread litellm/proxy/management_endpoints/model_access_group_management_endpoints.py Outdated
Comment thread litellm/proxy/utils.py
four things from the bot review:

1. validate_models_exist bailed early with (False, all_names) when
   llm_router was None - never even checked known_access_groups.
   db-only setups couldn't create a group of groups at all. now
   we check known_groups even without a router.

2. get_group_memberships_from_db only caught AttributeError +
   TypeError, so a transient DB error would 500 every /v1/models
   call. broadened to Exception - the fallback now actually
   catches what the docstring says it does.

3. delete_group_membership_edges does WHERE parent OR child but
   only parent had an index. added @@index([child_group]) + a
   tiny migration.

4. get_group_memberships_from_db was being called on every
   model-listing request. added a 60s in-process TTL cache via
   get_cached_group_memberships, with explicit invalidation on
   every membership write. same shape as the existing
   llm_router.get_model_access_groups() cache.

9 tests for the above.

refs BerriAI#28032
@wtfashwin wtfashwin force-pushed the feat/nested-access-groups branch from 4f1252c to c5e2c6c Compare May 16, 2026 19:58
@wtfashwin
Copy link
Copy Markdown
Author

@greptileai please re-review — addressed all four items from the previous review:

  1. validate_models_exist now consults known_access_groups even when llm_router is None, so DB-only deployments can create groups of groups
  2. get_group_memberships_from_db now catches Exception so transient DB errors degrade to flat-group fallback instead of 500-ing model-listing requests
  3. added @@index([child_group]) on LiteLLM_AccessGroupMembership + a migration
  4. added a 60s per-process TTL cache (get_cached_group_memberships) for the membership map with explicit invalidation on every write — hot-path callers in utils.py and proxy_server.py now go through it

61 unit tests total covering pure-function, async-DB-helper, and cache (hit/miss/invalidation/TTL/error-cache) behavior. All green locally + on CI.

Comment thread litellm/proxy/management_endpoints/model_access_group_management_endpoints.py Outdated
@oss-pr-review-agent-shin
Copy link
Copy Markdown
Contributor

🤖 litellm-agent: This PR is currently BLOCKED from merge.

Score: 3/5

Why blocked:

  • 1 PR-related CI failure (Size gate: 1 file(s) over 500 added LOC — split first (tests/test_litellm/proxy/auth/test_nested_access_groups.py (+577)). Add the oversized-ok label or a Big-PR-Approved: <handle> trailer if a maintainer has signed off.) (pr_related_failures, -2 pts)

Details: Score docked for: 1 PR-related CI failure (Size gate: 1 file(s) over 500 added LOC — split first (tests/test_litellm/proxy/auth/test_nested_access_groups.py (+577)). Add the oversized-ok label or a Big-PR-Approved: <handle> trailer if a maintainer has signed off.).

Fix the issues above and push an update — the bot will re-review automatically.

Note: This bot is still in beta and might not always work as expected. Please share any feedback via Slack.

wtfashwin added 2 commits May 17, 2026 01:50
update_access_group cleared parent->child edges unconditionally but only re-added them on the model_names path. update via model_ids would
wipe the membership table for the group and never refill it - so a
project-x -> [image, reasoning] composition vanished after a single
model_ids update. gate the edge-clear behind not use_model_ids.

adds two regression tests covering both branches.
577-LOC test file tripped the +500 added-LOC size gate. classify +
validate coverage moves to a sibling file; the original keeps the DFS
and edge-case suite. each file is now under 500 LOC.
@oss-pr-review-agent-shin
Copy link
Copy Markdown
Contributor

🤖 litellm-agent: This PR is currently BLOCKED from merge.

Score: 0/5

Why blocked:

  • all Phase B agent checks non-approving (phase_b_none_approved, -5 pts)

Details: Score docked for: all Phase B agent checks non-approving (karpathy + security + coverage gap).

Fix the issues above and push an update — the bot will re-review automatically.

Note: This bot is still in beta and might not always work as expected. Please share any feedback via Slack.

@wtfashwin
Copy link
Copy Markdown
Author

@krrishdholakia @ishaan-jaff — ready for a look when you have a moment.

Where it stands:

  • All 43 CI checks green on d0951cb
  • Greptile round 2 cleared at 4/5; the single P1 it raised (update_access_group destroying nested edges on the model_ids path) is fixed in ac807e2 with two regression tests
  • 63 unit tests across the 4 new test files, all green
  • Size gate cleared by splitting the test file (d0951cb) — every file now under 500 LOC

litellm-agent is sitting at 0/5 after the size gate cleared, with no per-agent detail on what Phase B caught (karpathy / security / coverage). For what it's worth, the codecov "missing lines" all sit in files this PR doesn't touch (mcp_server/server.py, prometheus.py, gemini transformation, user_api_key_auth_mcp.py) — pre-existing gaps in main that get attributed to anyone whose diff lands at the same time. Happy to address concrete feedback if the bot can surface specifics, but flagging in case it's a beta hiccup.

PR scope recap: hierarchical access groups via new LiteLLM_AccessGroupMembership join table + iterative DFS resolver with on-path cycle tracking + 60s TTL cache on the hot read path. Backwards compatible — callers without the new group_memberships kwarg behave identically to today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Nested model access groups

3 participants