feat(proxy): nested model access groups with cycle-safe resolution (#28032)#28075
feat(proxy): nested model access groups with cycle-safe resolution (#28032)#28075wtfashwin wants to merge 15 commits into
Conversation
[Infra] Promote internal staging to main
Holds parent/child edges between access groups so a group can include another group's models. Schema + migration only; resolution next. Refs BerriAI#28032
_get_models_from_access_groups takes a new optional group_memberships map (parent -> [child_groups]) and expands recursively via _resolve_nested_groups. Cycles are logged and skipped, not raised - this runs on the auth path and a malformed row must not 500 the proxy. Backwards compatible: callers that don't pass group_memberships get identical behavior to today. get_key_models, get_team_models thread the new kwarg through. Refs BerriAI#28032
validate_models_exist accepts known access group names so a group's members can themselves be groups. _classify_member_names splits the caller's list into real models vs child groups, and each is routed to its own write path. create_model_group / update_access_group: classify -> dual-write (deployments tagged for real models, parent->child edges for child groups). update_access_group also clears prior edges where this group is parent before re-inserting. delete_access_group: also deletes membership edges where this group appears on either side, to avoid dangling references. AccessGroupInfo gains parent_groups + child_groups (always present, empty by default). get_all_access_groups_from_db expands model_names transitively via _resolve_nested_groups, surfaces pure-composition groups that exist only in the membership table, and reports parent/ child relationships. Self-references (parent == child) are rejected at write time with 400. Multi-hop cycles are caught by the read-path DFS guard from the prior commit. Refs BerriAI#28032
Makes the feature actually active end-to-end. get_available_models_for_user (utils.py) and model_info_v1 (proxy_server.py) now fetch group_memberships once per call and pass them to get_key_models / get_team_models. When prisma_client is None (SDK-only mode) the map stays empty and behavior is identical to today. Also: - Rename _resolve_nested_groups -> resolve_nested_groups since it's used across modules; leading underscore was misleading. - Update NewModelGroupResponse.models_updated field comment to reflect the new semantics (deployment tags + membership edges, not just models). Refs BerriAI#28032
11 unit tests for the nested-groups feature: - resolve_nested_groups: flat compat, single/multi-hop nesting, self- and indirect cycles (warning logged + safe degradation), propagation, pure-composition groups, diamond inheritance - _classify_member_names: router model takes precedence over group name when both match; mixed input splits into the three buckets - validate_models_exist: accepts known group names; reports truly unknown; backwards compatible when no groups passed All pure-function tests - no DB / no fake Prisma client needed. Refs BerriAI#28032
Previously the visited set tracked every node ever seen during a top-level resolution. That meant DAG-shared subtrees (A -> [B, C], B -> D, C -> D) triggered a spurious 'cycle detected' warning on D's second visit. Switch to on-path tracking: discard the node from visited when its frame unwinds. Real cycles still fire the warning; DAG re-traversal is silent. Caller already dedups the final list. Adds a regression test (test_dag_shared_group_subtree_no_false_cycle_warning) that asserts no warning is logged for the legitimate DAG case. Refs BerriAI#28032
Adds 22 tests covering pathological/edge inputs to lock the algorithm
against silent regressions:
- empty inputs: all_models=[], group_memberships={}, dangling groups
- defensive: unknown seed, membership pointing to missing child,
null router, empty child list
- cycles: 3-hop (A->B->C->A), self-ref nested deep in chain, message
identifies the right node
- scale: 50-level linear chain (no stack overflow), 100-child fanout
- DAG / dedup: shared subtrees produce no false warnings, duplicate
top-level groups expand independently, visited isolation between
top-level calls
- format: unicode/special-char group names, classify preserves input
order for error legibility
- equivalence: group_memberships=None vs {} behave identically
Total: 35 tests against pure functions, no DB needed.
Refs BerriAI#28032
ruff was unhappy with update_access_group at 52 statements. extracted two helpers that create/update/delete all needed anyway: - _clear_access_group_from_all_deployments (the deployment-tag loop) - _dual_write_group_membership (classify then route to deployments or memberships) endpoints are shorter now. PLR0915 happy. refs BerriAI#28032
|
|
CI's recursive_detector forbids new recursive functions (it has caused CPU spikes in the past). Switching to an explicit stack with ENTER/EXIT frame markers preserves on-path cycle tracking: visited grows on ENTER, shrinks on EXIT. Real cycles still log+skip; DAG-shared subtrees re-traverse silently as before. Same algorithm, same O(V + E) complexity, no Python recursion overhead. All 35 tests pass unchanged. Refs BerriAI#28032
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
15 tests using a hand-rolled fake Prisma client (no Postgres required): - get_group_memberships_from_db: bucketing, empty table - upsert_group_memberships: empty-list noop, self-reference rejection, batch create_many with skip_duplicates=True - delete_group_membership_edges: OR-clause covers parent and child - _clear_access_group_from_all_deployments: only touches deployments carrying the tag, no-op when nothing matches - _dual_write_group_membership: real-models path, child-groups path, mixed input, null router - get_all_access_groups_from_db: flat-only (backwards compat), nested expansion of model_names + parent/child fields, pure-composition groups (in membership table only, no deployment tags) Total nested-group test count: 50 (35 pure-function + 15 async DB). Refs BerriAI#28032
8 existing tests broke because they pass a plain MagicMock as the
prisma_client and the new awaited find_many trips on await.
same can hit prod - the proxy can boot before litellm-proxy-extras
finishes prisma migrate deploy, so the model is briefly missing
from the prisma client.
wrap find_many in a narrow try/except (AttributeError, TypeError),
return {} on miss. auth path falls back to today's flat-group
behavior instead of 500'ing. two regression tests so we don't
forget.
refs BerriAI#28032
Greptile SummaryThis PR adds hierarchical model access groups to the LiteLLM proxy, allowing a group to include other groups as members with transitive DFS resolution and cycle-safe skipping. All five issues raised in the previous review round have been addressed:
Confidence Score: 5/5All five previously flagged defects are remediated; the change is additive, backwards-compatible, and confined to listing endpoints rather than the per-request auth hot path. The iterative DFS is correct (on-path visited tracking, safe for DAG-shared subtrees), the TTL cache invalidates synchronously on every write within the same process, the broad exception catch in the DB helper ensures the auth path cannot 500, the No files require special attention. The migration and schema changes are consistent across all three Prisma schema copies.
|
| Filename | Overview |
|---|---|
| litellm/proxy/auth/model_checks.py | Adds iterative DFS resolve_nested_groups (cycle-safe, on-path visited tracking) and wires group_memberships kwarg through _get_models_from_access_groups, get_key_models, and get_team_models; callers already copy lists before passing, so in-place mutation is safe. |
| litellm/proxy/management_endpoints/model_access_group_management_endpoints.py | Adds TTL-cached membership map (60s, explicit invalidation on writes), validate_models_exist now checks known_access_groups even when llm_router is None, get_group_memberships_from_db catches broad Exception, and update_access_group Step 1b only clears edges when not in model_ids path — all five previously flagged issues addressed. |
| litellm/proxy/utils.py | Wires get_cached_group_memberships into get_available_models_for_user, gated on prisma_client is not None; falls back to empty dict preserving flat behavior when no DB is configured. |
| litellm/proxy/proxy_server.py | Adds the same TTL-cached membership fetch to model_info_v1 and passes group_memberships to get_key_models/get_team_models; change is confined to the listing endpoint, not the per-request auth path. |
| litellm-proxy-extras/litellm_proxy_extras/migrations/20260517100000_index_child_group_on_access_group_membership/migration.sql | Adds IF NOT EXISTS index on child_group column, fixing the previously flagged full-table-scan on delete_group_membership_edges; all three schema.prisma files include @@index([child_group]). |
| tests/test_litellm/proxy/auth/test_nested_access_groups_cache.py | Six tests cover cache hit/miss, explicit invalidation via upsert/delete, TTL expiry, within-TTL no-refetch, and error-path empty-dict caching; uses autouse fixture to reset module-level state between tests. |
| tests/test_litellm/proxy/auth/test_nested_access_groups.py | 35 pure-function tests covering flat compat, multi-hop nesting, diamond DAGs, cycle detection (self-ref, indirect, 3-hop), scale (50-level chain, 100-child fanout), and group_memberships=None vs {} equivalence; no real network calls. |
| litellm/types/proxy/management_endpoints/model_management_endpoints.py | Adds parent_groups and child_groups fields to AccessGroupInfo with default_factory=list; additive-only, fully backwards-compatible. |
Reviews (3): Last reviewed commit: "test(proxy): split nested-access-groups ..." | Re-trigger Greptile
four things from the bot review: 1. validate_models_exist bailed early with (False, all_names) when llm_router was None - never even checked known_access_groups. db-only setups couldn't create a group of groups at all. now we check known_groups even without a router. 2. get_group_memberships_from_db only caught AttributeError + TypeError, so a transient DB error would 500 every /v1/models call. broadened to Exception - the fallback now actually catches what the docstring says it does. 3. delete_group_membership_edges does WHERE parent OR child but only parent had an index. added @@index([child_group]) + a tiny migration. 4. get_group_memberships_from_db was being called on every model-listing request. added a 60s in-process TTL cache via get_cached_group_memberships, with explicit invalidation on every membership write. same shape as the existing llm_router.get_model_access_groups() cache. 9 tests for the above. refs BerriAI#28032
4f1252c to
c5e2c6c
Compare
|
@greptileai please re-review — addressed all four items from the previous review:
61 unit tests total covering pure-function, async-DB-helper, and cache (hit/miss/invalidation/TTL/error-cache) behavior. All green locally + on CI. |
|
🤖 litellm-agent: This PR is currently BLOCKED from merge. Score: 3/5 ❌ Why blocked:
Details: Score docked for: 1 PR-related CI failure (Size gate: 1 file(s) over 500 added LOC — split first (tests/test_litellm/proxy/auth/test_nested_access_groups.py (+577)). Add the Fix the issues above and push an update — the bot will re-review automatically.
|
update_access_group cleared parent->child edges unconditionally but only re-added them on the model_names path. update via model_ids would wipe the membership table for the group and never refill it - so a project-x -> [image, reasoning] composition vanished after a single model_ids update. gate the edge-clear behind not use_model_ids. adds two regression tests covering both branches.
577-LOC test file tripped the +500 added-LOC size gate. classify + validate coverage moves to a sibling file; the original keeps the DFS and edge-case suite. each file is now under 500 LOC.
|
🤖 litellm-agent: This PR is currently BLOCKED from merge. Score: 0/5 ❌ Why blocked:
Details: Score docked for: all Phase B agent checks non-approving (karpathy + security + coverage gap). Fix the issues above and push an update — the bot will re-review automatically.
|
|
@krrishdholakia @ishaan-jaff — ready for a look when you have a moment. Where it stands:
litellm-agent is sitting at 0/5 after the size gate cleared, with no per-agent detail on what Phase B caught (karpathy / security / coverage). For what it's worth, the codecov "missing lines" all sit in files this PR doesn't touch (mcp_server/server.py, prometheus.py, gemini transformation, user_api_key_auth_mcp.py) — pre-existing gaps in main that get attributed to anyone whose diff lands at the same time. Happy to address concrete feedback if the bot can surface specifics, but flagging in case it's a beta hiccup. PR scope recap: hierarchical access groups via new LiteLLM_AccessGroupMembership join table + iterative DFS resolver with on-path cycle tracking + 60s TTL cache on the hot read path. Backwards compatible — callers without the new group_memberships kwarg behave identically to today. |
Closes #28032.
What
Adds hierarchical model access groups so a group can include other groups as members. Permission changes on a child group propagate to every parent that includes it, with no manual fan-out.
Example:
A key/team assigned
project-xnow resolves to all four real models.Design
Storage
New
LiteLLM_AccessGroupMembership(parent_group, child_group)table holds parent->child edges. Existing tag-based groups inmodel_info.access_groupsare untouched - additive, fully backwards compatible.Migration:
litellm-proxy-extras/litellm_proxy_extras/migrations/20260516210000_add_access_group_membership/migration.sqlwithIF NOT EXISTSeverywhere. Synced across all threeschema.prismafiles per the CLAUDE.md rule.Resolution
resolve_nested_groups()does DFS expansion with on-path cycle detection. Cyclic edges are logged and skipped, not raised - this runs on the auth path and a malformed row must not 500 the proxy. On-path tracking (vs. ever-visited) means legitimate DAG-shared subtrees do NOT produce spurious cycle warnings._get_models_from_access_groups()gets a new optionalgroup_membershipskwarg, defaulted toNone. When absent/empty, behavior is bit-for-bit identical to today.Write path
validate_models_exist()now accepts optionalknown_access_groupsso a group's members can themselves be groups_classify_member_names()splits the caller's list into real models / child groups / unknownupdate_deployments_with_access_group()path (tags deployments)upsert_group_memberships()path (writes edges,create_many(skip_duplicates=True))parent == child) are rejected at write time with 400Both
create_model_groupandupdate_access_groupgo through a shared_dual_write_group_membership()helper.updateclears existing parent-edges before re-inserting.deletecleans up edges where this group appears on either side viadelete_group_membership_edges().Request-path activation
get_available_models_for_user()(utils.py) andmodel_info_v1()(proxy_server.py) now fetch the membership map once viaget_group_memberships_from_db()(single query, no N+1) and pass it through toget_key_models/get_team_models. Whenprisma_client is Nonethe map stays empty and behavior is identical to today.AccessGroupInfoAPITwo new fields, always present (empty lists when no edges), additive only:
get_all_access_groups_from_db()expandsmodel_namestransitively, surfaces pure-composition groups (groups that exist only in the membership table, no deployment tags), and populates parent/child fields.Tests
35 tests in
tests/test_litellm/proxy/auth/test_nested_access_groups.py, all pure functions, no DB / no fake Prisma client needed:group_memberships=Nonevs{}behave identicallyLocal checks
Backwards compatibility
_get_models_from_access_groupswithoutgroup_membershipsget_key_models,get_team_models,get_complete_model_listwithout new kwargvalidate_models_existwithoutknown_access_groupsAccessGroupInforesponse[]defaultsmodel_info["access_groups"]storageOut of scope (deliberate)
LiteLLM_AccessGroupparent table: would require migrating every existing tag into rows + rewriting every read path. The membership-table approach gets the feature in without that disruptionuser_api_key_auth: current activation is on the listing/info endpoints. Hot-path would need router-level caching of the membership map; can follow up if maintainers want itChecklist
tests/test_litellm/proxy/auth/test_nested_access_groups.py)make lintpasses for changed files (ruff clean, black clean on touched files)schema.prismafiles + migration addedcreate_many(skip_duplicates=True)for batch writes