[superseded by #24642 + #24643] feat: auto-archive chats#24486
[superseded by #24642 + #24643] feat: auto-archive chats#24486johnstcn wants to merge 42 commits into
Conversation
Wire database.Chat into the audit system so chat lifecycle events
(manual patches, background auto-archives, deletions) produce audit
log entries.
- Add 'chat' value to the resource_type enum (migration 000473).
- Register Chat in the Auditable union, AuditActionMap, and
AuditableResources with per-field tracking that ignores churny
internal fields (status, heartbeat_at, updated_at,
last_model_config_id, ...) and tracks user-meaningful fields
(title, archived, pin_order, mode, labels, mcp_server_ids, ...).
- Implement ResourceTarget/ResourceID/ResourceType/ResourceRequiresOrgID
cases for database.Chat.
- Implement auditLogIsResourceDeleted via GetChatByID (chats are
hard-deleted; missing row == deleted).
- Implement auditLogResourceLink -> /chats/{id}.
- Wire audit.InitRequest[database.Chat] into the patchChat handler
so manual updates (title, archived, pin_order, labels, workspace
binding) produce audit records.
- Introduce BackgroundSubsystemChatAutoArchive for the upcoming
dbpurge integration, and broaden the background-audit log line
from 'dormancy audit' to 'background audit'.
- Regenerate audit-logs.md, swagger, schemas.md, and typesGenerated.ts.
Part of CODAGT-200.
6018381 to
ad22b73
Compare
9fd96ff to
e4a0406
Compare
e4a0406 to
a18186c
Compare
Adds the schema and queries needed to auto-archive inactive chats from dbpurge, following the plan in /tmp/plans/chat-auto-archive.md (not wired up yet — that's Phase 3). * New migration 000471 creates the per-owner digest dedupe log. * New chats.sql query AutoArchiveInactiveChats drives the whole show from a LATERAL MAX(chat_messages.created_at) over each family. All roles count; soft-deleted messages do not. * New siteconfig.sql queries for the 'agents_chat_auto_archive_days' knob, default 90, 0 disables. Mirrors the existing retention knob. * New chat_auto_archive_digest_log.sql for batch read + upsert of the per-owner dedupe rows. The actual dbpurge wiring, HTTP endpoints, notification template, and dispatch logic land in subsequent commits.
…igest_log TestMigrateUpWithFixtures asserts that every table has at least one row after running through all migrations. Migration 000473 creates chat_auto_archive_digest_log without seeding any data, so the test flagged it as empty. Seed a single dedupe row against the first available user so the table is covered.
…ions into one
Collapse 000473 (digest log table), 000474 (digest notification
template), and 000475 (partial index on chats) into a single
000473_chat_auto_archive.{up,down}.sql. Reviewers only have to
reason about one file; the three pieces are tightly coupled to the
same feature anyway.
notification_messages already deduplicates on (template_id, user_id, method, payload, day) via an MD5 hash trigger (migration 000245), which is exactly the guarantee our custom chat_auto_archive_digest_log table was providing - and users who find the digest noisy can disable the template entirely from their notification preferences. The custom 24h sliding window was only measurably better than native dedupe during a large initial backfill where the same owner gets hit across many ticks with payloads that happen to differ. That scenario needs a deployment with thousands of stale chats to matter; under early-access adoption (feature off by default, customers hop on deliberately), it does not. - Drop migration of chat_auto_archive_digest_log from 000473. - Drop GetChatAutoArchiveDigestLogsForOwners and UpsertChatAutoArchiveDigestLog queries, their dbauthz wrappers, and the dbauthz test cases. - Simplify dispatchChatAutoArchive to just audit + enqueue per owner; drop the unused db parameter. - Drop chatAutoArchiveDigestDedupe constant and the skipped_deduped / skipped_dedupe_error Prometheus labels. - Drop the migration fixture (table no longer exists). - Drop the DedupesDigestWithin24h dbpurge subtest and the dedupe-log assertion inside ArchivesInactiveRoot. - Update the auto-archive docs to describe native per-day dedupe and the per-template opt-out. Net: -288 LoC.
- Drop coderd_chat_auto_archive_digests_sent_total. The native
coderd_notifications_inflight_dispatches and
coderd_notifications_retry_count are both labeled with
notification_template_id, so filtering by the chat auto-archive
template ID gives the same data without an extra per-feature
metric. Only coderd_chat_auto_archive_records_archived_total
stays because nothing else counts chats archived per tick.
- Fix the duplicate greeting in the digest template. The SMTP and
webhook wrappers already prepend 'Hi {{.UserName}},', so the
body_template must not repeat it.
- Drop the 'View archived chats' action button. The
/chats?archived=true route isn't supported in the frontend yet
and adding that view is out of scope for this PR.
…count Deep review round 2 follow-up. - Drop stale dedupe-log references from constant, function, and dispatch-block comments. The custom dedupe table was removed in 5831e0d; comments now say what the code actually does. - Remove unused int32 parameters from buildDigests. They were retained as blank identifiers during simplification but never used; moved to buildDigestData which is the real consumer. - Compress verbose comments on constants, helpers, and types without losing meaningful content. dbpurge.go: 566 -> 544 LoC. - Render additional_archived_count in the digest body template. buildDigestData already computes the overflow count, but the template never consumed it, so a user archiving 100 chats at once would silently see only the first 25 titles with no hint more existed. Now the body ends with "...and N more." when applicable.
Deep review round 3. - Rename chatsAutoArchiveBatchSize to chatAutoArchiveBatchSize so all four chatAutoArchive* constants share a consistent naming convention (flagged by go-architect as a minor inconsistency). - Add TestAutoArchiveInactiveChats/DigestOverflowCap. The overflow branch of buildDigestData (cap titles at 25, surface extras via additional_archived_count) was untested; this archives 27 roots and asserts the digest payload caps at 25 titles with a "2" overflow count.
… auth Address review feedback: - Drop swagger annotations from the auto-archive-days handlers; document the endpoints in docs/ai-coder/agents/chats-api.md instead. Swagger will pick this up when the API stabilizes. - Grant UPDATE+READ on rbac.ResourceChat to subjectDBPurge and drop AsSystemRestricted from the auto-archive call; the ticker's AsDBPurge ctx now has everything it needs. - Replace the custom labelsFromRaw helper with StringMap.Scan, which already handles []byte and nil. - Compress the SQL comment block on AutoArchiveInactiveChats into a three-line header plus short inline annotations on the non-obvious predicates. - Trim verbose Go docstrings across dbpurge.go: New, the dispatch block, chatFromAutoArchiveRow, dispatchChatAutoArchive, buildDigests, buildDigestData, and the constant block.
…ctiveChats The comment referenced subjectSystemRestricted, but dbpurge now runs with subjectDBPurge (which was granted ResourceChat read+update in the same round).
…s retention clock DeleteOldChats and DeleteOldChatFiles both key off chats.updated_at, which AutoArchiveInactiveChats bumps to NOW() on archive (mirroring manual archive). Users/admins deserve to know the clock restarts. Follow-up tracked in CODAGT-222: add an archived_at column so the purge queries can stop using updated_at as a proxy.
a18186c to
015ee07
Compare
Renumber 000473_chat_auto_archive to 000474 after main landed 000473_mcp_server_allow_in_plan_mode. Conflict resolutions: - coderd/audit.go: take main's terser delete detection and the /agents/ route. - coderd/exp_chats.go: take main's inlined *api.Auditor.Load() and its post-update DB refresh before stamping aReq.New. - coderd/exp_chats_test.go: keep TestChatAutoArchiveDays and main's new Title subtest block; take main's nolint comment on TestUserChatCompactionThresholds. - enterprise/audit/table.go: take main's Chat action map (create, write) and plan_mode = ActionIgnore. - docs/admin/security/audit-logs.md: regenerated. - docs/ai-coder/agents/chats-api.md: keep Configuration section; take main's status table (adds requires_action). Also strip emdashes from lines surfaced by the new lint/emdash check (main added scripts/check_emdash.sh since our merge base): three in coderd/x/chatd/chatd.go and three in our docs.
Adds three subtests to TestAutoArchiveInactiveChats: - MultipleOwners: two owners in one org each get their own digest with only their chats; audit count matches archived roots. - SecondTickIdempotent: after tick 1 archives two roots, a third root seeded between ticks archives on tick 2 without re-archiving or re-notifying the first two. - BatchSizePagination: with batch size 20 and 27 stale roots, tick 1 archives 20, tick 2 archives the remaining 7, tick 3 is a no-op. To drive multi-tick tests deterministically, introduce a tickDriver helper that installs quartz traps *before* dbpurge.New so the forced initial tick and each loop iteration can be observed without racing with asynchronous ticker.Reset calls. tickDriver is separate from awaitDoTick because awaitDoTick conflates the forced tick and the first loop iteration, which is fine for idempotent single-tick tests but hides per-tick state when batch size is limiting. chatAutoArchiveBatchSize becomes a var (from const) so SetChatAutoArchiveBatchSizeForTest in export_test.go can shrink it for pagination tests without inserting thousands of rows.
|
/coder-agents-review |
There was a problem hiding this comment.
First-pass review (Netero only). 1 P2, 1 Nit. The full review panel has not yet reviewed this PR; it will review after these findings are addressed.
The auto-archive plumbing is well-structured: good test coverage ratio, correct batch pagination with idempotent second-tick verification, clean RBAC grant scoping on the dbpurge subject, and a sensible digest overflow cap. The chatFromAutoArchiveRow mapper and site_config pattern follow established conventions.
1 P2: notification template renders misleadingly when retention is disabled (retention_days=0).
1 Nit: dbauthz comment inaccurately describes access control.
"The notification tells users their data will be immediately deleted when it will not." (Netero)
🤖 This review was automatically generated with Coder Agents.
DEREM-1 (P2): the auto-archive digest body said "within 0 days, after which they will be permanently deleted" when retention_days=0, which per our docs means chats are kept indefinitely. Admins using auto_archive_days>0 with retention_days=0 would mislead every notified user. Branch the template on retention_days="0" to render the correct indefinite-retention text. Add a second golden fixture (TemplateChatAutoArchiveDigestRetentionZero) to capture that branch. DEREM-2 (Nit): GetChatAutoArchiveDays in dbauthz said it was "surfaced to any authenticated user via the experimental HTTP endpoint (mirroring GetChatRetentionDays)" but the HTTP handler is admin-only (policy.ActionUpdate on ResourceDeploymentConfig), unlike getChatRetentionDays which has no check. Correct the comment.
… landed 000474_drop_chat_model_config_provider_fk
|
/coder-agents-review |
There was a problem hiding this comment.
Second first-pass review (Netero only, round 2). R1 findings addressed; 1 new P2.
DEREM-1 and DEREM-2 from round 1 are both fixed. The template now branches correctly on retention_days=0, and the dbauthz comment accurately describes the admin gate.
New finding: the chat status docs table was rewritten but dropped requires_action, which is still defined and actively checked.
This is a first-pass review only. The full review panel has not yet reviewed this PR and will review after this finding is addressed.
"An API client reading the updated docs will not know what requires_action means when the API returns it." (Netero)
🤖 This review was automatically generated with Coder Agents.
…api): restore requires_action Address DEREM-3 (automatic review): restore 'requires_action' row in the chats-api.md status table. My prior merge conflict resolution accidentally dropped it while rewriting the column widths; the value is still a valid ChatStatus (codersdk/chats.go, exp_chats.go), so callers need the row in the docs. Extract newArchiveHarness for TestAutoArchiveInactiveChats subtests. Each subtest now reads ctx/clk/db/rawDB/logger/deps off a harness instead of repeating the same 7-line setup. Net diff-size is a wash (roughly neutral), but subtest bodies stand out more clearly.
|
/coder-agents-review |
There was a problem hiding this comment.
Full panel review (round 3, 21 reviewers). All prior findings (DEREM-1 through DEREM-3) confirmed fixed.
The architecture is clean: config read outside the transaction, bulk archive inside, post-commit dispatch for audits and notifications. The dependency direction is correct (dbpurge calls into audit and notifications, neither reaches back). The SQL CTE correctly separates root selection from cascade. Test coverage is thorough: disabled state, active/inactive distinction, pinned exemption, child cascading, idempotency, batch pagination, multi-owner digests, overflow caps, and golden fixtures for both retention branches.
3 P2, 4 P3, 2 Nit. The P2s: the archive query is blind to chat status (orphaning requires_action tool calls), the LATERAL subquery scales poorly on large deployments, and the key correctness property (child activity keeps root alive) has no test. The P3s: dispatch timeout arithmetic, template grammar, auditor entitlement bypass, and a doc/code contradiction.
"A chat whose status is running but whose last committed message predates the archive cutoff will be archived." (Razor)
🤖 This review was automatically generated with Coder Agents.
| require.Equal(t, deps.user.ID, sent[0].UserID) | ||
| }, | ||
| }, | ||
| { |
There was a problem hiding this comment.
P2 [DEREM-4] No test verifies that a child's recent message keeps the root alive. The LATERAL subquery computes family activity via fc.id = c.id OR fc.root_chat_id = c.id. If the fc.root_chat_id = c.id branch were accidentally dropped, roots with active children would be silently archived and every existing test would still pass.
ArchivesInactiveRoot puts messages on the root itself. SkipsPinnedAndChildren creates a stale root+child pair with no messages. Neither creates a root with old created_at whose child has a recent message, then asserts the root survives.
"If the LATERAL join were broken for the child-activity case, every existing test would still pass." (Meruem)
Sketch: create a stale root (no messages), link a child, insert a recent message on the child, run a tick, assert the root is NOT archived. (Meruem P2, Bisky P3, Kite P3, Ryosuke Note)
🤖
| WHERE (fc.id = c.id OR fc.root_chat_id = c.id) | ||
| AND cm.deleted = false | ||
| ) activity ON TRUE | ||
| WHERE c.archived = false |
There was a problem hiding this comment.
P2 [DEREM-5] The archive query has no status predicate. A chat in requires_action (agent invoked a client tool, waiting for the human to respond) or paused (waiting for user input) sits idle until the user returns. With the minimum allowed window of 1 day, a user who steps away overnight finds their chat archived, the agent's pending tool call orphaned, and no explanation why.
For running/pending, the gap is academic: a truly active agent produces messages continuously, keeping timestamps fresh. But requires_action and paused are designed to wait arbitrarily long.
Fix: add AND c.status NOT IN ('running', 'pending', 'paused', 'requires_action') to the to_archive CTE. The partial index idx_chats_auto_archive_candidates would need a matching predicate update. Alternatively, if the intent is truly message-only activity, update the docs (see DEREM-10). (Hisoka P2, Razor P3, Mafuuu P3)
🤖
| AND cm.deleted = false | ||
| ) activity ON TRUE | ||
| WHERE c.archived = false | ||
| AND c.pin_order = 0 |
There was a problem hiding this comment.
P2 [DEREM-6] The LATERAL subquery evaluates for every non-archived, non-pinned root before the LIMIT is applied. The partial index covers the WHERE predicates, but without c.created_at < @archive_cutoff in the CTE's WHERE, the planner does a full index scan instead of a range scan.
Since chat_messages.created_at >= chat.created_at is always true, c.created_at < cutoff is logically redundant but optimizer-critical. Without it: deployment with 100k active root chats, 90-day cutoff, ~99% created within 90 days = ~100k LATERAL evaluations inside the advisory-locked transaction. At ~3ms per LATERAL under load = ~300s holding the advisory lock. With it: ~1k LATERAL evaluations = ~3s.
Fix: add AND c.created_at < @archive_cutoff::timestamptz to the to_archive CTE's WHERE clause alongside the existing COALESCE filter. (Killua P2, Knov P3, Knuckle P3)
🤖
| } | ||
| } | ||
|
|
||
| // dispatchChatAutoArchive audits every archived chat and enqueues one digest |
There was a problem hiding this comment.
P3 [DEREM-7] The dispatch loop is synchronous: each BackgroundAudit call does GetUserByID + Export per archived chat. With 1000 roots and cascaded children (no cap on children per root), the slice can well exceed 1000. At 1000 roots with an average of 3 children, that is 4000 rows x 2 queries x 5ms = 40s against the 30s chatAutoArchiveDispatchTimeout. When the context expires, remaining audit entries and digest notifications are silently dropped while the archive itself is committed. The user's chats vanish from their list without the promised notification.
Realistic primarily on the first run against a large backlog. Could cache GetUserByID results per owner (repeated calls for same owner's chats), or batch with per-batch sub-deadlines. (Hisoka P3, Killua P3, Luffy P3)
🤖
| '764031be-4863-4220-867b-6ce1a1b7a5f5', | ||
| 'Chats Auto-Archived', | ||
| E'Chats auto-archived after {{.Data.auto_archive_days}} days of inactivity', | ||
| E'The following chat{{if ne (len .Data.archived_chats) 1}}s were{{else}} was{{end}} automatically archived because {{if ne (len .Data.archived_chats) 1}}they have{{else}}it has{{end}} been inactive for more than {{.Data.auto_archive_days}} days:\n\n{{range .Data.archived_chats}}* "{{.title}}" (last active {{.last_activity_humanized}})\n{{end}}{{with .Data.additional_archived_count}}...and {{.}} more.\n\n{{end}}\n{{if eq .Data.retention_days "0"}}You can restore any of them from the Chats page; archived chats are kept indefinitely.{{else}}You can restore any of them from the Chats page within {{.Data.retention_days}} days, after which they will be permanently deleted.{{end}}', |
There was a problem hiding this comment.
P3 [DEREM-8] Two template grammar issues. (1) The closing sentence is always plural: "You can restore any of them from the Chats page." When exactly one chat is archived, the singular intro ("The following chat was") is followed by "any of them," which reads broken. Fix: branch the closing too. Singular: "You can restore it." Plural: "You can restore any of them." (2) Day pluralization: {{.Data.auto_archive_days}} days and {{.Data.retention_days}} days both hardcode "days" plural. An admin who sets either value to 1 gets "after 1 days of inactivity." Fix: {{if eq .Data.auto_archive_days "1"}}day{{else}}days{{end}} in each slot. (Leorio P3)
🤖
| // The auditor and enqueuer drive the chat auto-archive audit entries and | ||
| // per-owner digest notifications; pass noop impls to disable those side | ||
| // effects without disabling the archive. | ||
| func New(ctx context.Context, logger slog.Logger, db database.Store, vals *codersdk.DeploymentValues, clk quartz.Clock, reg prometheus.Registerer, auditor audit.Auditor, enqueuer notifications.Enqueuer) io.Closer { |
There was a problem hiding this comment.
P3 [DEREM-9] dbpurge.New accepts audit.Auditor directly and stores it for the process lifetime. Other background subsystems (ActivateDormantUser in coderd/userauth.go:642, autobuild.NewExecutor) accept *atomic.Pointer[audit.Auditor] and call .Load() at invocation time, respecting the enterprise entitlement toggle that swaps between the real auditor and nop. When the audit-log feature is toggled off at runtime, api.AGPL.Auditor.Store(&nop) fires, but dbpurge still holds the original real auditor. Auto-archive audit entries keep flowing into audit_logs while every other subsystem stops auditing.
Fix: accept *atomic.Pointer[audit.Auditor] and .Load() on each tick inside dispatchChatAutoArchive. (Ryosuke P2, Meruem Note)
🤖
| alongside their parent so the conversation stays coherent. | ||
|
|
||
| Activity is defined as the most recent non-deleted message in the | ||
| conversation family, counting messages from every role. If an agent is |
There was a problem hiding this comment.
P3 [DEREM-10] "If an agent is still generating a response, the chat counts as active even if the human-authored message is old." This implies a status-aware check. The SQL query (AutoArchiveInactiveChats) determines activity solely via MAX(cm.created_at) on committed, non-deleted chat_messages. It never inspects chats.status. A chat whose status is running but whose last committed message predates the cutoff will be archived.
If the status filter from DEREM-5 is added, this doc claim becomes accurate. If not, reword to: "Activity is measured by the most recent non-deleted message. Chats with recent agent-generated messages remain active." (Razor P3, Hisoka Nit)
🤖
There was a problem hiding this comment.
A chat whose status is running but whose last committed message predates the cutoff will be archived.
Hard to think about how that's possible in practice, but defence in depth is probably no harm.
| var labels database.StringMap | ||
| // sqlc's StringMap override doesn't reach CTE-aliased columns, so Labels | ||
| // arrives as raw JSON bytes. StringMap.Scan handles []byte and nil. | ||
| _ = labels.Scan([]byte(r.Labels)) |
There was a problem hiding this comment.
Nit [DEREM-11] _ = labels.Scan([]byte(r.Labels)) discards parse errors. If Labels is ever corrupt, the audit entry records empty labels with no diagnostic trail. A logger.Warn on non-nil error would surface the issue without changing control flow. The comment explains the sqlc CTE limitation, but not the consequence of the discarded error. (Chopper, Gon, Kite, Knov, Meruem, Ryosuke, Razor)
🤖
| return q.db.ArchiveUnusedTemplateVersions(ctx, arg) | ||
| } | ||
|
|
||
| func (q *querier) AutoArchiveInactiveChats(ctx context.Context, arg database.AutoArchiveInactiveChatsParams) ([]database.AutoArchiveInactiveChatsRow, error) { |
There was a problem hiding this comment.
Nit [DEREM-12] Comment says "ActionRead on chat_messages is covered by the ResourceChat grant on subjectDBPurge." There is no RBAC resource for chat_messages. The ActionRead on ResourceChat authorizes reads of chats rows. The SQL LATERAL join on chat_messages happens below the RBAC boundary. More accurate: "The LATERAL read of chat_messages rows is covered by ActionRead on ResourceChat." (Mafuuu, Razor)
🤖
|
Superseded by two stacked PRs to stay under the ~1000 non-generated LoC per-PR ballpark:
Closing this one. Prior review comments on this PR have been addressed in the split; no need to re-review here.
|
Superseded by #24642 + #24643. Closed.