feat(coderd): wire debug logging into chat lifecycle#23917
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1af282f556
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
42c633c to
acb43eb
Compare
3b28ca3 to
667de76
Compare
601aaa9 to
8eaae76
Compare
667de76 to
4bbd4ec
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4bbd4ec6d7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
4bbd4ec to
7936c1c
Compare
8eaae76 to
46bbcd9
Compare
7936c1c to
fc75844
Compare
46bbcd9 to
7e29de7
Compare
fc75844 to
48874e2
Compare
ebcbba1 to
df50d55
Compare
a7ff31c to
4f145c1
Compare
df50d55 to
0bd01b6
Compare
|
@codex review |
mafredri
left a comment
There was a problem hiding this comment.
Round 22 (panel). F20 addressed (91bbdf1, time.Now -> s.clock.Now). F18 closed: Netero verified the step handle uses a mutex with status priority checks, not sync.Once. The finding's premise was factually incorrect; later retries do overwrite earlier failures.
F17 (P2, test coverage) re-raised by panel vote 3/4 (Bisky, Mafuuu, Razor re-raise; Takumi defers). Author defense that "existing chat flow tests" cover the wiring is factually wrong: grep -rn 'debugSvc|debugEnabled|chatdebug' coderd/x/chatd/chatd_internal_test.go returns zero results. No chatd-level test enables debug mode. The contradiction is that this PR's own compaction_test.go adds direct tests for startCompactionDebugRun with mock DB expectations, while claiming implicit coverage suffices for the more complex chatd paths (lazy service init, edit/archive cleanup retries, panic recovery). This needs a human decision: file a ticket or add tests.
New from Netero: 3 Nits (em-dashes, Go 1.22+ loop capture), 1 P3 (dead code). New from panel: 2 P3s (ClassifyError untested, started_before SQL bound untested).
Concurrency verified clean by Takumi. The F11 ctx snapshot fix holds. Lazy debug service init via sync.Once + atomic store is correct. FinalizeRun's context detachment is correct. All retry loops are bounded.
"The author applied direct testing to the compaction debug path but claims implicit coverage suffices for the chatd debug paths. The chatd paths are more complex (lazy service init, edit/archive cleanup retries, panic recovery, ErrDynamicToolCall mapping) and more likely to contain latent bugs." -- Razor
🤖 This review was automatically generated with Coder Agents.
|
@codex review |
mafredri
left a comment
There was a problem hiding this comment.
Round 23. Churn guard verdict: BLOCKED.
DEREM-1 through DEREM-7 all addressed (516f680, 91eb54d). Em-dashes fixed, loop capture removed, dead code removed, ClassifyError tested, FinalizeRun tested, started_before SQL bound tested.
F17 (P2, chatd-level debug integration test coverage) remains silent. The panel re-raised this finding in R22 with a 3/4 vote. The author pushed fixes for DEREM-5, DEREM-6, and DEREM-7 (component-level tests) but did not respond to the chatd-level integration gap F17 identifies. No chatd test enables debug mode. This needs a human decision: add chatd-level debug integration tests, file a ticket, or explain why implicit coverage is sufficient (the prior defense was rejected by the panel with evidence).
Further review is blocked until the author responds to F17.
coderd/x/chatd/chatd.go:5980
P3 [DEREM-4] Author fixed (516f680). Redundant nil guard removed.
🤖
coderd/database/querier_test.go:11422
P3 [DEREM-7] Author fixed (91eb54d). started_before tests added.
🤖
🤖 This review was automatically generated with Coder Agents.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 91eb54de5b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cf8d0e4c51
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
mafredri
left a comment
There was a problem hiding this comment.
Round 24. Churn guard verdict: BLOCKED (3rd consecutive round since panel re-raise).
F17 (P2, chatd-level debug integration test coverage) remains the sole open finding. Silent for 3 rounds since the R22 panel re-raised it 3/4. The author pushed cf8d0e4 (modifies chatd.go debug cleanup logic) but added zero chatd-level debug integration tests and provided no response to F17.
CI is also failing (18 of 49 checks). This may need attention independent of F17.
This needs a human decision: add chatd-level debug integration tests, file a ticket to track the gap, or provide a new technical defense.
🤖 This review was automatically generated with Coder Agents.
|
@codex review |
mafredri
left a comment
There was a problem hiding this comment.
Round 25. Churn guard verdict: BLOCKED (4th consecutive round since R22 panel re-raise).
F17 (P2, chatd-level debug integration test coverage) remains the sole open finding. 4 more commits pushed since R24 modifying chatd.go, chatd_debug.go, and query files. Zero test additions. No author response to F17.
The panel's evidence stands: grep -rn 'debugSvc|debugEnabled|chatdebug' coderd/x/chatd/chatd_internal_test.go returns zero results. No chatd test enables debug mode. The author's prior defense ("tested implicitly through existing chat flow tests") was rejected by the panel 3/4 in R22.
This needs a human decision: add chatd-level debug integration tests, file a ticket, or provide a new defense.
🤖 This review was automatically generated with Coder Agents.
|
Codex Review: Didn't find any major issues. Already looking forward to the next diff. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ddddcabff
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
mafredri
left a comment
There was a problem hiding this comment.
Round 26 (panel). F17 addressed: 240 lines of chatd-level integration tests verify debug cleanup wiring against real Postgres. All prior findings are now resolved, fixed, deferred with ticket (#24137), or panel-closed.
Netero: no findings. All 8 sections clean. Production-to-test ratio 1.4:1.
Panel (5 reviewers): Mafuuu, Pariston, Takumi, and Kite found no issues. Bisky found 2 P3s (row count verification in integration tests, shouldPublishFinishedChatState error path) and 2 Nits (time.Now in fixtures, test boilerplate). Concurrency verified clean by Takumi (sync.Once+atomic, inflightMu, panic recovery chain, ctx snapshot, FinalizeRun context detachment). Pariston evaluated 4 framings of the problem and concluded the solution is proportional and addresses the race at the correct causal level.
"I tried to build a case against this change and could not. The problem is correctly understood. The solution is proportional. The cleanup mechanism addresses the race at the right causal level with appropriate safety margins." -- Pariston
🤖 This review was automatically generated with Coder Agents.
|
@codex review |
mafredri
left a comment
There was a problem hiding this comment.
Round 27. Churn guard verdict: BLOCKED.
DEREM-9 (row count assertion) and DEREM-10 (error path test) both addressed in ac3cd17.
DEREM-12 (Nit, time.Now() in test fixtures instead of deterministic clock) and DEREM-13 (Nit, duplicated setup boilerplate in SQL tests) are silent. These are convention nits, not correctness issues. A response (fix, acknowledge, or explain) unblocks the review.
🤖 This review was automatically generated with Coder Agents.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac3cd17a9b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 07db67955a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
mafredri
left a comment
There was a problem hiding this comment.
Round 28 (panel). All prior findings resolved, fixed, deferred (#24137), panel-closed, or author-accepted. DEREM-12 and DEREM-13 acknowledged with substantive reasoning.
Netero: no findings (all 8 sections clean). Panel (4 reviewers): Bisky and Takumi found no issues. Mafuuu found 1 P3, Razor found 1 Nit. Test quality praised by Bisky across all suites.
The PR is in excellent shape. Concurrency verified clean (Takumi R26, R28). Design validated as proportional (Pariston R26). Test coverage ratio 1.4:1 (Netero R26, R28). The sole new finding is a consistency gap in timeout handling.
"Oh, this suite is lovely. Round 28, and it shows." -- Bisky
🤖 This review was automatically generated with Coder Agents.
|
Codex Review: Didn't find any major issues. 🎉 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
mafredri
left a comment
There was a problem hiding this comment.
Round 29 (panel). DEREM-14 and DEREM-15 addressed (7d4c231). Zero open findings. CI green.
Netero: no findings (all 8 sections clean, code:test ratio 1.3:1). Panel (5 reviewers: Bisky, Mafuuu, Nami, Ging-React, Knov): no findings from any reviewer.
Knov verified all contracts: FinalizeRun consolidation, ClassifyError boundaries, clock-skew buffer arithmetic, scheduleDebugCleanup WaitGroup/retry bounds, shouldPublishFinishedChatState fail-open semantics, archive cleanup loop capture, CompactionOptions mutation ordering. All hold.
This PR is clean. Every finding raised across 29 rounds is resolved (code fixed), deferred with ticket (#24137), panel-closed, or author-accepted. No open issues remain.
"The contracts are honest and narrow." -- Knov
🤖 This review was automatically generated with Coder Agents.
Change-Id: I36789a7fd8d2b2a94a5ebce1a3b72726c614a2a4 Signed-off-by: Thomas Kosiewski <tk@coder.com>
Codex P2: the edit and archive cleanup cutoffs were sampled after the
transaction had already committed and pubsub events had fired. A
replacement turn (or unarchive) that raced ahead in the publish window
could start a debug run with started_at earlier than the cutoff,
letting the retry cleanup delete the replacement's fresh debug data.
Move both editCutoff and archiveCutoff above InTx so they are
guaranteed to precede the commit_time. Any replacement run must have
started_at > commit_time > cutoff, so the started_before bound on
Delete{AfterMessageID,ByChatID} continues to filter only pre-edit /
pre-archive rows.
Also fix the spellcheck lint regression by replacing "cancelled" with
"canceled" in the TestClassifyError table.
Change-Id: I5953035a6f09c9491d1db62feebe664215e06a6d
Signed-off-by: Thomas Kosiewski <tk@coder.com>
…uffer Codex (round 23) P2: editCutoff was sampled from this replica's local clock, while chat_debug_runs.started_at is stamped by whichever replica processes the replacement turn. If that worker's clock lags ours, the replacement's started_at can fall behind editCutoff and the retry cleanup ends up deleting current debug data. Source both cutoffs from the database row's updated_at returned by UpdateChatStatus / ArchiveChatByID, i.e. the same transaction-start NOW() timestamp the DB already uses, so the filter no longer depends on this replica's clock. Subtract a new debugCleanupClockSkew buffer (30s) so residual drift between the DB and the replica that stamps started_at cannot cause a false delete: rows that fall inside the buffer simply survive the fast retry and are handled by the existing FinalizeStale sweep. Change-Id: Ib7fb36784a7389cefff68988dbd1761b3eb35e2e Signed-off-by: Thomas Kosiewski <tk@coder.com>
Address the panel's F17 coverage gap: no chatd-level test exercised the debug logging wiring (lazy Server.debugService init, editCutoff / archiveCutoff sampling, scheduleDebugCleanup retry loop, message-id filter interaction) with debug mode actually enabled. Adds three integration tests driven through the public Server API against a real Postgres fixture. A new newDebugEnabledTestServer helper enables AlwaysEnableDebugLogs so IsEnabled returns true for every chat without seeding the admin/user opt-in settings tables. - TestEditMessageDebugCleanupDeletesPreEditRuns: seeds a stale debug run tied to the message-to-be-edited and an unrelated run on an earlier branch, calls EditMessage, drains the background goroutine via WaitUntilIdleForTest, and asserts the pre-edit run is deleted while the unrelated run survives the message-id filter. - TestEditMessageDebugCleanupPreservesRecentRuns: seeds a run whose started_at falls inside the 30s debugCleanupClockSkew buffer and asserts it survives the fast retry (deferred to the stale sweeper). - TestArchiveChatDebugCleanupDeletesPreArchiveRuns: asserts the archive path behaves symmetrically - stale runs outside the buffer are deleted, runs inside the buffer survive. Change-Id: Ic59fb53569c6ec7e079306a775712c2693a14c87 Signed-off-by: Thomas Kosiewski <tk@coder.com>
The deferred finalizer in generateCompactionSummary classified the run by the named err return, which stays nil when model.Generate panics. The panic is recovered higher up in processChat and the turn fails, yet the debug row was silently recorded as StatusCompleted in the exact crash path operators rely on to diagnose failures. Recover inside the defer, finalize the run with a panic-derived error (which ClassifyError maps to StatusError), and re-panic so the caller's existing recovery still observes the original value. Also address the P3 test-coverage gaps the R26 panel flagged: - Row-count survivor assertion in the three chatd-level debug cleanup integration tests so the fast-retry path's delete is verified directly, not just by negative lookup. Scoped to seeded IDs so a processor-started chat_turn run (triggered by the pending status transition) doesn't mask the assertion. - TestShouldPublishFinishedChatState_DBErrorPublishes pins the deliberate fail-open behavior on GetChatByID error. - TestGenerateCompactionSummary_PanicFinalizesAsError exercises the new panic path end to end: FakeModel panics, the deferred recover fires FinalizeRun with StatusError, and the panic still reaches the caller. Change-Id: Id6d26c806686f4d0aadf80f1f8a4dcbf9244fcdb Signed-off-by: Thomas Kosiewski <tk@coder.com>
Debug instrumentation must not be able to stall a user turn when the DB is slow or locked. Two paths still ran the debug insert on the same context that carried the user turn's (or compaction's) time budget: - prepareChatTurnDebugRun called debugSvc.CreateRun on the main chat processing context before chatloop.Run started, so a slow insert delayed the first model call. - startCompactionDebugRun reused the compaction summaryCtx (which already carries the compaction timeout), so a slow insert consumed most of the budget and pushed model.Generate into deadline exceeded. Match the manual-title pattern: detach via context.WithoutCancel and apply a short bounded timeout (5s) for the insert only. Debug persistence failures now degrade silently (no debug row for that turn) instead of degrading the turn itself. Change-Id: Ie32108faaa1eb22df13d343f7730c6a8bc812cbb Signed-off-by: Thomas Kosiewski <tk@coder.com>
Apply the same detached-timeout pattern to the quickgen debug path: prepareQuickgenDebugCandidate ran debugSvc.CreateRun on the caller context (titleCtx / summaryCtx, which already carry a 30s quickgen budget), so a slow DB insert could eat most of that budget and delay title generation or push-summary dispatch per candidate. Wrap the insert in context.WithoutCancel + debugCreateRunTimeout (5s) matching the chat-turn, manual-title, and compaction paths. Change-Id: I44f7b52fce2efaac77903fc7e0e8c8431d84814b Signed-off-by: Thomas Kosiewski <tk@coder.com>

Summary
Instrument the chat server, chat loop, compaction, quickgen, and chat provider with debug run recording. Each chat turn, compaction, and title generation now creates debug runs with steps that capture normalized request/response data for diagnostics.
This is PR 5/9 in the chat debug logging stack.
Changes
coderd/x/chatd/chatd.go): injectschatdebug.Serviceinto the chat server, createschat_turndebug runs per processing cycle, and wires lifecycle cleanup intoEditMessage()andArchiveChat().coderd/x/chatd/chatloop/chatloop.go): creates logical debug steps per model round-trip withinRun(), propagates debug context into the wrapped model, and records retries as attempts inside the same step.coderd/x/chatd/chatloop/compaction.go): instrumentsgenerateCompactionSummary()with acompactiondebug run.coderd/x/chatd/quickgen.go): instruments title generation with atitle_generationdebug run.coderd/x/chatd/chatprovider/chatprovider.go): wires the recording HTTP transport and model wrapper when debug logging is enabled.Stack overview
Generated with
mux• Model:anthropic:claude-opus-4-6• Thinking:xhigh