fix(agent): re-emit lifecycle state on reconnect to new agent ID#25406
fix(agent): re-emit lifecycle state on reconnect to new agent ID#25406wyrickre wants to merge 5 commits into
Conversation
The agent tracks lifecycle transitions per-process via `lifecycleLastReportedIndex`, but the server tracks lifecycle per workspace_agent row and initializes every new row to "created". When the agent process survives a build boundary (e.g. GCE suspend/resume on long-lived VMs), `RefreshToken` correctly reconnects to the new agent row, but the lifecycle reporter never re-emits transitions and the new row stays at "created" forever -- `coder ssh` with `startup_script_behavior = "blocking"` hangs indefinitely. Compare `oldManifest.AgentID` to the freshly-fetched `manifest.AgentID` in `handleManifest`. When they differ, rewind `lifecycleLastReportedIndex` so the next `reportLifecycle` wake-up replays the most recent state to the new row. Same-AgentID reconnects are untouched, preserving the invariant covered by `TestAgent_ReconnectNoLifecycleReemit`. Fixes coder#18571
|
All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
|
/coder-agents-review |
There was a problem hiding this comment.
Well-scoped fix for a subtle cross-build lifecycle drift. The diagnosis is precise, the mechanism is correct, the scope limitation (suspend/resume on long-lived VMs) is honestly documented, and the test plan distinguishes human-collected evidence from AI-generated code. The backwards compatibility analysis correctly identifies the sync.Once metric guard and the idempotency of re-sends. Good work.
One P1 (data race that can silently negate the fix), two P2 (missing integration test, stale timestamps on replayed state), two P3 (missing log, dead code), and six nits.
The P1 is the center of gravity: 9 of 14 reviewers independently found the same data race on lifecycleLastReportedIndex. resendLastLifecycleState writes under lifecycleMu.Lock(), but reportLifecycle reads and writes the same field without any lock. A concurrent lifecycle transition during reconnect (e.g. Close() triggering setLifecycle(ShuttingDown)) creates a window where the reporter's ++ overwrites the rewind, silently causing the exact bug this PR fixes. The cleanest fix preserves the single-writer invariant: use an atomic flag to request the rewind, and let reportLifecycle perform it.
Process note: the PR description claims "re-sending ready to a row that's already ready is a no-op." The server's UpdateWorkspaceAgentLifecycleStateByID (lifecycle.go:110) writes unconditionally. It's harmless but not a no-op; the new row gets a DB write and the readyAt timestamp is overwritten. The conclusion (backward compatibility holds) is correct, but the cited evidence is wrong.
"Before this PR,
lifecycleLastReportedIndexwas only touched by the reporter goroutine after initialization. Single-writer meant no race. This PR addsresendLastLifecycleStateas a second writer, called from thehandleManifestgoroutine. The lock inresendLastLifecycleStateis useless becausereportLifecyclenever checks it for those accesses." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
Addresses the coder-agents-review CHANGES_REQUESTED on coder#25406. P1 (DEREM-3): The previous patch introduced a data race on `lifecycleLastReportedIndex`. `resendLastLifecycleState` wrote the field under `lifecycleMu.Lock()`, but `reportLifecycle` reads and writes the same field without any lock. A concurrent `setLifecycle(ShuttingDown)` during reconnect could cause the reporter's `++` to overwrite the rewind, silently negating the fix. Replace the lock-protected write with an `atomic.Bool` (`lifecycleResendRequested`). `resendLastLifecycleState` now only sets the flag and signals the lifecycle channel; the actual rewind happens at the top of `reportLifecycle`'s outer loop, keeping that goroutine the sole writer of `lifecycleLastReportedIndex`. P2 (DEREM-6): When the reporter consumes the resend flag it now refreshes `lifecycleStates[len-1].ChangedAt` to `dbtime.Now()` before the rewind. This prevents the new workspace_agent row from receiving the original build's stale wall clock as its `started_at` and `ready_at`, which would have produced negative `build_duration` histogram samples and predates-`created_at` timestamps on every suspend/resume workspace. P3 (DEREM-5): Add an Info-level log on the AgentID-change branch in `handleManifest` so operators have a breadcrumb when the replay fires. P3 (DEREM-2): `Client.SetManifestAgentID` and the matching `FakeAgentAPI` method are now used by the new e2e test, killing the dead-code finding. Tests: - `TestResendLastLifecycleState` simplified to cover the helper's new contract (set flag + signal channel). - `TestHandleManifestResendsLifecycleOnNewAgentID` updated to assert the flag instead of the index. - `TestReportLifecycleRewindRefreshesChangedAt` new internal test exercising the reporter's rewind path and the timestamp refresh, validating P2. - `TestAgent_ReconnectReplaysLifecycleOnNewAgentID` new end-to-end test using `SetManifestAgentID` to simulate a build boundary; passes under `-race`, exercising the concurrent path that would detect a regression of P1. Also fixes review nits 1, 7, 8, 9, 10, 11 in `agent_internal_test.go` (em-dash punctuation, missing channel-signal assertion, `t.Context()`, `manifestOK` rename, `newIntegrationAgent` rename, behavior-describing subtest names).
…eemit-new-agent-id # Conflicts: # agent/agent_internal_test.go
|
Rev2 pushed in 477bf0b + merge 5a8d001. Addressing the
Process note from the review summary (separate from inline comments): the original PR description claimed "re-sending Ready to a row that's already Ready is a no-op." That was wrong — Verification locally:
Happy to iterate further on the P1/P2 design if the maintainer prefers a different approach — left those threads open intentionally. |
|
/coder-agents-review |
There was a problem hiding this comment.
All 11 R1 findings addressed cleanly. The atomic.Bool flag pattern correctly resolves the P1 data race (DEREM-3) by preserving the single-writer invariant on lifecycleLastReportedIndex. The timestamp refresh (DEREM-6) makes the new row's started_at/ready_at coherent with created_at. The new e2e test (TestAgent_ReconnectReplaysLifecycleOnNewAgentID) exercises the concurrent path under -race. Thorough response.
One new P2, one P3, three nits.
The P2 is a design-level observation: Swap(false) at line 842 consumes the resend flag before the RPC succeeds. If UpdateLifecycle fails (connection drops), reportLifecycle returns an error and run() restarts. On the new connection, handleManifest sees the same AgentID (same build), does not re-trigger resend, and nobody signals lifecycleUpdate. The reporter blocks forever with a rewound index but no wake signal. This gap existed in rev1 too (channel signal was consumed the same way). It's worth knowing because the target scenario (suspend/resume) involves network instability, but it doesn't invalidate the fix for the primary path.
The P3 notes that replaying only the terminal state (Ready) means the new row gets startedAt == readyAt, producing 0s startup duration in cliui.Agent. This is a known tradeoff of single-state replay; replaying the full sequence would require protocol changes.
Process note: the PR description's "sync.Once per agent process" should say "per agent connection" since emitMetricsOnce lives on the server-side LifecycleAPI struct, which is per-connection.
"If
UpdateLifecyclefails afterSwap(false)consumes the flag, the next reconnect seesoldManifest.AgentID == manifest.AgentID, never re-triggers resend, and the new row stays atcreated." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
…eemit-new-agent-id
Address rev2 review feedback on coder#25406: - DEREM-12 (P2): Guard the outer reportLifecycle loop on pending work and the resend flag, so a failed UpdateLifecycle followed by a same-AgentID reconnect doesn't leave the reporter blocked on a channel nobody will signal again. - DEREM-13 (P3): Rewind to index 0 (not len-2) so the replay covers Starting and the terminal state, and refresh ChangedAt on every replayed entry with 1ms spacing. The new workspace_agent row now records distinct started_at and ready_at, eliminating the startedAt == readyAt degeneracy that displayed 0s startup duration. - DEREM-14, 15, 16 (nits): drop the DEREM-6 review-artifact comment, trim single-writer restatements outside the canonical field doc, and reword the test comment to match its context-guarded mock. Also fix a flake in the new internal test: it synchronized on lifecycleReported, but the reporter's send pattern there is drop-older-keep-latest, so per-iteration counting was non-deterministic. The sync barrier is now ctx.cancel + reporterDone, which is what the reporter actually guarantees.
|
Rev3 pushed in 78ef441 (commits on top of the rebased rev2). Status table for the second
Also rolled in:
Local verification:
Happy to iterate further on the DEREM-12 / DEREM-13 design choices if the maintainer prefers a different shape. Both threads intentionally left open for that conversation. |
|
/coder-agents-review |
1 similar comment
|
/coder-agents-review |
There was a problem hiding this comment.
Three rounds, 16 findings (1 P1, 3 P2, 3 P3, 9 Nit), all addressed. The fix is clean.
R1 caught a P1 data race (9/14 reviewers converged), a P2 test gap, a P2 stale-timestamp issue, and assorted nits. R2 found a P2 flag-consumption edge case and a P3 from single-state replay. R3: 13 reviewers, zero code-level findings. Every prior finding verified fixed.
The final shape is solid: atomic.Bool flag preserves single-writer on lifecycleLastReportedIndex, pending-work guard in the outer loop recovers from mid-replay RPC failure, full sequence replay with 1ms-spaced timestamps gives the new row coherent started_at/ready_at, and the e2e test exercises the concurrent path under -race.
Minor PR description note: the backward-compatibility section says the server "set started_at only if null" for the Starting replay. The server code (lifecycle.go:96-97) sets startedAt unconditionally on a Starting transition. The code comments in agent.go are accurate; only the prose in the PR description is slightly off. Not worth blocking on.
"The process quality is high; the work is genuine." (Mafu-san)
🤖 This review was automatically generated with Coder Agents.
|
PR description updated to fix the cosmetic item from the R3 review. The Backward compatibility section previously said the server "set |
|
@matifali - What are my next steps to get this mergeable ? |
|
@wyrickre Thanks for your patience. I have shared that internally for rebiew. It might take a few days for us to review this |
|
I think a better approach would be to just completely reinitialize the agent after detecting the agent ID changing. With this patch, if changes were made to the agent manifest in the terraform, they probably wouldn't be applied by the old agent. We have a system for reinitializing the agent in the agent code already, which gets triggered by a prebuild claim, but we could use similar logic here to force a reinitialization if the agent ID has changed upon reconnect. |
Summary
Fixes #18571, "Workspaces occasionally get stuck in starting phase."
The agent tracks lifecycle transitions per-process via
lifecycleLastReportedIndex, but the server tracks lifecycle perworkspace_agentrow and initializes every new row tocreated. When the agent process survives a build boundary (e.g. GCE suspend/resume on long-lived VMs),RefreshTokencorrectly reconnects to the new agent row, but the lifecycle reporter never re-emits transitions and the new row stays atcreatedforever. Withstartup_script_behavior = "blocking",coder sshhangs indefinitely on every subsequent build.Mechanism
created -> starting -> readyagainst row A. Workspace usable.coder stopcreates build chore: Initial database scaffolding #2 (transition=stop). GCE suspends the VM; agent process frozen mid-flight.coder startcreates build chore: Add golangci-lint and codecov #3 (transition=start). GCE resumes; agent thaws,RefreshTokenexchanges instance identity for a new session token bound to row B (newly inserted by the provisioner for build chore: Add golangci-lint and codecov #3).lifecycle_state="created",started_at=null,ready_at=null. Agent's in-memorylifecycleLastReportedIndexstill points at index 2 (readywas the last thing reported, to row A).reportLifecycle's loop sees nothing new and emits nothing. Row B is stuck.Empirical confirmation (build numbers, lifecycle_state values, agent process start times): see issue comment.
Fix
handleManifestis invoked on every new RPC connection and fetches a manifest whoseAgentIDfield is the server-sideworkspace_agent.IDfor the current build. Comparing the new manifest'sAgentIDagainst the previously-stored manifest'sAgentIDis a precise signal for "the workspace_agent row changed across this reconnect." When it changes, request a lifecycle replay so the reporter pushes the most recent state to the new row on its next wake-up.Four changes in
agent/agent.go:agent.lifecycleResendRequested atomic.Boolfield.resendLastLifecycleState()is a 4-liner:Store(true)on the flag plus a non-blocking send onlifecycleUpdate. It does NOT mutatelifecycleLastReportedIndex; that field stays single-writer, only touched byreportLifecycle.reportLifecycle's outer loop checks for pending work or a set resend flag before blocking onlifecycleUpdate(the pending-work guard recovers cleanly from RPC failure mid-replay). When the resend flag is set, it acquireslifecycleMu.Lock(), refresheslifecycleStates[i].ChangedAtfor every post-Createdentry with 1ms spacing (now,now+1ms,now+2ms, ...), and setslifecycleLastReportedIndex = 0so the inner loop replays the fullStarting -> terminalchain. The 1ms spacing makes the server record distinctstarted_atandready_atinstead of collapsing them to the same instant.handleManifest, aftera.manifest.Swap(&manifest): ifoldManifest != nil && oldManifest.AgentID != uuid.Nil && oldManifest.AgentID != manifest.AgentID, log the change and callresendLastLifecycleState(). Same-AgentID reconnects are untouched.Backward compatibility
Server-side
UpdateLifecycle(coderd/agentapi/lifecycle.go:110) writes unconditionally, so a replayedReadydoes produce a DB write on the newworkspace_agentrow and overwritesready_at. The replayedStartingcarries a freshChangedAt1ms earlier; the server'sStartingbranch setsstarted_at = changedAtunconditionally (lifecycle.go:96-97), and the subsequentReadyleaves it alone (theif !startedAt.Validguard is on theReady/StartTimeout/StartErrorbranch, not onStarting). Net: a coherentstarted_atthat precedesready_atby 1ms. The new row records a 1ms startup phase for this build, which truthfully reflects "the agent attached and was immediately ready, no setup work" (the prior build did the actual work). The build-duration metric is guarded bysync.Onceper agent connection (theemitMetricsOncelives on the per-connectionLifecycleAPIstruct), so no duplicate emissions.The existing
TestAgent_ReconnectNoLifecycleReemitinvariant (no re-emit on a transient reconnect within the same build) is preserved; it uses a singleagentIDthroughout, so the new condition is false andresendLastLifecycleStateis not called. A dedicatedSameAgentID_DoesNotSetFlagsubtest inTestHandleManifestResendsLifecycleOnNewAgentIDasserts this explicitly.Tests
TestResendLastLifecycleState: unit tests for the helper's new contract; setslifecycleResendRequested, signals the channel, non-blocking when the channel is full.TestHandleManifestResendsLifecycleOnNewAgentID: integration test using a minimal fakeproto.DRPCAgentClient28. Asserts the flag is set on AgentID change and NOT set on same-AgentID reconnect, plushandleManifestnever writeslifecycleLastReportedIndex.TestReportLifecycleRewindReplaysAndRefreshes: internal test exercising the reporter's resend handler with a capturedUpdateLifecyclerequest stream. Asserts the replay begins withStarting(so the server setsstarted_at), ends withReady(so the server setsready_at), andReady.ChangedAtis strictly afterStarting.ChangedAt. Validates the DEREM-13 fix.TestAgent_ReconnectReplaysLifecycleOnNewAgentID: end-to-end test usingClient.SetManifestAgentIDto simulate a build boundary. Real agent, fake coordinator; swaps the manifest'sAgentID, forces a reconnect, asserts bothStartingandReadycounts grow. Passes under-race, exercising the concurrent path that would detect a regression of P1 (DEREM-3).(*FakeAgentAPI).SetManifestAgentIDand(*Client).SetManifestAgentIDtoagent/agenttest/client.go. Used by the e2e test above.Scope
Limited to templates that preserve the agent process across stop/start (long-lived VMs with
desired_status="SUSPENDED"). Ephemeral pod templates spawn a fresh agent process on every start and don't hit this; their new agent runsstartup_scriptfresh and emits transitions through the normal path.Test plan
gofmt -l agent/cleango vet ./agent/...cleangolangci-lint v1.64.8 run ./agent/clean (locally withGOTOOLCHAIN=go1.26.2)paralleltestctxandintxcheckcleango test -race -count=5 -run 'TestResendLastLifecycleState|TestHandleManifestResendsLifecycleOnNewAgentID|TestReportLifecycleRewindReplaysAndRefreshes|TestAgent_ReconnectReplaysLifecycleOnNewAgentID' ./agent/passes (~29s wall, no flakes)go test -race -count=20 -run '^TestReportLifecycleRewindReplaysAndRefreshes$' ./agent/passes (20/20 after the rev3 sync-barrier fix)TestAgent_ReconnectNoLifecycleReemitregression check: same env-specific flake on pristineorigin/main, so not a regressionlifecycle_stateadvances toreadyon the new buildAI involvement
Per coder/coder's AI contribution guidelines: this PR was drafted with AI assistance (Claude Code).
agent/agent.go,coderd/agentapi/manifest.go, andcoderd/agentapi/lifecycle.goagainst the empirical confirmation in the linked issue comment. The patch, the helper, the integration test, and the test fakes were AI-drafted, then reviewed and run locally.lifecycle_state/ agent-process-start-time data referenced in the linked issue comment was collected by hand on a real stuck workspace (rwyrick-debug) before any code was written.go vet,gofmt,golangci-lint, and the new + adjacent unit tests were run locally; see the test plan above. End-to-end validation against a live GCE-suspended workspace is still owed (the unchecked manual item in the test plan); happy to do that on request before merge, or maintainers can drive it through their own dogfood path.