fix(agent): re-emit lifecycle state on reconnect to new agent ID by wyrickre · Pull Request #25406 · coder/coder

wyrickre · 2026-05-15T20:27:05Z

Maintainer note: I should have announced intent on #18571 before coding per CONTRIBUTING. Flagging this early so review cycles aren't wasted on a fait accompli; if the design direction is wrong (e.g. you'd prefer a server-side fix, or a different signal than manifest.AgentID), happy to rework or close. The PR touches lifecycleLastReportedIndex semantics that TestAgent_ReconnectNoLifecycleReemit was added to assert; the change is gated on AgentID-change so same-row reconnects still don't re-emit, but the contract shift is worth a deliberate yes/no.

Rev2 (2026-05-19): addressed the coder-agents-review CHANGES_REQUESTED. P1 (data race on lifecycleLastReportedIndex): moved the rewind from resendLastLifecycleState into reportLifecycle and gated it on an atomic.Bool flag, so lifecycleLastReportedIndex stays single-writer. P2 (stale ChangedAt on replayed state): reporter refreshes the replayed entry's ChangedAt = dbtime.Now() before the rewind. P3 fixes: added Info log on the AgentID-change branch; SetManifestAgentID helpers are now exercised by a new e2e test (DEREM-2 dead code is now live). All six nits in agent_internal_test.go resolved. Merged coder:main (no rebase, no force-push).

Rev3 (2026-05-20): addresses the second coder-agents-review pass. DEREM-12 (P2, "flag consumed before RPC success"): the outer reportLifecycle loop now checks for pending work or a set resend flag before blocking on lifecycleUpdate, so a failed UpdateLifecycle followed by a same-AgentID reconnect no longer leaves the reporter blocked on a channel nobody will signal again. DEREM-13 (P3, "single-state replay collapses startedAt/readyAt"): the rewind now targets index 0 (not len-2) and refreshes ChangedAt on every post-Created entry with 1ms spacing, so the inner loop replays the full Starting -> Ready chain and the new row records distinct started_at and ready_at. DEREM-14/15/16 (nits): dropped the DEREM-6 review-artifact reference, trimmed single-writer restatements outside the canonical field doc, and reworded the test comment to match its context-guarded mock. Also fixed a flake in the new internal test: it had been synchronizing on lifecycleReported, which uses a drop-older-keep-latest pattern, so per-iteration counting was non-deterministic; sync barrier is now ctx.cancel + <-reporterDone (20x stress under -race clean). Merged latest coder:main to pick up the deletion of useWorkspaceCreationWatcher.ts, which had a stray emdash that was making lint/emdash flake against my branch.

Summary

Fixes #18571, "Workspaces occasionally get stuck in starting phase."

The agent tracks lifecycle transitions per-process via lifecycleLastReportedIndex, but the server tracks lifecycle per workspace_agent row and initializes every new row to created. When the agent process survives a build boundary (e.g. GCE suspend/resume on long-lived VMs), RefreshToken correctly reconnects to the new agent row, but the lifecycle reporter never re-emits transitions and the new row stays at created forever. With startup_script_behavior = "blocking", coder ssh hangs indefinitely on every subsequent build.

Mechanism

Build chore: Initial GHA workflow #1: agent emits created -> starting -> ready against row A. Workspace usable.
coder stop creates build chore: Initial database scaffolding #2 (transition=stop). GCE suspends the VM; agent process frozen mid-flight.
coder start creates build chore: Add golangci-lint and codecov #3 (transition=start). GCE resumes; agent thaws, RefreshToken exchanges instance identity for a new session token bound to row B (newly inserted by the provisioner for build chore: Add golangci-lint and codecov #3).
Agent reconnects with the new token. Server-side row B is at lifecycle_state="created", started_at=null, ready_at=null. Agent's in-memory lifecycleLastReportedIndex still points at index 2 (ready was the last thing reported, to row A). reportLifecycle's loop sees nothing new and emits nothing. Row B is stuck.

Empirical confirmation (build numbers, lifecycle_state values, agent process start times): see issue comment.

Fix

handleManifest is invoked on every new RPC connection and fetches a manifest whose AgentID field is the server-side workspace_agent.ID for the current build. Comparing the new manifest's AgentID against the previously-stored manifest's AgentID is a precise signal for "the workspace_agent row changed across this reconnect." When it changes, request a lifecycle replay so the reporter pushes the most recent state to the new row on its next wake-up.

Four changes in agent/agent.go:

New agent.lifecycleResendRequested atomic.Bool field. resendLastLifecycleState() is a 4-liner: Store(true) on the flag plus a non-blocking send on lifecycleUpdate. It does NOT mutate lifecycleLastReportedIndex; that field stays single-writer, only touched by reportLifecycle.
reportLifecycle's outer loop checks for pending work or a set resend flag before blocking on lifecycleUpdate (the pending-work guard recovers cleanly from RPC failure mid-replay). When the resend flag is set, it acquires lifecycleMu.Lock(), refreshes lifecycleStates[i].ChangedAt for every post-Created entry with 1ms spacing (now, now+1ms, now+2ms, ...), and sets lifecycleLastReportedIndex = 0 so the inner loop replays the full Starting -> terminal chain. The 1ms spacing makes the server record distinct started_at and ready_at instead of collapsing them to the same instant.
In handleManifest, after a.manifest.Swap(&manifest): if oldManifest != nil && oldManifest.AgentID != uuid.Nil && oldManifest.AgentID != manifest.AgentID, log the change and call resendLastLifecycleState(). Same-AgentID reconnects are untouched.
Tests covering the helper, the trigger, the reporter's resend handler, and an end-to-end reconnect: see below.

Backward compatibility

Server-side UpdateLifecycle (coderd/agentapi/lifecycle.go:110) writes unconditionally, so a replayed Ready does produce a DB write on the new workspace_agent row and overwrites ready_at. The replayed Starting carries a fresh ChangedAt 1ms earlier; the server's Starting branch sets started_at = changedAt unconditionally (lifecycle.go:96-97), and the subsequent Ready leaves it alone (the if !startedAt.Valid guard is on the Ready/StartTimeout/StartError branch, not on Starting). Net: a coherent started_at that precedes ready_at by 1ms. The new row records a 1ms startup phase for this build, which truthfully reflects "the agent attached and was immediately ready, no setup work" (the prior build did the actual work). The build-duration metric is guarded by sync.Once per agent connection (the emitMetricsOnce lives on the per-connection LifecycleAPI struct), so no duplicate emissions.

The existing TestAgent_ReconnectNoLifecycleReemit invariant (no re-emit on a transient reconnect within the same build) is preserved; it uses a single agentID throughout, so the new condition is false and resendLastLifecycleState is not called. A dedicated SameAgentID_DoesNotSetFlag subtest in TestHandleManifestResendsLifecycleOnNewAgentID asserts this explicitly.

Tests

TestResendLastLifecycleState: unit tests for the helper's new contract; sets lifecycleResendRequested, signals the channel, non-blocking when the channel is full.
TestHandleManifestResendsLifecycleOnNewAgentID: integration test using a minimal fake proto.DRPCAgentClient28. Asserts the flag is set on AgentID change and NOT set on same-AgentID reconnect, plus handleManifest never writes lifecycleLastReportedIndex.
TestReportLifecycleRewindReplaysAndRefreshes: internal test exercising the reporter's resend handler with a captured UpdateLifecycle request stream. Asserts the replay begins with Starting (so the server sets started_at), ends with Ready (so the server sets ready_at), and Ready.ChangedAt is strictly after Starting.ChangedAt. Validates the DEREM-13 fix.
TestAgent_ReconnectReplaysLifecycleOnNewAgentID: end-to-end test using Client.SetManifestAgentID to simulate a build boundary. Real agent, fake coordinator; swaps the manifest's AgentID, forces a reconnect, asserts both Starting and Ready counts grow. Passes under -race, exercising the concurrent path that would detect a regression of P1 (DEREM-3).
Added (*FakeAgentAPI).SetManifestAgentID and (*Client).SetManifestAgentID to agent/agenttest/client.go. Used by the e2e test above.

Scope

Limited to templates that preserve the agent process across stop/start (long-lived VMs with desired_status="SUSPENDED"). Ephemeral pod templates spawn a fresh agent process on every start and don't hit this; their new agent runs startup_script fresh and emits transitions through the normal path.

Test plan

gofmt -l agent/ clean
go vet ./agent/... clean
golangci-lint v1.64.8 run ./agent/ clean (locally with GOTOOLCHAIN=go1.26.2)
paralleltestctx and intxcheck clean
go test -race -count=5 -run 'TestResendLastLifecycleState|TestHandleManifestResendsLifecycleOnNewAgentID|TestReportLifecycleRewindReplaysAndRefreshes|TestAgent_ReconnectReplaysLifecycleOnNewAgentID' ./agent/ passes (~29s wall, no flakes)
go test -race -count=20 -run '^TestReportLifecycleRewindReplaysAndRefreshes$' ./agent/ passes (20/20 after the rev3 sync-barrier fix)
TestAgent_ReconnectNoLifecycleReemit regression check: same env-specific flake on pristine origin/main, so not a regression
Maintainer-side: full CI on the branch
Manual: reproduce stuck state on a GCE suspend-template workspace, apply this build of the agent, verify post-resume lifecycle_state advances to ready on the new build

AI involvement

Per coder/coder's AI contribution guidelines: this PR was drafted with AI assistance (Claude Code).

Investigation, diagnosis, and code: AI-assisted. The mechanism above was identified by reading agent/agent.go, coderd/agentapi/manifest.go, and coderd/agentapi/lifecycle.go against the empirical confirmation in the linked issue comment. The patch, the helper, the integration test, and the test fakes were AI-drafted, then reviewed and run locally.
Empirical evidence: human-generated. The build-number / lifecycle_state / agent-process-start-time data referenced in the linked issue comment was collected by hand on a real stuck workspace (rwyrick-debug) before any code was written.
Verification: go vet, gofmt, golangci-lint, and the new + adjacent unit tests were run locally; see the test plan above. End-to-end validation against a live GCE-suspended workspace is still owed (the unchecked manual item in the test plan); happy to do that on request before merge, or maintainers can drive it through their own dogfood path.
Accountability: I'm responsible for the content of this PR, regardless of how it was generated.

The agent tracks lifecycle transitions per-process via `lifecycleLastReportedIndex`, but the server tracks lifecycle per workspace_agent row and initializes every new row to "created". When the agent process survives a build boundary (e.g. GCE suspend/resume on long-lived VMs), `RefreshToken` correctly reconnects to the new agent row, but the lifecycle reporter never re-emits transitions and the new row stays at "created" forever -- `coder ssh` with `startup_script_behavior = "blocking"` hangs indefinitely. Compare `oldManifest.AgentID` to the freshly-fetched `manifest.AgentID` in `handleManifest`. When they differ, rewind `lifecycleLastReportedIndex` so the next `reportLifecycle` wake-up replays the most recent state to the new row. Same-AgentID reconnects are untouched, preserving the invariant covered by `TestAgent_ReconnectNoLifecycleReemit`. Fixes coder#18571

github-actions · 2026-05-15T20:27:18Z

All contributors have signed the CLA ✍️ ✅
_{Posted by the CLA Assistant Lite bot.}

wyrickre · 2026-05-15T20:39:32Z

I have read the CLA Document and I hereby sign the CLA

matifali · 2026-05-19T05:22:06Z

/coder-agents-review

coder-agents-review

Well-scoped fix for a subtle cross-build lifecycle drift. The diagnosis is precise, the mechanism is correct, the scope limitation (suspend/resume on long-lived VMs) is honestly documented, and the test plan distinguishes human-collected evidence from AI-generated code. The backwards compatibility analysis correctly identifies the sync.Once metric guard and the idempotency of re-sends. Good work.

One P1 (data race that can silently negate the fix), two P2 (missing integration test, stale timestamps on replayed state), two P3 (missing log, dead code), and six nits.

The P1 is the center of gravity: 9 of 14 reviewers independently found the same data race on lifecycleLastReportedIndex. resendLastLifecycleState writes under lifecycleMu.Lock(), but reportLifecycle reads and writes the same field without any lock. A concurrent lifecycle transition during reconnect (e.g. Close() triggering setLifecycle(ShuttingDown)) creates a window where the reporter's ++ overwrites the rewind, silently causing the exact bug this PR fixes. The cleanest fix preserves the single-writer invariant: use an atomic flag to request the rewind, and let reportLifecycle perform it.

Process note: the PR description claims "re-sending ready to a row that's already ready is a no-op." The server's UpdateWorkspaceAgentLifecycleStateByID (lifecycle.go:110) writes unconditionally. It's harmless but not a no-op; the new row gets a DB write and the readyAt timestamp is overwritten. The conclusion (backward compatibility holds) is correct, but the cited evidence is wrong.

"Before this PR, lifecycleLastReportedIndex was only touched by the reporter goroutine after initialization. Single-writer meant no race. This PR adds resendLastLifecycleState as a second writer, called from the handleManifest goroutine. The lock in resendLastLifecycleState is useless because reportLifecycle never checks it for those accesses." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Addresses the coder-agents-review CHANGES_REQUESTED on coder#25406. P1 (DEREM-3): The previous patch introduced a data race on `lifecycleLastReportedIndex`. `resendLastLifecycleState` wrote the field under `lifecycleMu.Lock()`, but `reportLifecycle` reads and writes the same field without any lock. A concurrent `setLifecycle(ShuttingDown)` during reconnect could cause the reporter's `++` to overwrite the rewind, silently negating the fix. Replace the lock-protected write with an `atomic.Bool` (`lifecycleResendRequested`). `resendLastLifecycleState` now only sets the flag and signals the lifecycle channel; the actual rewind happens at the top of `reportLifecycle`'s outer loop, keeping that goroutine the sole writer of `lifecycleLastReportedIndex`. P2 (DEREM-6): When the reporter consumes the resend flag it now refreshes `lifecycleStates[len-1].ChangedAt` to `dbtime.Now()` before the rewind. This prevents the new workspace_agent row from receiving the original build's stale wall clock as its `started_at` and `ready_at`, which would have produced negative `build_duration` histogram samples and predates-`created_at` timestamps on every suspend/resume workspace. P3 (DEREM-5): Add an Info-level log on the AgentID-change branch in `handleManifest` so operators have a breadcrumb when the replay fires. P3 (DEREM-2): `Client.SetManifestAgentID` and the matching `FakeAgentAPI` method are now used by the new e2e test, killing the dead-code finding. Tests: - `TestResendLastLifecycleState` simplified to cover the helper's new contract (set flag + signal channel). - `TestHandleManifestResendsLifecycleOnNewAgentID` updated to assert the flag instead of the index. - `TestReportLifecycleRewindRefreshesChangedAt` new internal test exercising the reporter's rewind path and the timestamp refresh, validating P2. - `TestAgent_ReconnectReplaysLifecycleOnNewAgentID` new end-to-end test using `SetManifestAgentID` to simulate a build boundary; passes under `-race`, exercising the concurrent path that would detect a regression of P1. Also fixes review nits 1, 7, 8, 9, 10, 11 in `agent_internal_test.go` (em-dash punctuation, missing channel-signal assertion, `t.Context()`, `manifestOK` rename, `newIntegrationAgent` rename, behavior-describing subtest names).

…eemit-new-agent-id # Conflicts: # agent/agent_internal_test.go

wyrickre · 2026-05-19T22:43:30Z

Rev2 pushed in 477bf0b + merge 5a8d001. Addressing the coder-agents-review findings:

DEREM	Pri	Status	Fix
3	P1	Addressed; left open for design feedback	New `lifecycleResendRequested atomic.Bool`. `resendLastLifecycleState` only sets the flag + signals the channel; the actual rewind happens at the top of `reportLifecycle`'s outer loop under `lifecycleMu.Lock()`. `lifecycleLastReportedIndex` stays single-writer (only `reportLifecycle` touches it).
6	P2	Addressed; left open for design feedback	When the reporter consumes the resend flag it now sets `lifecycleStates[len-1].ChangedAt = dbtime.Now()` under the lock before the rewind. The replay arrives at the new row with a timestamp coherent with its `created_at`, so `started_at`/`ready_at` no longer go negative and the `coderd_agents_build_duration_seconds` histogram stays clean.
4	P2	Resolved	Two new tests. `TestReportLifecycleRewindRefreshesChangedAt` (internal) runs `reportLifecycle` against a captured `UpdateLifecycle` and asserts the replayed `Ready` carries a fresh `ChangedAt`. `TestAgent_ReconnectReplaysLifecycleOnNewAgentID` (external) runs the full agent with a `FakeCoordinator`, swaps `manifest.AgentID` mid-flight, forces a reconnect, and asserts `Ready` is re-reported. Passes under `-race` (5x stress).
5	P3	Resolved	`a.logger.Info(ctx, "agent ID changed across reconnect, replaying lifecycle state", slog.F("old_agent_id", ...), slog.F("new_agent_id", ...))` added at the trigger point in `handleManifest`.
2	P3	Resolved	`Client.SetManifestAgentID` and `FakeAgentAPI.SetManifestAgentID` are now used by the new `TestAgent_ReconnectReplaysLifecycleOnNewAgentID`. No longer dead code.
1	Nit	Resolved	The `--` punctuation in the `fakeAgentAPIClient` comment is now "which is fine. The test only exercises..." per AGENTS.md.
7	Nit	Resolved	`TestResendLastLifecycleState` was simplified to the helper's new contract; the architectural-detail subtests (including the `ShuttingDown` one that skipped the channel check) are superseded by `TestReportLifecycleRewindRefreshesChangedAt`, which covers behavior across all states with consistent assertions.
8	Nit	Resolved	`newIntegrationAgent` uses `t.Context()` instead of `context.WithCancel(context.Background())` + `t.Cleanup`.
9	Nit	Resolved	`mok` → `manifestOK` (matches the production variable name in `handleManifest`).
10	Nit	Resolved	`mkAgent` → `newIntegrationAgent`, consistent with the Go constructor convention used elsewhere.
11	Nit	Resolved	Subtest names rewritten to describe behavior: `NewAgentID_SetsResendFlag`, `SameAgentID_DoesNotSetFlag`, `TestReportLifecycleRewindRefreshesChangedAt`.

Process note from the review summary (separate from inline comments): the original PR description claimed "re-sending Ready to a row that's already Ready is a no-op." That was wrong — UpdateWorkspaceAgentLifecycleStateByID writes unconditionally. The PR description has been corrected; the backward-compatibility conclusion still holds because (after DEREM-6) the overwrite uses a coherent timestamp.

Verification locally:

go vet, gofumpt/gci, golangci-lint v1.64.8 (with GOTOOLCHAIN=go1.26.2), paralleltestctx, intxcheck all clean against ./agent/.
All new + adjacent lifecycle tests pass under -race, including a 5x stress run of the two race-sensitive tests.
coder:main merged (not rebased), no force-push.

Happy to iterate further on the P1/P2 design if the maintainer prefers a different approach — left those threads open intentionally.

matifali · 2026-05-20T06:04:21Z

/coder-agents-review

coder-agents-review

All 11 R1 findings addressed cleanly. The atomic.Bool flag pattern correctly resolves the P1 data race (DEREM-3) by preserving the single-writer invariant on lifecycleLastReportedIndex. The timestamp refresh (DEREM-6) makes the new row's started_at/ready_at coherent with created_at. The new e2e test (TestAgent_ReconnectReplaysLifecycleOnNewAgentID) exercises the concurrent path under -race. Thorough response.

One new P2, one P3, three nits.

The P2 is a design-level observation: Swap(false) at line 842 consumes the resend flag before the RPC succeeds. If UpdateLifecycle fails (connection drops), reportLifecycle returns an error and run() restarts. On the new connection, handleManifest sees the same AgentID (same build), does not re-trigger resend, and nobody signals lifecycleUpdate. The reporter blocks forever with a rewound index but no wake signal. This gap existed in rev1 too (channel signal was consumed the same way). It's worth knowing because the target scenario (suspend/resume) involves network instability, but it doesn't invalidate the fix for the primary path.

The P3 notes that replaying only the terminal state (Ready) means the new row gets startedAt == readyAt, producing 0s startup duration in cliui.Agent. This is a known tradeoff of single-state replay; replaying the full sequence would require protocol changes.

Process note: the PR description's "sync.Once per agent process" should say "per agent connection" since emitMetricsOnce lives on the server-side LifecycleAPI struct, which is per-connection.

"If UpdateLifecycle fails after Swap(false) consumes the flag, the next reconnect sees oldManifest.AgentID == manifest.AgentID, never re-triggers resend, and the new row stays at created." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

…eemit-new-agent-id

Address rev2 review feedback on coder#25406: - DEREM-12 (P2): Guard the outer reportLifecycle loop on pending work and the resend flag, so a failed UpdateLifecycle followed by a same-AgentID reconnect doesn't leave the reporter blocked on a channel nobody will signal again. - DEREM-13 (P3): Rewind to index 0 (not len-2) so the replay covers Starting and the terminal state, and refresh ChangedAt on every replayed entry with 1ms spacing. The new workspace_agent row now records distinct started_at and ready_at, eliminating the startedAt == readyAt degeneracy that displayed 0s startup duration. - DEREM-14, 15, 16 (nits): drop the DEREM-6 review-artifact comment, trim single-writer restatements outside the canonical field doc, and reword the test comment to match its context-guarded mock. Also fix a flake in the new internal test: it synchronized on lifecycleReported, but the reporter's send pattern there is drop-older-keep-latest, so per-iteration counting was non-deterministic. The sync barrier is now ctx.cancel + reporterDone, which is what the reporter actually guarantees.

wyrickre · 2026-05-20T09:59:49Z

Rev3 pushed in 78ef441 (commits on top of the rebased rev2). Status table for the second coder-agents-review pass:

DEREM	Pri	Status	Fix
12	P2	Addressed; left open for design feedback	The outer `reportLifecycle` loop now reads `pendingWork = lifecycleLastReportedIndex < len(lifecycleStates)-1` under RLock and checks `lifecycleResendRequested.Load()` before blocking on `lifecycleUpdate`. If either is true, the select is skipped so the inner loop drains the pending work immediately. This recovers from the failure mode the review described: a failed `UpdateLifecycle` followed by a same-AgentID reconnect no longer leaves the reporter wedged on a channel nobody will signal again. Chose this over "re-set the flag on RPC failure" because it also covers the case where the rewind already happened but the RPC failed mid-replay (the index alone is enough evidence of pending work).
13	P3	Addressed; left open for design feedback	Rewind now targets index 0 (not `len-2`), and `ChangedAt` is refreshed on every post-`Created` entry with 1ms spacing (`now`, `now+1ms`, `now+2ms`, ...). The inner loop replays the full `Starting → terminal` chain. Result on the new row: `started_at = T+0`, `ready_at = T+1ms`, so `started_at < ready_at` and `cliui.Agent` shows a coherent (if tiny) startup duration rather than 0s. The 1ms is the minimum interval that survives DB timestamp precision; treating it as "this build did no setup work" is the truthful framing. Build-duration metric is unaffected (still `readyAt - build.CreatedAt` ≈ 0s, which is correct: this build genuinely did no setup work).
14	Nit	Resolved	Dropped the `DEREM-6 in the coder-agents-review on coder/coder#25406` reference from the test comment; the surrounding sentence stands on its own.
15	Nit	Resolved	Single-writer invariant statement is now only on the canonical `lifecycleLastReportedIndex` field comment. Trimmed from `resendLastLifecycleState`'s docstring, from the reporter's resend-block comment, and from the e2e test's docstring.
16	Nit	Resolved	Test comment now reads "context-guarded on send" to match the `select { case f.captured <- req: case <-ctx.Done(): ... }` mock pattern.

Also rolled in:

Test-flake fix. The internal test for the rewind path (TestReportLifecycleRewindReplaysAndRefreshes, renamed from ...RefreshesChangedAt) had been synchronizing on a.lifecycleReported. The reporter's send pattern there is drop-older-keep-latest (the select picks randomly between the send and the drain-and-send branches when both are ready), so per-iteration counting on the test side was ~80% reliable under -race. Sync barrier is now cancel(); <-reporterDone, which is what the reporter actually guarantees. 20x stress under -race clean.
Process note on backward-compatibility wording. The "build-duration metric is guarded by sync.Once per agent process" line in the PR description was inaccurate (emitMetricsOnce lives on the per-connection LifecycleAPI struct, not on the agent process). PR description fixed.
R1 follow-ups. The two threads I'd intentionally left open after rev2 (DEREM-3 P1 race, DEREM-6 P2 timestamp) are now resolved per "All 11 R1 findings addressed cleanly" in the rev2 of the review.
Latest coder:main merged (no rebase, no force-push). The merge brought in deletion of useWorkspaceCreationWatcher.ts, which had an emdash that was making lint/emdash false-fail against this branch.

Local verification:

gofmt -l agent/, go vet ./agent/..., golangci-lint v1.64.8 run ./agent/ (with GOTOOLCHAIN=go1.26.2), paralleltestctx, intxcheck: all clean.
go test -race -count=5 -run 'TestResendLastLifecycleState|TestHandleManifestResendsLifecycleOnNewAgentID|TestReportLifecycleRewindReplaysAndRefreshes|TestAgent_ReconnectReplaysLifecycleOnNewAgentID' ./agent/: passes, ~29s wall.
go test -race -count=20 -run '^TestReportLifecycleRewindReplaysAndRefreshes$' ./agent/: 20/20 after the sync-barrier fix.
TestAgent_ReconnectNoLifecycleReemit still fails on this workstation, but reproduces against pristine origin/main, so it's the documented environmental flake (Linux OsFs / system SSH on this Cloud Workstation), not a regression.

Happy to iterate further on the DEREM-12 / DEREM-13 design choices if the maintainer prefers a different shape. Both threads intentionally left open for that conversation.

wyrickre · 2026-05-20T10:02:11Z

/coder-agents-review

matifali · 2026-05-20T12:43:14Z

/coder-agents-review

coder-agents-review

Three rounds, 16 findings (1 P1, 3 P2, 3 P3, 9 Nit), all addressed. The fix is clean.

R1 caught a P1 data race (9/14 reviewers converged), a P2 test gap, a P2 stale-timestamp issue, and assorted nits. R2 found a P2 flag-consumption edge case and a P3 from single-state replay. R3: 13 reviewers, zero code-level findings. Every prior finding verified fixed.

The final shape is solid: atomic.Bool flag preserves single-writer on lifecycleLastReportedIndex, pending-work guard in the outer loop recovers from mid-replay RPC failure, full sequence replay with 1ms-spaced timestamps gives the new row coherent started_at/ready_at, and the e2e test exercises the concurrent path under -race.

Minor PR description note: the backward-compatibility section says the server "set started_at only if null" for the Starting replay. The server code (lifecycle.go:96-97) sets startedAt unconditionally on a Starting transition. The code comments in agent.go are accurate; only the prose in the PR description is slightly off. Not worth blocking on.

"The process quality is high; the work is genuine." (Mafu-san)

🤖 This review was automatically generated with Coder Agents.

wyrickre · 2026-05-20T18:05:48Z

PR description updated to fix the cosmetic item from the R3 review.

The Backward compatibility section previously said the server "set started_at only if null" on the replayed Starting. Per coderd/agentapi/lifecycle.go:95-97, the server's Starting branch sets startedAt = dbChangedAt unconditionally; the if !startedAt.Valid guard is on the Ready/StartTimeout/StartError branch. Net effect on the new row is the same (coherent started_at 1ms before ready_at), but the path is now described accurately.

wyrickre · 2026-05-22T20:43:18Z

@matifali - What are my next steps to get this mergeable ?

matifali · 2026-05-26T09:58:21Z

@wyrickre Thanks for your patience. I have shared that internally for rebiew. It might take a few days for us to review this

deansheather · 2026-05-26T10:10:04Z

I think a better approach would be to just completely reinitialize the agent after detecting the agent ID changing. With this patch, if changes were made to the agent manifest in the terraform, they probably wouldn't be applied by the old agent.

We have a system for reinitializing the agent in the agent code already, which gets triggered by a prebuild claim, but we could use similar logic here to force a reinitialization if the agent ID has changed upon reconnect.

github-actions Bot added the community Pull Requests and issues created by the community. label May 15, 2026

github-actions Bot assigned wyrickre May 15, 2026

cdrci2 added a commit to coder/cla that referenced this pull request May 15, 2026

@wyrickre has signed the CLA in coder/coder#25406

556ff65

wyrickre removed their assignment May 15, 2026

wyrickre mentioned this pull request May 15, 2026

bug: Workspaces occasionally get stuck in starting phase #18571

Open

1 task

wyrickre marked this pull request as draft May 18, 2026 19:18

wyrickre marked this pull request as ready for review May 18, 2026 19:19

coder-agents-review Bot suggested changes May 19, 2026

View reviewed changes

wyrickre added 2 commits May 19, 2026 16:24

Merge remote-tracking branch 'origin/main' into fix/18571-lifecycle-r…

d5596f8

…eemit-new-agent-id # Conflicts: # agent/agent_internal_test.go

coder-agents-review Bot reviewed May 20, 2026

View reviewed changes

Comment thread agent/agent.go

Comment thread agent/agent.go Outdated

Comment thread agent/agent_internal_test.go Outdated

Comment thread agent/agent.go Outdated

Comment thread agent/agent_internal_test.go Outdated

wyrickre added 2 commits May 20, 2026 03:12

Merge remote-tracking branch 'origin/main' into fix/18571-lifecycle-r…

17e1fe3

…eemit-new-agent-id

coder-agents-review Bot approved these changes May 20, 2026

View reviewed changes

github-actions Bot added the stale This issue is like stale bread. label Jun 3, 2026

github-actions Bot closed this Jun 7, 2026

Conversation

wyrickre commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Mechanism

Fix

Backward compatibility

Tests

Scope

Test plan

AI involvement

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wyrickre commented May 15, 2026

Uh oh!

matifali commented May 19, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wyrickre commented May 19, 2026

Uh oh!

matifali commented May 20, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wyrickre commented May 20, 2026

Uh oh!

wyrickre commented May 20, 2026

Uh oh!

matifali commented May 20, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

wyrickre commented May 20, 2026

Uh oh!

wyrickre commented May 22, 2026

Uh oh!

matifali commented May 26, 2026

Uh oh!

deansheather commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wyrickre commented May 15, 2026 •

edited

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading