Skip to content

fix(agent): re-emit lifecycle state on reconnect to new agent ID#25406

Closed
wyrickre wants to merge 5 commits into
coder:mainfrom
wyrickre:fix/18571-lifecycle-reemit-new-agent-id
Closed

fix(agent): re-emit lifecycle state on reconnect to new agent ID#25406
wyrickre wants to merge 5 commits into
coder:mainfrom
wyrickre:fix/18571-lifecycle-reemit-new-agent-id

Conversation

@wyrickre

@wyrickre wyrickre commented May 15, 2026

Copy link
Copy Markdown

Maintainer note: I should have announced intent on #18571 before coding per CONTRIBUTING. Flagging this early so review cycles aren't wasted on a fait accompli; if the design direction is wrong (e.g. you'd prefer a server-side fix, or a different signal than manifest.AgentID), happy to rework or close. The PR touches lifecycleLastReportedIndex semantics that TestAgent_ReconnectNoLifecycleReemit was added to assert; the change is gated on AgentID-change so same-row reconnects still don't re-emit, but the contract shift is worth a deliberate yes/no.

Rev2 (2026-05-19): addressed the coder-agents-review CHANGES_REQUESTED. P1 (data race on lifecycleLastReportedIndex): moved the rewind from resendLastLifecycleState into reportLifecycle and gated it on an atomic.Bool flag, so lifecycleLastReportedIndex stays single-writer. P2 (stale ChangedAt on replayed state): reporter refreshes the replayed entry's ChangedAt = dbtime.Now() before the rewind. P3 fixes: added Info log on the AgentID-change branch; SetManifestAgentID helpers are now exercised by a new e2e test (DEREM-2 dead code is now live). All six nits in agent_internal_test.go resolved. Merged coder:main (no rebase, no force-push).

Rev3 (2026-05-20): addresses the second coder-agents-review pass. DEREM-12 (P2, "flag consumed before RPC success"): the outer reportLifecycle loop now checks for pending work or a set resend flag before blocking on lifecycleUpdate, so a failed UpdateLifecycle followed by a same-AgentID reconnect no longer leaves the reporter blocked on a channel nobody will signal again. DEREM-13 (P3, "single-state replay collapses startedAt/readyAt"): the rewind now targets index 0 (not len-2) and refreshes ChangedAt on every post-Created entry with 1ms spacing, so the inner loop replays the full Starting -> Ready chain and the new row records distinct started_at and ready_at. DEREM-14/15/16 (nits): dropped the DEREM-6 review-artifact reference, trimmed single-writer restatements outside the canonical field doc, and reworded the test comment to match its context-guarded mock. Also fixed a flake in the new internal test: it had been synchronizing on lifecycleReported, which uses a drop-older-keep-latest pattern, so per-iteration counting was non-deterministic; sync barrier is now ctx.cancel + <-reporterDone (20x stress under -race clean). Merged latest coder:main to pick up the deletion of useWorkspaceCreationWatcher.ts, which had a stray emdash that was making lint/emdash flake against my branch.

Summary

Fixes #18571, "Workspaces occasionally get stuck in starting phase."

The agent tracks lifecycle transitions per-process via lifecycleLastReportedIndex, but the server tracks lifecycle per workspace_agent row and initializes every new row to created. When the agent process survives a build boundary (e.g. GCE suspend/resume on long-lived VMs), RefreshToken correctly reconnects to the new agent row, but the lifecycle reporter never re-emits transitions and the new row stays at created forever. With startup_script_behavior = "blocking", coder ssh hangs indefinitely on every subsequent build.

Mechanism

  1. Build chore: Initial GHA workflow #1: agent emits created -> starting -> ready against row A. Workspace usable.
  2. coder stop creates build chore: Initial database scaffolding #2 (transition=stop). GCE suspends the VM; agent process frozen mid-flight.
  3. coder start creates build chore: Add golangci-lint and codecov #3 (transition=start). GCE resumes; agent thaws, RefreshToken exchanges instance identity for a new session token bound to row B (newly inserted by the provisioner for build chore: Add golangci-lint and codecov #3).
  4. Agent reconnects with the new token. Server-side row B is at lifecycle_state="created", started_at=null, ready_at=null. Agent's in-memory lifecycleLastReportedIndex still points at index 2 (ready was the last thing reported, to row A). reportLifecycle's loop sees nothing new and emits nothing. Row B is stuck.

Empirical confirmation (build numbers, lifecycle_state values, agent process start times): see issue comment.

Fix

handleManifest is invoked on every new RPC connection and fetches a manifest whose AgentID field is the server-side workspace_agent.ID for the current build. Comparing the new manifest's AgentID against the previously-stored manifest's AgentID is a precise signal for "the workspace_agent row changed across this reconnect." When it changes, request a lifecycle replay so the reporter pushes the most recent state to the new row on its next wake-up.

Four changes in agent/agent.go:

  1. New agent.lifecycleResendRequested atomic.Bool field. resendLastLifecycleState() is a 4-liner: Store(true) on the flag plus a non-blocking send on lifecycleUpdate. It does NOT mutate lifecycleLastReportedIndex; that field stays single-writer, only touched by reportLifecycle.
  2. reportLifecycle's outer loop checks for pending work or a set resend flag before blocking on lifecycleUpdate (the pending-work guard recovers cleanly from RPC failure mid-replay). When the resend flag is set, it acquires lifecycleMu.Lock(), refreshes lifecycleStates[i].ChangedAt for every post-Created entry with 1ms spacing (now, now+1ms, now+2ms, ...), and sets lifecycleLastReportedIndex = 0 so the inner loop replays the full Starting -> terminal chain. The 1ms spacing makes the server record distinct started_at and ready_at instead of collapsing them to the same instant.
  3. In handleManifest, after a.manifest.Swap(&manifest): if oldManifest != nil && oldManifest.AgentID != uuid.Nil && oldManifest.AgentID != manifest.AgentID, log the change and call resendLastLifecycleState(). Same-AgentID reconnects are untouched.
  4. Tests covering the helper, the trigger, the reporter's resend handler, and an end-to-end reconnect: see below.

Backward compatibility

Server-side UpdateLifecycle (coderd/agentapi/lifecycle.go:110) writes unconditionally, so a replayed Ready does produce a DB write on the new workspace_agent row and overwrites ready_at. The replayed Starting carries a fresh ChangedAt 1ms earlier; the server's Starting branch sets started_at = changedAt unconditionally (lifecycle.go:96-97), and the subsequent Ready leaves it alone (the if !startedAt.Valid guard is on the Ready/StartTimeout/StartError branch, not on Starting). Net: a coherent started_at that precedes ready_at by 1ms. The new row records a 1ms startup phase for this build, which truthfully reflects "the agent attached and was immediately ready, no setup work" (the prior build did the actual work). The build-duration metric is guarded by sync.Once per agent connection (the emitMetricsOnce lives on the per-connection LifecycleAPI struct), so no duplicate emissions.

The existing TestAgent_ReconnectNoLifecycleReemit invariant (no re-emit on a transient reconnect within the same build) is preserved; it uses a single agentID throughout, so the new condition is false and resendLastLifecycleState is not called. A dedicated SameAgentID_DoesNotSetFlag subtest in TestHandleManifestResendsLifecycleOnNewAgentID asserts this explicitly.

Tests

  • TestResendLastLifecycleState: unit tests for the helper's new contract; sets lifecycleResendRequested, signals the channel, non-blocking when the channel is full.
  • TestHandleManifestResendsLifecycleOnNewAgentID: integration test using a minimal fake proto.DRPCAgentClient28. Asserts the flag is set on AgentID change and NOT set on same-AgentID reconnect, plus handleManifest never writes lifecycleLastReportedIndex.
  • TestReportLifecycleRewindReplaysAndRefreshes: internal test exercising the reporter's resend handler with a captured UpdateLifecycle request stream. Asserts the replay begins with Starting (so the server sets started_at), ends with Ready (so the server sets ready_at), and Ready.ChangedAt is strictly after Starting.ChangedAt. Validates the DEREM-13 fix.
  • TestAgent_ReconnectReplaysLifecycleOnNewAgentID: end-to-end test using Client.SetManifestAgentID to simulate a build boundary. Real agent, fake coordinator; swaps the manifest's AgentID, forces a reconnect, asserts both Starting and Ready counts grow. Passes under -race, exercising the concurrent path that would detect a regression of P1 (DEREM-3).
  • Added (*FakeAgentAPI).SetManifestAgentID and (*Client).SetManifestAgentID to agent/agenttest/client.go. Used by the e2e test above.

Scope

Limited to templates that preserve the agent process across stop/start (long-lived VMs with desired_status="SUSPENDED"). Ephemeral pod templates spawn a fresh agent process on every start and don't hit this; their new agent runs startup_script fresh and emits transitions through the normal path.

Test plan

  • gofmt -l agent/ clean
  • go vet ./agent/... clean
  • golangci-lint v1.64.8 run ./agent/ clean (locally with GOTOOLCHAIN=go1.26.2)
  • paralleltestctx and intxcheck clean
  • go test -race -count=5 -run 'TestResendLastLifecycleState|TestHandleManifestResendsLifecycleOnNewAgentID|TestReportLifecycleRewindReplaysAndRefreshes|TestAgent_ReconnectReplaysLifecycleOnNewAgentID' ./agent/ passes (~29s wall, no flakes)
  • go test -race -count=20 -run '^TestReportLifecycleRewindReplaysAndRefreshes$' ./agent/ passes (20/20 after the rev3 sync-barrier fix)
  • TestAgent_ReconnectNoLifecycleReemit regression check: same env-specific flake on pristine origin/main, so not a regression
  • Maintainer-side: full CI on the branch
  • Manual: reproduce stuck state on a GCE suspend-template workspace, apply this build of the agent, verify post-resume lifecycle_state advances to ready on the new build

AI involvement

Per coder/coder's AI contribution guidelines: this PR was drafted with AI assistance (Claude Code).

  • Investigation, diagnosis, and code: AI-assisted. The mechanism above was identified by reading agent/agent.go, coderd/agentapi/manifest.go, and coderd/agentapi/lifecycle.go against the empirical confirmation in the linked issue comment. The patch, the helper, the integration test, and the test fakes were AI-drafted, then reviewed and run locally.
  • Empirical evidence: human-generated. The build-number / lifecycle_state / agent-process-start-time data referenced in the linked issue comment was collected by hand on a real stuck workspace (rwyrick-debug) before any code was written.
  • Verification: go vet, gofmt, golangci-lint, and the new + adjacent unit tests were run locally; see the test plan above. End-to-end validation against a live GCE-suspended workspace is still owed (the unchecked manual item in the test plan); happy to do that on request before merge, or maintainers can drive it through their own dogfood path.
  • Accountability: I'm responsible for the content of this PR, regardless of how it was generated.

The agent tracks lifecycle transitions per-process via
`lifecycleLastReportedIndex`, but the server tracks lifecycle per
workspace_agent row and initializes every new row to "created". When
the agent process survives a build boundary (e.g. GCE suspend/resume
on long-lived VMs), `RefreshToken` correctly reconnects to the new
agent row, but the lifecycle reporter never re-emits transitions and
the new row stays at "created" forever -- `coder ssh` with
`startup_script_behavior = "blocking"` hangs indefinitely.

Compare `oldManifest.AgentID` to the freshly-fetched `manifest.AgentID`
in `handleManifest`. When they differ, rewind
`lifecycleLastReportedIndex` so the next `reportLifecycle` wake-up
replays the most recent state to the new row. Same-AgentID reconnects
are untouched, preserving the invariant covered by
`TestAgent_ReconnectNoLifecycleReemit`.

Fixes coder#18571
@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@github-actions github-actions Bot added the community Pull Requests and issues created by the community. label May 15, 2026
@wyrickre

Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

cdrci2 added a commit to coder/cla that referenced this pull request May 15, 2026
@wyrickre wyrickre removed their assignment May 15, 2026
@wyrickre wyrickre marked this pull request as draft May 18, 2026 19:18
@wyrickre wyrickre marked this pull request as ready for review May 18, 2026 19:19
@matifali

Copy link
Copy Markdown
Member

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-scoped fix for a subtle cross-build lifecycle drift. The diagnosis is precise, the mechanism is correct, the scope limitation (suspend/resume on long-lived VMs) is honestly documented, and the test plan distinguishes human-collected evidence from AI-generated code. The backwards compatibility analysis correctly identifies the sync.Once metric guard and the idempotency of re-sends. Good work.

One P1 (data race that can silently negate the fix), two P2 (missing integration test, stale timestamps on replayed state), two P3 (missing log, dead code), and six nits.

The P1 is the center of gravity: 9 of 14 reviewers independently found the same data race on lifecycleLastReportedIndex. resendLastLifecycleState writes under lifecycleMu.Lock(), but reportLifecycle reads and writes the same field without any lock. A concurrent lifecycle transition during reconnect (e.g. Close() triggering setLifecycle(ShuttingDown)) creates a window where the reporter's ++ overwrites the rewind, silently causing the exact bug this PR fixes. The cleanest fix preserves the single-writer invariant: use an atomic flag to request the rewind, and let reportLifecycle perform it.

Process note: the PR description claims "re-sending ready to a row that's already ready is a no-op." The server's UpdateWorkspaceAgentLifecycleStateByID (lifecycle.go:110) writes unconditionally. It's harmless but not a no-op; the new row gets a DB write and the readyAt timestamp is overwritten. The conclusion (backward compatibility holds) is correct, but the cited evidence is wrong.

"Before this PR, lifecycleLastReportedIndex was only touched by the reporter goroutine after initialization. Single-writer meant no race. This PR adds resendLastLifecycleState as a second writer, called from the handleManifest goroutine. The lock in resendLastLifecycleState is useless because reportLifecycle never checks it for those accesses." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/agent.go Outdated
Comment thread agent/agent_internal_test.go
Comment thread agent/agent.go
Comment thread agent/agent.go Outdated
Comment thread agent/agenttest/client.go
Comment thread agent/agent_internal_test.go Outdated
Comment thread agent/agent_internal_test.go Outdated
Comment thread agent/agent_internal_test.go Outdated
Comment thread agent/agent_internal_test.go Outdated
Comment thread agent/agent_internal_test.go Outdated
wyrickre added 2 commits May 19, 2026 16:24
Addresses the coder-agents-review CHANGES_REQUESTED on coder#25406.

P1 (DEREM-3): The previous patch introduced a data race on
`lifecycleLastReportedIndex`. `resendLastLifecycleState` wrote the
field under `lifecycleMu.Lock()`, but `reportLifecycle` reads and
writes the same field without any lock. A concurrent
`setLifecycle(ShuttingDown)` during reconnect could cause the
reporter's `++` to overwrite the rewind, silently negating the fix.

Replace the lock-protected write with an `atomic.Bool`
(`lifecycleResendRequested`). `resendLastLifecycleState` now only
sets the flag and signals the lifecycle channel; the actual rewind
happens at the top of `reportLifecycle`'s outer loop, keeping that
goroutine the sole writer of `lifecycleLastReportedIndex`.

P2 (DEREM-6): When the reporter consumes the resend flag it now
refreshes `lifecycleStates[len-1].ChangedAt` to `dbtime.Now()`
before the rewind. This prevents the new workspace_agent row from
receiving the original build's stale wall clock as its `started_at`
and `ready_at`, which would have produced negative
`build_duration` histogram samples and predates-`created_at`
timestamps on every suspend/resume workspace.

P3 (DEREM-5): Add an Info-level log on the AgentID-change branch in
`handleManifest` so operators have a breadcrumb when the replay
fires.

P3 (DEREM-2): `Client.SetManifestAgentID` and the matching
`FakeAgentAPI` method are now used by the new e2e test, killing
the dead-code finding.

Tests:
- `TestResendLastLifecycleState` simplified to cover the helper's
  new contract (set flag + signal channel).
- `TestHandleManifestResendsLifecycleOnNewAgentID` updated to
  assert the flag instead of the index.
- `TestReportLifecycleRewindRefreshesChangedAt` new internal test
  exercising the reporter's rewind path and the timestamp
  refresh, validating P2.
- `TestAgent_ReconnectReplaysLifecycleOnNewAgentID` new end-to-end
  test using `SetManifestAgentID` to simulate a build boundary;
  passes under `-race`, exercising the concurrent path that would
  detect a regression of P1.

Also fixes review nits 1, 7, 8, 9, 10, 11 in `agent_internal_test.go`
(em-dash punctuation, missing channel-signal assertion, `t.Context()`,
`manifestOK` rename, `newIntegrationAgent` rename, behavior-describing
subtest names).
…eemit-new-agent-id

# Conflicts:
#	agent/agent_internal_test.go
@wyrickre

Copy link
Copy Markdown
Author

Rev2 pushed in 477bf0b + merge 5a8d001. Addressing the coder-agents-review findings:

DEREM Pri Status Fix
3 P1 Addressed; left open for design feedback New lifecycleResendRequested atomic.Bool. resendLastLifecycleState only sets the flag + signals the channel; the actual rewind happens at the top of reportLifecycle's outer loop under lifecycleMu.Lock(). lifecycleLastReportedIndex stays single-writer (only reportLifecycle touches it).
6 P2 Addressed; left open for design feedback When the reporter consumes the resend flag it now sets lifecycleStates[len-1].ChangedAt = dbtime.Now() under the lock before the rewind. The replay arrives at the new row with a timestamp coherent with its created_at, so started_at/ready_at no longer go negative and the coderd_agents_build_duration_seconds histogram stays clean.
4 P2 Resolved Two new tests. TestReportLifecycleRewindRefreshesChangedAt (internal) runs reportLifecycle against a captured UpdateLifecycle and asserts the replayed Ready carries a fresh ChangedAt. TestAgent_ReconnectReplaysLifecycleOnNewAgentID (external) runs the full agent with a FakeCoordinator, swaps manifest.AgentID mid-flight, forces a reconnect, and asserts Ready is re-reported. Passes under -race (5x stress).
5 P3 Resolved a.logger.Info(ctx, "agent ID changed across reconnect, replaying lifecycle state", slog.F("old_agent_id", ...), slog.F("new_agent_id", ...)) added at the trigger point in handleManifest.
2 P3 Resolved Client.SetManifestAgentID and FakeAgentAPI.SetManifestAgentID are now used by the new TestAgent_ReconnectReplaysLifecycleOnNewAgentID. No longer dead code.
1 Nit Resolved The -- punctuation in the fakeAgentAPIClient comment is now "which is fine. The test only exercises..." per AGENTS.md.
7 Nit Resolved TestResendLastLifecycleState was simplified to the helper's new contract; the architectural-detail subtests (including the ShuttingDown one that skipped the channel check) are superseded by TestReportLifecycleRewindRefreshesChangedAt, which covers behavior across all states with consistent assertions.
8 Nit Resolved newIntegrationAgent uses t.Context() instead of context.WithCancel(context.Background()) + t.Cleanup.
9 Nit Resolved mokmanifestOK (matches the production variable name in handleManifest).
10 Nit Resolved mkAgentnewIntegrationAgent, consistent with the Go constructor convention used elsewhere.
11 Nit Resolved Subtest names rewritten to describe behavior: NewAgentID_SetsResendFlag, SameAgentID_DoesNotSetFlag, TestReportLifecycleRewindRefreshesChangedAt.

Process note from the review summary (separate from inline comments): the original PR description claimed "re-sending Ready to a row that's already Ready is a no-op." That was wrong — UpdateWorkspaceAgentLifecycleStateByID writes unconditionally. The PR description has been corrected; the backward-compatibility conclusion still holds because (after DEREM-6) the overwrite uses a coherent timestamp.

Verification locally:

  • go vet, gofumpt/gci, golangci-lint v1.64.8 (with GOTOOLCHAIN=go1.26.2), paralleltestctx, intxcheck all clean against ./agent/.
  • All new + adjacent lifecycle tests pass under -race, including a 5x stress run of the two race-sensitive tests.
  • coder:main merged (not rebased), no force-push.

Happy to iterate further on the P1/P2 design if the maintainer prefers a different approach — left those threads open intentionally.

@matifali

Copy link
Copy Markdown
Member

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 11 R1 findings addressed cleanly. The atomic.Bool flag pattern correctly resolves the P1 data race (DEREM-3) by preserving the single-writer invariant on lifecycleLastReportedIndex. The timestamp refresh (DEREM-6) makes the new row's started_at/ready_at coherent with created_at. The new e2e test (TestAgent_ReconnectReplaysLifecycleOnNewAgentID) exercises the concurrent path under -race. Thorough response.

One new P2, one P3, three nits.

The P2 is a design-level observation: Swap(false) at line 842 consumes the resend flag before the RPC succeeds. If UpdateLifecycle fails (connection drops), reportLifecycle returns an error and run() restarts. On the new connection, handleManifest sees the same AgentID (same build), does not re-trigger resend, and nobody signals lifecycleUpdate. The reporter blocks forever with a rewound index but no wake signal. This gap existed in rev1 too (channel signal was consumed the same way). It's worth knowing because the target scenario (suspend/resume) involves network instability, but it doesn't invalidate the fix for the primary path.

The P3 notes that replaying only the terminal state (Ready) means the new row gets startedAt == readyAt, producing 0s startup duration in cliui.Agent. This is a known tradeoff of single-state replay; replaying the full sequence would require protocol changes.

Process note: the PR description's "sync.Once per agent process" should say "per agent connection" since emitMetricsOnce lives on the server-side LifecycleAPI struct, which is per-connection.

"If UpdateLifecycle fails after Swap(false) consumes the flag, the next reconnect sees oldManifest.AgentID == manifest.AgentID, never re-triggers resend, and the new row stays at created." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/agent.go
Comment thread agent/agent.go Outdated
Comment thread agent/agent_internal_test.go Outdated
Comment thread agent/agent.go Outdated
Comment thread agent/agent_internal_test.go Outdated
wyrickre added 2 commits May 20, 2026 03:12
Address rev2 review feedback on coder#25406:

- DEREM-12 (P2): Guard the outer reportLifecycle loop on pending work
  and the resend flag, so a failed UpdateLifecycle followed by a
  same-AgentID reconnect doesn't leave the reporter blocked on a
  channel nobody will signal again.
- DEREM-13 (P3): Rewind to index 0 (not len-2) so the replay covers
  Starting and the terminal state, and refresh ChangedAt on every
  replayed entry with 1ms spacing. The new workspace_agent row now
  records distinct started_at and ready_at, eliminating the
  startedAt == readyAt degeneracy that displayed 0s startup duration.
- DEREM-14, 15, 16 (nits): drop the DEREM-6 review-artifact comment,
  trim single-writer restatements outside the canonical field doc,
  and reword the test comment to match its context-guarded mock.

Also fix a flake in the new internal test: it synchronized on
lifecycleReported, but the reporter's send pattern there is
drop-older-keep-latest, so per-iteration counting was
non-deterministic. The sync barrier is now ctx.cancel +
reporterDone, which is what the reporter actually guarantees.
@wyrickre

Copy link
Copy Markdown
Author

Rev3 pushed in 78ef441 (commits on top of the rebased rev2). Status table for the second coder-agents-review pass:

DEREM Pri Status Fix
12 P2 Addressed; left open for design feedback The outer reportLifecycle loop now reads pendingWork = lifecycleLastReportedIndex < len(lifecycleStates)-1 under RLock and checks lifecycleResendRequested.Load() before blocking on lifecycleUpdate. If either is true, the select is skipped so the inner loop drains the pending work immediately. This recovers from the failure mode the review described: a failed UpdateLifecycle followed by a same-AgentID reconnect no longer leaves the reporter wedged on a channel nobody will signal again. Chose this over "re-set the flag on RPC failure" because it also covers the case where the rewind already happened but the RPC failed mid-replay (the index alone is enough evidence of pending work).
13 P3 Addressed; left open for design feedback Rewind now targets index 0 (not len-2), and ChangedAt is refreshed on every post-Created entry with 1ms spacing (now, now+1ms, now+2ms, ...). The inner loop replays the full Starting → terminal chain. Result on the new row: started_at = T+0, ready_at = T+1ms, so started_at < ready_at and cliui.Agent shows a coherent (if tiny) startup duration rather than 0s. The 1ms is the minimum interval that survives DB timestamp precision; treating it as "this build did no setup work" is the truthful framing. Build-duration metric is unaffected (still readyAt - build.CreatedAt ≈ 0s, which is correct: this build genuinely did no setup work).
14 Nit Resolved Dropped the DEREM-6 in the coder-agents-review on coder/coder#25406 reference from the test comment; the surrounding sentence stands on its own.
15 Nit Resolved Single-writer invariant statement is now only on the canonical lifecycleLastReportedIndex field comment. Trimmed from resendLastLifecycleState's docstring, from the reporter's resend-block comment, and from the e2e test's docstring.
16 Nit Resolved Test comment now reads "context-guarded on send" to match the select { case f.captured <- req: case <-ctx.Done(): ... } mock pattern.

Also rolled in:

  • Test-flake fix. The internal test for the rewind path (TestReportLifecycleRewindReplaysAndRefreshes, renamed from ...RefreshesChangedAt) had been synchronizing on a.lifecycleReported. The reporter's send pattern there is drop-older-keep-latest (the select picks randomly between the send and the drain-and-send branches when both are ready), so per-iteration counting on the test side was ~80% reliable under -race. Sync barrier is now cancel(); <-reporterDone, which is what the reporter actually guarantees. 20x stress under -race clean.
  • Process note on backward-compatibility wording. The "build-duration metric is guarded by sync.Once per agent process" line in the PR description was inaccurate (emitMetricsOnce lives on the per-connection LifecycleAPI struct, not on the agent process). PR description fixed.
  • R1 follow-ups. The two threads I'd intentionally left open after rev2 (DEREM-3 P1 race, DEREM-6 P2 timestamp) are now resolved per "All 11 R1 findings addressed cleanly" in the rev2 of the review.
  • Latest coder:main merged (no rebase, no force-push). The merge brought in deletion of useWorkspaceCreationWatcher.ts, which had an emdash that was making lint/emdash false-fail against this branch.

Local verification:

  • gofmt -l agent/, go vet ./agent/..., golangci-lint v1.64.8 run ./agent/ (with GOTOOLCHAIN=go1.26.2), paralleltestctx, intxcheck: all clean.
  • go test -race -count=5 -run 'TestResendLastLifecycleState|TestHandleManifestResendsLifecycleOnNewAgentID|TestReportLifecycleRewindReplaysAndRefreshes|TestAgent_ReconnectReplaysLifecycleOnNewAgentID' ./agent/: passes, ~29s wall.
  • go test -race -count=20 -run '^TestReportLifecycleRewindReplaysAndRefreshes$' ./agent/: 20/20 after the sync-barrier fix.
  • TestAgent_ReconnectNoLifecycleReemit still fails on this workstation, but reproduces against pristine origin/main, so it's the documented environmental flake (Linux OsFs / system SSH on this Cloud Workstation), not a regression.

Happy to iterate further on the DEREM-12 / DEREM-13 design choices if the maintainer prefers a different shape. Both threads intentionally left open for that conversation.

@wyrickre

Copy link
Copy Markdown
Author

/coder-agents-review

1 similar comment
@matifali

Copy link
Copy Markdown
Member

/coder-agents-review

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three rounds, 16 findings (1 P1, 3 P2, 3 P3, 9 Nit), all addressed. The fix is clean.

R1 caught a P1 data race (9/14 reviewers converged), a P2 test gap, a P2 stale-timestamp issue, and assorted nits. R2 found a P2 flag-consumption edge case and a P3 from single-state replay. R3: 13 reviewers, zero code-level findings. Every prior finding verified fixed.

The final shape is solid: atomic.Bool flag preserves single-writer on lifecycleLastReportedIndex, pending-work guard in the outer loop recovers from mid-replay RPC failure, full sequence replay with 1ms-spaced timestamps gives the new row coherent started_at/ready_at, and the e2e test exercises the concurrent path under -race.

Minor PR description note: the backward-compatibility section says the server "set started_at only if null" for the Starting replay. The server code (lifecycle.go:96-97) sets startedAt unconditionally on a Starting transition. The code comments in agent.go are accurate; only the prose in the PR description is slightly off. Not worth blocking on.

"The process quality is high; the work is genuine." (Mafu-san)

🤖 This review was automatically generated with Coder Agents.

@wyrickre

Copy link
Copy Markdown
Author

PR description updated to fix the cosmetic item from the R3 review.

The Backward compatibility section previously said the server "set started_at only if null" on the replayed Starting. Per coderd/agentapi/lifecycle.go:95-97, the server's Starting branch sets startedAt = dbChangedAt unconditionally; the if !startedAt.Valid guard is on the Ready/StartTimeout/StartError branch. Net effect on the new row is the same (coherent started_at 1ms before ready_at), but the path is now described accurately.

@wyrickre

Copy link
Copy Markdown
Author

@matifali - What are my next steps to get this mergeable ?

@matifali

Copy link
Copy Markdown
Member

@wyrickre Thanks for your patience. I have shared that internally for rebiew. It might take a few days for us to review this

@deansheather

Copy link
Copy Markdown
Member

I think a better approach would be to just completely reinitialize the agent after detecting the agent ID changing. With this patch, if changes were made to the agent manifest in the terraform, they probably wouldn't be applied by the old agent.

We have a system for reinitializing the agent in the agent code already, which gets triggered by a prebuild claim, but we could use similar logic here to force a reinitialization if the agent ID has changed upon reconnect.

@github-actions github-actions Bot added the stale This issue is like stale bread. label Jun 3, 2026
@github-actions github-actions Bot closed this Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community Pull Requests and issues created by the community. stale This issue is like stale bread.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Workspaces occasionally get stuck in starting phase

3 participants