fix(supervisor): cancel pending delayed snapshots when the run completes or disconnects#3894
Conversation
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR adds runner-aware snapshot cancellation to prevent stale runner instances from cancelling snapshots after redeployment or reconnection. A new 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
a76626c to
829fec6
Compare
…nnects The compute suspend flow delays snapshots by snapshotDelayMs to avoid wasted work on short-lived waitpoints, with the intent that a run which continues before the delay expires cancels the pending snapshot. But the only cancel() call site is the /continue workload action, which runners only invoke when restoring from an already-taken snapshot - so a pending snapshot is never actually cancelled (zero snapshot.canceled events in prod). When a run resumes and completes within the delay window, the stale snapshot fires anyway and fcrun pauses the VM for ~6-13s while its controller is mid warm-start long-poll. The frozen guest can't fire its abort timer or send a FIN, so firestarter keeps the connection claimable past the client deadline and dispatches runs into it - each one a ~300s stall (TRI-10293). Cancel the pending snapshot when the attempt completes and when the run socket disconnects. Genuine waitpoint suspensions keep the runner socket connected and the attempt incomplete, so neither hook cancels a snapshot that is still wanted. Cancellation is guarded by runnerId so a stale duplicate runner for a reassigned run can't cancel the new runner's pending snapshot.
829fec6 to
2f5b8db
Compare
A runner calling attempts/complete has finished executing either way - on a transient completion failure it retries the call, so cancelling only on success leaves the stale snapshot armed in the meantime. The runnerId guard already covers the stale-duplicate-runner case whose completion fails server-side validation. Addresses CodeRabbit review feedback on the PR.
The timer wheel can tick during the awaited completeRunAttempt HTTP round trip, so cancelling after it leaves a window where a due snapshot dispatches anyway and pauses a VM that has moved on. Cancel first - the runnerId guard is ordering-independent, and the runner can't schedule a new suspend until it receives this route's reply, so nothing legitimate can be cancelled early. Matches the existing /continue ordering. Addresses Devin review feedback.
The compute suspend flow delays snapshots by
snapshotDelayMs(~30s) so short-lived waitpoints skip the snapshot entirely, with the intent that a run continuing before the delay expires cancels the pending snapshot. But the onlycancel()call site was the/continueaction, which runners only invoke when restoring from an already-taken snapshot — so pending snapshots were never cancelled (zerosnapshot.canceledevents ever emitted in prod). When a run resumed and completed inside the window, the stale snapshot fired ~30s later anyway, pausing the VM 6–13s mid warm-start long-poll; the frozen guest couldn't fire its abort timer or send a FIN, causing stalls and run-engine driven retries.Change
attempt.complete— after the platform accepts the completion, before the HTTP reply (so it can't reorder with the runner's next/suspend).runDisconnected(crash, exit, or run replaced on the socket).TimerWheel.peek()): a stale duplicate runner for a reassigned run must not cancel the fresh runner's pending snapshot. A missing runnerId falls through to an unconditional cancel (the pre-existing/continuebehavior is unchanged).Waitpoint suspensions keep the runner socket connected and the attempt incomplete, so neither hook touches a snapshot that is still wanted.
Known limitation (fail-safe direction):
socket.data.runnerIdis frozen at the websocket handshake, so after a same-supervisor restore the disconnect-path guard refuses the cancel. Theattempt.completepath uses the runner's current header id and is unaffected.