fix(supervisor): cancel pending delayed snapshots when the run completes or disconnects by myftija · Pull Request #3894 · triggerdotdev/trigger.dev

myftija · 2026-06-10T16:14:23Z

The compute suspend flow delays snapshots by snapshotDelayMs (~30s) so short-lived waitpoints skip the snapshot entirely, with the intent that a run continuing before the delay expires cancels the pending snapshot. But the only cancel() call site was the /continue action, which runners only invoke when restoring from an already-taken snapshot — so pending snapshots were never cancelled (zero snapshot.canceled events ever emitted in prod). When a run resumed and completed inside the window, the stale snapshot fired ~30s later anyway, pausing the VM 6–13s mid warm-start long-poll; the frozen guest couldn't fire its abort timer or send a FIN, causing stalls and run-engine driven retries.

Change

Cancel the pending snapshot on attempt.complete — after the platform accepts the completion, before the HTTP reply (so it can't reorder with the runner's next /suspend).
Cancel on runDisconnected (crash, exit, or run replaced on the socket).
Both cancels are guarded by a runnerId match (new TimerWheel.peek()): a stale duplicate runner for a reassigned run must not cancel the fresh runner's pending snapshot. A missing runnerId falls through to an unconditional cancel (the pre-existing /continue behavior is unchanged).

Waitpoint suspensions keep the runner socket connected and the attempt incomplete, so neither hook touches a snapshot that is still wanted.

Known limitation (fail-safe direction): socket.data.runnerId is frozen at the websocket handshake, so after a same-supervisor restore the disconnect-path guard refuses the cancel. The attempt.complete path uses the runner's current header id and is unaffected.

changeset-bot · 2026-06-10T16:14:28Z

⚠️ No Changeset found

Latest commit: 3b4c9db

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai · 2026-06-10T16:14:51Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

This PR adds runner-aware snapshot cancellation to prevent stale runner instances from cancelling snapshots after redeployment or reconnection. A new TimerWheel.peek(key) method inspects pending entries without removing them, enabling ComputeSnapshotService.cancel() to check the stored runnerId and refuse cancellation if it differs. The implementation is covered by comprehensive tests for scheduling, cancellation guards, and snapshot replacement semantics. Two workloadServer handlers—attempt.complete (HTTP) and runDisconnected (WebSocket)—now pass the runner identifier when cancelling snapshots.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: canceling pending delayed snapshots when runs complete or disconnect, which is the core fix addressed throughout the changeset.
Description check	✅ Passed	The description provides a clear problem statement, explains the root cause, documents the specific changes made, and notes known limitations, but does not include testing steps or changelog section as specified in the template.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch tri-10293-cancel-stale-delayed-snapshots

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…nnects The compute suspend flow delays snapshots by snapshotDelayMs to avoid wasted work on short-lived waitpoints, with the intent that a run which continues before the delay expires cancels the pending snapshot. But the only cancel() call site is the /continue workload action, which runners only invoke when restoring from an already-taken snapshot - so a pending snapshot is never actually cancelled (zero snapshot.canceled events in prod). When a run resumes and completes within the delay window, the stale snapshot fires anyway and fcrun pauses the VM for ~6-13s while its controller is mid warm-start long-poll. The frozen guest can't fire its abort timer or send a FIN, so firestarter keeps the connection claimable past the client deadline and dispatches runs into it - each one a ~300s stall (TRI-10293). Cancel the pending snapshot when the attempt completes and when the run socket disconnects. Genuine waitpoint suspensions keep the runner socket connected and the attempt incomplete, so neither hook cancels a snapshot that is still wanted. Cancellation is guarded by runnerId so a stale duplicate runner for a reassigned run can't cancel the new runner's pending snapshot.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

A runner calling attempts/complete has finished executing either way - on a transient completion failure it retries the call, so cancelling only on success leaves the stale snapshot armed in the meantime. The runnerId guard already covers the stale-duplicate-runner case whose completion fails server-side validation. Addresses CodeRabbit review feedback on the PR.

The timer wheel can tick during the awaited completeRunAttempt HTTP round trip, so cancelling after it leaves a window where a due snapshot dispatches anyway and pauses a VM that has moved on. Cancel first - the runnerId guard is ordering-independent, and the runner can't schedule a new suspend until it receives this route's reply, so nothing legitimate can be cancelled early. Matches the existing /continue ordering. Addresses Devin review feedback.

This comment was marked as resolved.

Sign in to view

myftija force-pushed the tri-10293-cancel-stale-delayed-snapshots branch from a76626c to 829fec6 Compare June 11, 2026 13:50

myftija force-pushed the tri-10293-cancel-stale-delayed-snapshots branch from 829fec6 to 2f5b8db Compare June 11, 2026 14:08

myftija marked this pull request as ready for review June 11, 2026 14:16

devin-ai-integration Bot reviewed Jun 11, 2026

View reviewed changes

myftija added 2 commits June 11, 2026 16:40

chore: add server-changes entry

3b4c9db

This comment was marked as resolved.

Sign in to view

ericallam approved these changes Jun 11, 2026

View reviewed changes

myftija merged commit d0b2d79 into main Jun 11, 2026
23 checks passed

myftija deleted the tri-10293-cancel-stale-delayed-snapshots branch June 11, 2026 16:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(supervisor): cancel pending delayed snapshots when the run completes or disconnects#3894

fix(supervisor): cancel pending delayed snapshots when the run completes or disconnects#3894
myftija merged 4 commits into
mainfrom
tri-10293-cancel-stale-delayed-snapshots

myftija commented Jun 10, 2026 •

edited

Loading

Uh oh!

changeset-bot Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

Reviews paused

❌ Failed checks (1 warning)

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

myftija commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change

Uh oh!

changeset-bot Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

❌ Failed checks (1 warning)

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

myftija commented Jun 10, 2026 •

edited

Loading

changeset-bot Bot commented Jun 10, 2026 •

edited

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading