fix(run-engine): retry getSnapshotsSince on the replica then primary when the read replica lags by d-cs · Pull Request #3889 · triggerdotdev/trigger.dev

d-cs · 2026-06-10T14:17:51Z

Summary

When RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED is on, RunEngine.getSnapshotsSince reads from the read replica. During write spikes the replica can briefly lag, so the snapshot id a runner just learned from the writer isn't visible there yet: the lookup threw, the worker route returned a 500, and the runner waited for its next poll — turning sub-second snapshot notifications into poll-interval latency exactly when things are busiest. This PR makes the flag safe to enable: a replica miss of the since snapshot gets one jittered retry on the replica (most lag windows are shorter than the ~50–200ms wait, so the writer is never touched), then falls back to the primary, observed via a new run_engine.snapshots_since.replica_miss counter with an outcome attribute (replica_retry vs primary). Only genuine misses — absent on the primary too — remain errors.

Design

getExecutionSnapshotsSince now throws a typed ExecutionSnapshotNotFoundError so the engine can distinguish the expected lag miss from real failures. The message string is unchanged and the error never leaves the engine.
The recovery path only engages when the flag is on, a distinct replica client is configured, and no transaction client was passed. With the flag off, the path is behaviorally identical to before.
Retry delay bounds are configurable (RUN_ENGINE_SNAPSHOTS_SINCE_REPLICA_RETRY_MIN_MS/MAX_MS, default 50/200; MAX_MS=0 skips the replica retry and goes straight to the primary).
The warn log fires only when the primary serves the read (the writer spill is the operationally interesting event); replica-retry recoveries are counted but quiet. A permanently-missing snapshot id stays an error-level failure with a failedDuring field, so lag metrics aren't polluted by bogus ids.
Stale-tail lag (replica has the since snapshot but not newer rows) deliberately still returns the replica's view; the next poll catches up.
The since-snapshot anchor lookup is now scoped to the polled run (where: { id, runId }), so a snapshot id from a different run raises not-found instead of silently anchoring a too-wide window of the run's snapshots.

Test plan

All vitest + testcontainers, no mocks. A new schemaOnlyPrisma fixture (migrated-but-empty clone database) simulates a replica that hasn't caught up, and a real in-memory OTel meter pins the counter semantics per outcome.

Replica catches up during the jittered retry window → served by the replica, outcome=replica_retry = 1, primary never consulted
Replica permanently missing the since snapshot → served by the primary, outcome=primary = 1
Snapshot missing on both replica and primary → null, counter = 0
Replica has the since snapshot but lags by one → the replica's view is served, no fallback (verified discriminating power: the test fails if reads secretly hit the primary)
Flag off with a replica configured → primary serves the read
Transaction client provided → bypasses the replica entirely
Since snapshot belonging to a different run → null
Existing getSnapshotsSince + waitpoints suites green; run-engine, testcontainers, and webapp typechecks pass

🤖 Generated with Claude Code

changeset-bot · 2026-06-10T14:17:56Z

⚠️ No Changeset found

Latest commit: 8418c7b

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai · 2026-06-10T14:18:08Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

This PR implements replica-aware snapshot polling for the Run Engine. When read-replica snapshots are enabled, getSnapshotsSince now validates that the requested snapshot ID belongs to the specified run, retries once on the replica with configurable jittered delay, and falls back to the primary database if the replica hasn't caught up. New test helpers enable copying snapshots to replicas and collecting OpenTelemetry metrics. A new schema-only Prisma fixture supports replica testing. Integration tests validate retry logic, metrics recording, primary fallback, and feature flag control.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding retry logic on the read replica followed by primary fallback for getSnapshotsSince when replica lags.
Description check	✅ Passed	The description includes comprehensive sections covering summary, design decisions, test plan with verification checkmarks, and follows the required template structure with most key elements addressed.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch snapshots-since-replica-primary-fallback

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

… for replica-lag tests Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…edundant cast Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ionSnapshotsSince Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…lica misses the since snapshot Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…p fallback without a replica Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… and counter semantics for getSnapshotsSince Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…note, trim dead test setup Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… a different run (red) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ment Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…etSnapshotsSince Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…stinguish fallback failure logs Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ds, tie-stable test assertions, shared clone/drop helpers Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

pkg-pr-new · 2026-06-10T15:27:47Z

Open in StackBlitz

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@f4dac20

trigger.dev

npm i https://pkg.pr.new/trigger.dev@f4dac20

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@f4dac20

@trigger.dev/plugins

npm i https://pkg.pr.new/@trigger.dev/plugins@f4dac20

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@f4dac20

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@f4dac20

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@f4dac20

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@f4dac20

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@f4dac20

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@f4dac20

commit: f4dac20

d-cs · 2026-06-10T15:29:55Z

Critical path: before / after

How a runner's GET /snapshots/since/:id poll behaves during a replica-lag window with RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED on.

Before (why enabling the flag was previously unsafe)

sequenceDiagram
    participant R as Runner
    participant E as Webapp / RunEngine
    participant RO as Read replica
    participant P as Primary (writer)

    R->>E: poll since snap_X
    E->>RO: findFirst(snap_X)
    RO-->>E: ✗ not replicated yet
    Note over E: ERROR log → alert noise
    E-->>R: 500
    Note over R: wait for next poll interval (~5s)<br/>notification path degrades to polling
    R->>E: poll again
    E->>RO: findFirst(snap_X)
    RO-->>E: ✓ replicated by now
    E-->>R: 200 + snapshots

Cost per lag hit: error spam + alert noise + up to a full poll interval of added latency on the snapshot transition.

After (this PR)

sequenceDiagram
    participant R as Runner
    participant E as Webapp / RunEngine
    participant RO as Read replica
    participant P as Primary (writer)

    R->>E: poll since snap_X
    E->>RO: findFirst(snap_X)
    RO-->>E: ✗ not replicated yet
    Note over E: sleep 50–200ms (jittered, configurable.<br/>MAX_MS=0 skips straight to primary)
    E->>RO: retry findFirst(snap_X)
    alt replica caught up (common case)
        RO-->>E: ✓ snapshots
        Note over E: counter++ outcome=replica_retry (quiet)
        E-->>R: 200 + snapshots
    else lag outlasts the retry window
        RO-->>E: ✗ still missing
        E->>P: full query on the writer
        P-->>E: ✓ snapshots
        Note over E: counter++ outcome=primary + WARN (writer spill)
        E-->>R: 200 + snapshots
    else permanent miss (bogus / pruned id, or wrong run)
        RO-->>E: ✗ still missing
        E->>P: full query on the writer
        P-->>E: ✗ not found
        Note over E: ERROR log — unchanged from today
        E-->>R: 500
    end

Cost per lag hit: ≤ ~200ms added latency; the writer is only touched when lag outlasts the retry window.

Invariants

Flag off (and/or no distinct replica client configured, and/or a tx passed): none of this exists — the path is the old single primary read, unchanged.
Error path preserved: a snapshot id that is genuinely gone — or belongs to a different run, after the scoping fix in this PR — walks all three attempts and still ends in the same error-log-plus-500 as today, so stuck clients stay visible.
The run_engine.snapshots_since.replica_miss counter's two outcomes give the rollout signal: replica_retry ≈ lag frequency, primary ≈ writer-spill rate.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 6 additional findings.

…eplica retry Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ericallam · 2026-06-10T15:52:51Z

@d-cs probably doesn't need 2 server changes files 👍

…entry Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

devin-ai-integration

Devin Review found 1 new potential issue.

View 6 additional findings in Devin Review.

…the writer fallback Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

d-cs · 2026-06-10T16:22:42Z

Combined into a single server-changes entry in 51924bb.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

d-cs self-assigned this Jun 10, 2026

This comment was marked as resolved.

Sign in to view

d-cs and others added 17 commits June 10, 2026 16:22

feat(testcontainers): schemaOnlyPrisma fixture - empty template clone…

79dd2da

… for replica-lag tests Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(run-engine): cover getSnapshotsSince replica-miss fallback (red)

d44f4f3

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(run-engine): tie-proof replica fallback window assertion, drop r…

516533c

…edundant cast Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(run-engine): typed ExecutionSnapshotNotFoundError from getExecut…

ade8c55

…ionSnapshotsSince Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(run-engine): retry getSnapshotsSince on the primary when the rep…

73d5c55

…lica misses the since snapshot Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fix(run-engine): only count replica misses the primary can serve; ski…

d680da8

…p fallback without a replica Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chore: server-changes note for snapshots-since replica fallback

6651568

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(run-engine): cover replica-read, stale-tail, flag-off, tx bypass…

a868b3a

… and counter semantics for getSnapshotsSince Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fix(run-engine): mark replica-retry failures in logs, narrow release …

9737060

…note, trim dead test setup Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(run-engine): getSnapshotsSince must reject a since snapshot from…

a5d59a1

… a different run (red) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fix(run-engine): scope getSnapshotsSince anchor lookup to the polled run

1c128bd

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chore(run-engine): trim assertion-echo comments, tighten fallback com…

4f98828

…ment Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(run-engine): replica catches up during jittered retry window (red)

01675a8

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(run-engine): jittered replica retry before primary fallback in g…

b4625be

…etSnapshotsSince Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chore(run-engine): harden retry-window test against cold connects, di…

6673a94

…stinguish fallback failure logs Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chore: unwrap server-changes notes to one line per paragraph

1157227

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fix(run-engine,testcontainers): address review - normalize retry boun…

f4dac20

…ds, tie-stable test assertions, shared clone/drop helpers Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

d-cs force-pushed the snapshots-since-replica-primary-fallback branch from a26a5f0 to f4dac20 Compare June 10, 2026 15:24

devin-ai-integration Bot reviewed Jun 10, 2026

View reviewed changes

docs(run-engine): document maxMs <= 0 (not just 0) as disabling the r…

f26b646

…eplica retry Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

d-cs marked this pull request as ready for review June 10, 2026 15:49

ericallam approved these changes Jun 10, 2026

View reviewed changes

chore: combine the two snapshots-since server-changes notes into one …

51924bb

…entry Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

devin-ai-integration Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread internal-packages/run-engine/src/engine/index.ts

docs(run-engine): clarify that only not-found replica errors trigger …

3a5e88b

…the writer fallback Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chore: retrigger CI

8418c7b

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

d-cs enabled auto-merge (squash) June 10, 2026 16:53

d-cs merged commit 6afc9bf into main Jun 10, 2026
88 of 91 checks passed

d-cs deleted the snapshots-since-replica-primary-fallback branch June 10, 2026 16:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(run-engine): retry getSnapshotsSince on the replica then primary when the read replica lags#3889

fix(run-engine): retry getSnapshotsSince on the replica then primary when the read replica lags#3889
d-cs merged 21 commits into
mainfrom
snapshots-since-replica-primary-fallback

d-cs commented Jun 10, 2026

Uh oh!

changeset-bot Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

Reviews paused

❌ Failed checks (1 warning)

Uh oh!

This comment was marked as resolved.

Uh oh!

pkg-pr-new Bot commented Jun 10, 2026

Uh oh!

d-cs commented Jun 10, 2026 •

edited

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

ericallam commented Jun 10, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

Uh oh!

d-cs commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

d-cs commented Jun 10, 2026

Summary

Design

Test plan

Uh oh!

changeset-bot Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

❌ Failed checks (1 warning)

Uh oh!

This comment was marked as resolved.

Uh oh!

pkg-pr-new Bot commented Jun 10, 2026

Uh oh!

d-cs commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Critical path: before / after

Before (why enabling the flag was previously unsafe)

After (this PR)

Invariants

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

ericallam commented Jun 10, 2026

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

d-cs commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

changeset-bot Bot commented Jun 10, 2026 •

edited

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

d-cs commented Jun 10, 2026 •

edited

Loading