fix(run-engine): retry getSnapshotsSince on the replica then primary when the read replica lags#3889
Conversation
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR implements replica-aware snapshot polling for the Run Engine. When read-replica snapshots are enabled, 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
… for replica-lag tests Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…edundant cast Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ionSnapshotsSince Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…lica misses the since snapshot Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…p fallback without a replica Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… and counter semantics for getSnapshotsSince Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…note, trim dead test setup Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… a different run (red) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ment Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…etSnapshotsSince Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…stinguish fallback failure logs Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ds, tie-stable test assertions, shared clone/drop helpers Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
a26a5f0 to
f4dac20
Compare
@trigger.dev/build
trigger.dev
@trigger.dev/core
@trigger.dev/plugins
@trigger.dev/python
@trigger.dev/react-hooks
@trigger.dev/redis-worker
@trigger.dev/rsc
@trigger.dev/schema-to-json
@trigger.dev/sdk
commit: |
Critical path: before / afterHow a runner's Before (why enabling the flag was previously unsafe)sequenceDiagram
participant R as Runner
participant E as Webapp / RunEngine
participant RO as Read replica
participant P as Primary (writer)
R->>E: poll since snap_X
E->>RO: findFirst(snap_X)
RO-->>E: ✗ not replicated yet
Note over E: ERROR log → alert noise
E-->>R: 500
Note over R: wait for next poll interval (~5s)<br/>notification path degrades to polling
R->>E: poll again
E->>RO: findFirst(snap_X)
RO-->>E: ✓ replicated by now
E-->>R: 200 + snapshots
Cost per lag hit: error spam + alert noise + up to a full poll interval of added latency on the snapshot transition. After (this PR)sequenceDiagram
participant R as Runner
participant E as Webapp / RunEngine
participant RO as Read replica
participant P as Primary (writer)
R->>E: poll since snap_X
E->>RO: findFirst(snap_X)
RO-->>E: ✗ not replicated yet
Note over E: sleep 50–200ms (jittered, configurable.<br/>MAX_MS=0 skips straight to primary)
E->>RO: retry findFirst(snap_X)
alt replica caught up (common case)
RO-->>E: ✓ snapshots
Note over E: counter++ outcome=replica_retry (quiet)
E-->>R: 200 + snapshots
else lag outlasts the retry window
RO-->>E: ✗ still missing
E->>P: full query on the writer
P-->>E: ✓ snapshots
Note over E: counter++ outcome=primary + WARN (writer spill)
E-->>R: 200 + snapshots
else permanent miss (bogus / pruned id, or wrong run)
RO-->>E: ✗ still missing
E->>P: full query on the writer
P-->>E: ✗ not found
Note over E: ERROR log — unchanged from today
E-->>R: 500
end
Cost per lag hit: ≤ ~200ms added latency; the writer is only touched when lag outlasts the retry window. Invariants
|
…eplica retry Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
@d-cs probably doesn't need 2 server changes files 👍 |
…entry Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…the writer fallback Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Combined into a single server-changes entry in 51924bb. |
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
When
RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLEDis on,RunEngine.getSnapshotsSincereads from the read replica. During write spikes the replica can briefly lag, so the snapshot id a runner just learned from the writer isn't visible there yet: the lookup threw, the worker route returned a 500, and the runner waited for its next poll — turning sub-second snapshot notifications into poll-interval latency exactly when things are busiest. This PR makes the flag safe to enable: a replica miss of the since snapshot gets one jittered retry on the replica (most lag windows are shorter than the ~50–200ms wait, so the writer is never touched), then falls back to the primary, observed via a newrun_engine.snapshots_since.replica_misscounter with anoutcomeattribute (replica_retryvsprimary). Only genuine misses — absent on the primary too — remain errors.Design
getExecutionSnapshotsSincenow throws a typedExecutionSnapshotNotFoundErrorso the engine can distinguish the expected lag miss from real failures. The message string is unchanged and the error never leaves the engine.RUN_ENGINE_SNAPSHOTS_SINCE_REPLICA_RETRY_MIN_MS/MAX_MS, default 50/200;MAX_MS=0skips the replica retry and goes straight to the primary).failedDuringfield, so lag metrics aren't polluted by bogus ids.where: { id, runId }), so a snapshot id from a different run raises not-found instead of silently anchoring a too-wide window of the run's snapshots.Test plan
All vitest + testcontainers, no mocks. A new
schemaOnlyPrismafixture (migrated-but-empty clone database) simulates a replica that hasn't caught up, and a real in-memory OTel meter pins the counter semantics per outcome.outcome=replica_retry= 1, primary never consultedoutcome=primary= 1🤖 Generated with Claude Code