feat: add automatic database migration recovery to scripts/develop by mafredri · Pull Request #23466 · coder/coder

mafredri · 2026-03-23T22:11:11Z

🤖 Agent on behalf of mafredri

Problem

When switching branches, the database may have migrations from the other branch. coder server calls EnsureClean() which fails because the DB version is ahead of what the current binary knows. The server exits. The developer's test data (workspaces, custom templates, provisioner configs built up over days/weeks) is at risk — the only existing fix is rm -rf .coderv2/postgres.

Design

Two data stores work together:

Cache file (migration_cache.json) — a JSON map of {version: filename} for every migration that has been applied. Lives next to the postgres data directory. Its only job is answering "did any previously-applied migration disappear from disk?" without starting postgres. Read on every startup, written after the server is healthy.

DB tracking table (_develop.applied_migrations) — stores {version, filename, down_sql} for every applied migration. Lives in a _develop schema in the database. Contains the actual rollback SQL. Only accessed when recovery is needed (temp postgres started) or during the post-health-check capture step.

Conflict detection (cache vs disk)

The cache records what was on disk the last time the server ran successfully. On startup, each cache entry is checked against the migration directory:

File still exists on disk with same name — no conflict for this version.
File missing from disk — conflict. This migration was applied but the current branch doesn't have it.
New files on disk not in cache — not a conflict. These are forward migrations the server will apply via Up().
No cache file at all — not a conflict (first run, or after --db-reset). The server starts normally and the cache is written after the health check.

The cache only detects that something changed. It doesn't attempt recovery — it exits with an error telling the developer to choose --db-rollback or --db-reset.

Common ancestor detection (tracking table vs disk)

When --db-rollback is used, the tracking table determines what to undo. The common ancestor is the highest version where the tracking table and disk agree:

Tracking table (from branch A run):
  445: 000445_chat_runtime.down.sql     ← on disk ✓
  446: 000446_alpha_feature.down.sql    ← MISSING from disk
  447: 000447_alpha_indexes.down.sql    ← MISSING from disk
  448: 000448_alpha_cleanup.down.sql    ← MISSING from disk

schema_migrations: version=448

Common ancestor: 445
Rollback: 448 → 447 → 446 (using stored down SQL)
Result: schema_migrations set to 445, server applies branch B's migrations via Up()

Rollback only proceeds if the missing versions form a contiguous block from the top. Gaps (e.g. 448 and 446 missing but 447 present) indicate version number reuse across branches — not safely rollable, points at --db-reset.

Untracked migrations (schema_migrations version > max tracked version) mean someone ran coder server outside develop.sh. No down SQL stored — points at --db-reset.

State machine

START
  │
  ├─ --db-reset ──→ wipe ──→ server starts fresh
  │
  ▼
read cache, compare disk
  │
  ├─ no cache ──────→ server starts ──→ health check ──→ capture + write cache
  ├─ cache clean ───→ server starts ──→ health check ──→ update cache
  └─ cache conflict ─→ EXIT: "use --db-rollback or --db-reset"

--db-rollback (from conflict):
  │
  start temp postgres ──→ rollbackMigrations()
  │
  ├─ dirty DB? ──────────────────→ EXIT: "use --db-reset"
  ├─ untracked migrations? ──────→ EXIT: "use --db-reset"
  ├─ non-contiguous rollbacks? ──→ EXIT: "use --db-reset"
  │
  └─ apply down SQL per version (each in its own txn)
     ──→ capture + write cache ──→ server starts

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

johnstcn · 2026-03-24T10:06:55Z

Deep Review: `feat(scripts/develop): add automatic database migration recovery`

Codepath Diagram

flowchart TD
    develop["develop() main.go:358"] --> buildBinary["buildBinary()
    ⚠️ full build runs BEFORE recovery check"]
    buildBinary --> recoverDB["recoverDB(ctx, logger, cfg)"]

    recoverDB --> pgCheck{"CODER_PG_CONNECTION_URL
    set?"}

    %% ── Built-in PG path ──
    pgCheck -- "empty" --> handleBuiltin["handleBuiltinPG()"]
    handleBuiltin --> dataExists{"dataDir
    exists?"}
    dataExists -- "no" --> freshInstall["return nil
    (fresh install) ✓"]
    dataExists -- "yes" --> builtinReset{"cfg.dbReset?"}
    builtinReset -- "yes" --> removeDir["os.RemoveAll(pgDir)
    return nil ✓"]
    builtinReset -- "no" --> cacheMismatch1["cacheShowsMismatch()"]

    cacheMismatch1 --> mismatch1{"mismatch?"}
    mismatch1 -- "no" --> ok1["return nil ✓"]
    mismatch1 -- "yes" --> rollbackFlag1{"cfg.dbRollback?"}
    rollbackFlag1 -- "no" --> err1["❌ ERROR: use
    --db-rollback or --db-reset
    ⚠️ P0: triggers on first-ever run"]
    rollbackFlag1 -- "yes" --> startPG["startTempPostgres()
    cleanStalePIDFile()
    ⚠️ P4: ephemeral port TOCTOU"]
    startPG --> connectBuiltin["connectDB()"]
    connectBuiltin --> sync1["syncAndRecover()"]

    %% ── External PG path ──
    pgCheck -- "set" --> handleExternal["handleExternalPG()"]
    handleExternal --> connectExt["connectDB()"]
    connectExt --> extReset{"cfg.dbReset?"}
    extReset -- "yes" --> resetSchema["resetSchema()
    ⚠️ P1: doesn't drop _develop schema
    ⚠️ P1: no interactive confirmation"]
    extReset -- "no" --> cacheMismatch2["cacheShowsMismatch()"]

    cacheMismatch2 --> mismatch2{"mismatch?"}
    mismatch2 -- "no" --> ok2["return nil ✓"]
    mismatch2 -- "yes" --> rollbackFlag2{"cfg.dbRollback?"}
    rollbackFlag2 -- "no" --> err2["❌ ERROR: use
    --db-rollback or --db-reset"]
    rollbackFlag2 -- "yes" --> sync2["syncAndRecover()"]

    %% ── syncAndRecover ──
    sync1 & sync2 --> syncEntry["syncAndRecover()"]
    syncEntry --> createDDL["CREATE SCHEMA _develop
    + tracking table"]
    createDDL --> getVersion["currentMigrationVersion()
    ⚠️ P1: ignores dirty flag"]
    getVersion --> freshDB{"version < 0?"}
    freshDB -- "yes" --> freshReturn["return nil
    (fresh DB) ✓"]
    freshDB -- "no" --> verify["verifyDBCacheConsistency()
    ⚠️ P2: one-directional check"]
    verify --> findRB["findRollbacks()
    ⚠️ P2: os.Stat errors = false negative"]
    findRB --> rbCount{"rollbacks
    found?"}
    rbCount -- "none" --> capture1["captureDownSQL()
    + writeCache()
    (first-run bootstrap)
    ⚠️ P2: writeCache swallows errors"]
    rbCount -- "some" --> contig{"contiguousFromTop()?"}
    contig -- "no" --> errContig["❌ ERROR: not contiguous,
    use --db-reset"]
    contig -- "yes" --> loop["for each rollback
    (desc version order)"]
    loop --> applyRB["applyRollback()"]

    applyRB --> tx["BEGIN TX"]
    tx --> execDown["exec rb.downSQL"]
    execDown --> truncate["TRUNCATE schema_migrations"]
    truncate --> insertVer["INSERT version-1"]
    insertVer --> deleteTrack["DELETE FROM
    _develop.applied_migrations"]
    deleteTrack --> commit["COMMIT"]
    commit --> applyFail{"success?"}
    applyFail -- "no" --> errApply["❌ ERROR:
    use --db-reset"]
    applyFail -- "yes" --> moreRB{"more
    rollbacks?"}
    moreRB -- "yes" --> loop
    moreRB -- "no" --> reread["currentMigrationVersion()
    (re-read)"]
    reread --> capture2["captureDownSQL()
    + writeCache() ✓"]

    %% ── cacheShowsMismatch detail ──
    subgraph cacheShowsMismatch
        direction TB
        readC["readCache(cachePath)"] --> cacheErr{"error?"}
        cacheErr -- "yes" --> noCache["return true,
        'no migration cache file'
        ⚠️ P0: first run trap"]
        cacheErr -- "no" --> iterCache["for each cached entry:
        os.Stat(filename)"]
        iterCache --> fileMissing{"file missing
        on disk?"}
        fileMissing -- "yes" --> mismatchTrue["return true, details"]
        fileMissing -- "no" --> allPresent["return false, '' ✓"]
    end

    style err1 fill:#ff6b6b,color:#fff
    style err2 fill:#ff6b6b,color:#fff
    style errContig fill:#ff6b6b,color:#fff
    style errApply fill:#ff6b6b,color:#fff
    style noCache fill:#ff6b6b,color:#fff
    style resetSchema fill:#ffaa44,color:#fff
    style freshInstall fill:#51cf66,color:#fff
    style ok1 fill:#51cf66,color:#fff
    style ok2 fill:#51cf66,color:#fff
    style freshReturn fill:#51cf66,color:#fff
    style capture1 fill:#51cf66,color:#fff
    style capture2 fill:#51cf66,color:#fff
    style removeDir fill:#51cf66,color:#fff

Issues

#	Sev	Category	Issue
1	P0	Bootstrapping	First run with existing DB is a hard error — every developer breaks on upgrade
2	P1	UX	Recovery check runs after `buildBinary()` — two full build cycles on mismatch
3	P1	Safety	`handleExternalPG` + `--db-reset` runs `DROP SCHEMA CASCADE` on arbitrary external DB with no confirmation
4	P1	Correctness	`resetSchema` doesn't drop `_develop` schema — stale tracking survives `--db-reset`
5	P1	Correctness	`currentMigrationVersion` ignores the `dirty` flag — rollback against half-applied schema
6	P1	Detection gap	Cache only written during recovery — migrations applied by server between runs are invisible
7	P1	Missing feature	`--skip-db-recovery` mentioned in PR description is not implemented
8	P2	Reliability	`writeCache` silently swallows all errors → phantom mismatch on next run
9	P2	Correctness	`verifyDBCacheConsistency` is one-directional (DB→cache), misses cache-side corruption
10	P2	Correctness	`os.Stat` errors other than `NotExist` treated as "file exists" — false negative in `findRollbacks` and `cacheShowsMismatch`
11	P2	Correctness	Password not URL-encoded in `startTempPostgres` connection string — breaks on special chars
12	P2	Consistency	Embedded PG config missing `BinaryRepositoryURL` (flaky Maven mirror) and `Encoding("UTF8")` vs `cli/server.go`
13	P2	Testing	Only `cleanStalePIDFile` tested (~3% coverage) — core algorithm has zero tests
14	P2	Safety	No guard against `_develop` schema being created in a production DB via `CODER_PG_CONNECTION_URL`
15	P3	Fragility	6-char prefix parsing (`name[:6]`) — breaks if convention changes, doesn't match golang-migrate parsing
16	P3	Consistency	Tracking table uses `INTEGER` but `schema_migrations` uses `BIGINT`
17	P3	Consistency	Manual password file read vs. existing `config.Root().PostgresPassword().Read()` helper
18	P3	DX	Error message alignment inconsistent; multi-layer wrapping buries actionable text
19	P3	DX	No documentation, no explanation of `migration_cache.json`
20	P3	Code quality	Dead code: `len(name) < 7` can never trigger (`.down.sql` suffix guarantees `≥ 9`)
21	P4	Robustness	Ephemeral port TOCTOU race between `listener.Close()` and `ep.Start()`
22	P4	DX	`Logger(nil)` on embedded PG silences diagnostics vs. production `Logger(writer)` pattern
23	P4	Safety	PID reuse can cause false-positive stale PID detection

Key Details

#1 (P0) — First-run bootstrap breaks every existing dev. When no migration_cache.json exists, cacheShowsMismatch returns true. Both handleBuiltinPG and handleExternalPG then error out demanding --db-rollback. Nothing is actually wrong with the DB. Fix: when no cache exists but the DB is healthy, auto-bootstrap the cache silently.

#2 (P1) — Two-build-cycle penalty. recoverDB() is called after buildBinary() (main.go line 358). A developer hits the mismatch error after waiting for a full make -j build, then must re-run with --db-rollback triggering another full build. Either move recoverDB() before buildBinary(), or prompt interactively.

#3 (P1) — Unguarded DROP SCHEMA CASCADE on external databases. handleExternalPG + --db-reset executes DROP SCHEMA public CASCADE on whatever CODER_PG_CONNECTION_URL points at — no confirmation, no safety check. If that URL points at a shared staging database (or worse, accidentally production), it's silently wiped. The built-in PG path is fine (os.RemoveAll on a local directory), but the external path needs an interactive confirmation prompt (e.g., "You are about to DROP all schemas in $HOST/$DBNAME. Type the database name to confirm:"), or at minimum refuse to reset an external DB without an explicit --force flag.

#4 (P1) — resetSchema doesn't drop _develop. After --db-reset, _develop.applied_migrations survives with stale entries. Next syncAndRecover sees a tracking table that disagrees with the freshly-empty schema_migrations. Add DROP SCHEMA IF EXISTS _develop CASCADE.

#5 (P1) — dirty flag ignored. currentMigrationVersion reads schema_migrations but ignores the dirty column. If the DB is dirty (half-applied migration), executing stored down_sql against an inconsistent schema will fail unpredictably or make things worse. Should SELECT version, dirty and refuse to recover when dirty, advising --db-reset.

#6 (P1) — Cache goes stale between runs. The cache is only written inside syncAndRecover(). Migrations applied by the server during normal operation are never captured. Scenario: developer bootstraps (cache captures v1–448), switches to feature branch (server applies v449), switches back (v449's file gone), runs develop — cache only knows v1–448, reports no mismatch. Server then fails at EnsureClean(). This is exactly the scenario the feature is meant to prevent.

#13 (P2) — ~3% test coverage. 47 lines of tests for 587 lines of schema-modifying code. The only tested function (cleanStalePIDFile) is a minor utility. Pure functions like contiguousFromTop, cacheShowsMismatch, readCache/writeCache are trivially unit-testable. The P0 first-run bug would have been caught by TestCacheShowsMismatch_NoCacheFile.

Questions

Is there a plan to update the cache after normal server startup (not just during recovery)? Without this, finding # 6 means the feature only catches mismatches from the initial snapshot.
The PR description mentions --skip-db-recovery but it's not implemented. Intentional or oversight?
Why cache-based detection over always querying the DB? For external PG, a direct query is cheap. For built-in PG, the cache avoids starting temp PG, but staleness seems like a worse trade-off.

Suggested Validation Plan

# 1. Verify first-run experience (P0 regression)
rm -rf .coderv2/postgres/migration_cache.json
go run ./scripts/develop/  # Should NOT error for existing devs

# 2. Full branch-switch scenario
go run ./scripts/develop/ --db-rollback   # bootstrap
git checkout feature-branch-with-new-migration
go run ./scripts/develop/                 # server applies migration
git checkout main                         # migration file gone
go run ./scripts/develop/                 # MUST detect mismatch

# 3. --db-reset drops _develop schema
CODER_PG_CONNECTION_URL=... go run ./scripts/develop/ --db-reset
psql $CODER_PG_CONNECTION_URL -c "SELECT * FROM _develop.applied_migrations"  # should fail

# 4. Dirty DB handling
psql -c "UPDATE schema_migrations SET dirty = true"
go run ./scripts/develop/ --db-rollback   # should refuse, advise --db-reset

# 5. Mutual exclusion
go run ./scripts/develop/ --db-rollback --db-reset  # should error

🤖 This review was generated by Coder Agents.

mafredri · 2026-03-24T11:11:27Z

🤖 Agent on behalf of mafredri

Thanks for the thorough review. Addressed the findings:

Already fixed before this review (from earlier review rounds):

(1, P0) first-run bootstrap — handled via firstAdoption silent path
(4, P1) resetSchema not dropping _develop — already drops both schemas
(5, P1) dirty flag ignored — reads dirty, refuses if true, points at --db-reset
(8, P2) writeCache error swallowing — returns and propagates errors
(9, P2) one-directional consistency check — checks both directions
(11, P2) password encoding — uses url.QueryEscape

Accepted and fixed in this push:

(2, P1) moved recoverDB() before buildBinary() — no wasted build cycle
(3, P1) external PG --db-reset now requires CODER_DEV_DB_RESET=1 env confirmation
(6, P1) cache staleness — added updateMigrationTracking() after server health check, captures new migrations and updates cache on every run
(10, P2) os.Stat error handling — any error (not just NotExist) now triggers mismatch/rollback
(16, P3) INTEGER → BIGINT to match schema_migrations

Accepted, not fixing now:

(7) --skip-db-recovery removed by design (exit-on-mismatch replaces it) — updated PR description
(12) missing BinaryRepositoryURL — only matters if binaries not cached (rare), tracking as follow-up
(13) test coverage — fair point, will add in follow-up
(17) manual password read vs config helper — config helper is in cli/config which pulls in more deps; manual read is intentional

Pushed back on:

(14) _develop on production — this is a dev tool run via develop.sh, not coder server. Developer explicitly set the PG URL. Adding a guard would be checking whether the developer meant to do what they explicitly told the tool to do
(15) 6-char prefix — 448 migrations strong, enforced by the project's own naming convention. A regex adds complexity for a hypothetical future change
(20) dead code len(name) < 7 — it guards name[:6] from panicking on short filenames that happen to end in .down.sql. Not dead, defensive

Watching the CI go green from the Finnish archipelago. Hyvää yötä! 🌊

coder-tasks · 2026-03-24T13:05:33Z

Documentation Check

Updates Needed

docs/about/contributing/CONTRIBUTING.md - The Troubleshooting section (and/or the "Running Coder on development mode" section) should document the new --db-rollback / --db-reset flags and the migration-mismatch scenario (e.g., switching git branches leaves the DB at a version the current branch doesn't know about). Developers who hit this error will look here first. A short note pointing to these flags is enough — the error message itself explains the choice.

Automated review via Coder Tasks

When developers switch branches, the database may have migrations from the other branch that don't exist in the current binary. This causes coder server to fail at startup, leaving developers stuck. The develop script now detects this before starting the server: 1. Connects to postgres (starts temp embedded instance for built-in postgres, or uses CODER_PG_CONNECTION_URL). 2. Compares DB version against the source's latest migration. 3. If DB is ahead, searches git history for the missing down SQL files and applies them in a transaction. 4. If git recovery fails (ambiguous versions across branches, missing files), falls back to resetting the public schema. Also adds --reset-db and --skip-db-recovery flags.

github-actions Bot assigned mafredri Mar 23, 2026

mafredri force-pushed the develop-db-recovery branch 9 times, most recently from 8c19c94 to 145ead6 Compare March 24, 2026 09:28

mafredri force-pushed the develop-db-recovery branch 3 times, most recently from febf4ce to 196e2b0 Compare March 24, 2026 11:11

mafredri force-pushed the develop-db-recovery branch from 196e2b0 to 664ef62 Compare March 24, 2026 11:35

johnstcn approved these changes Mar 24, 2026

View reviewed changes

mafredri marked this pull request as ready for review March 24, 2026 13:02

mafredri force-pushed the develop-db-recovery branch 3 times, most recently from 1009162 to 540243f Compare March 24, 2026 17:28

mafredri changed the title ~~feat(scripts/develop): add automatic database migration recovery~~ feat: add automatic database migration recovery to scripts/develop Mar 24, 2026

mafredri force-pushed the develop-db-recovery branch 2 times, most recently from 2dd6a24 to 105a427 Compare March 24, 2026 18:55

mafredri force-pushed the develop-db-recovery branch from 105a427 to 4d96462 Compare March 24, 2026 19:31

mafredri merged commit 78b18e7 into main Mar 24, 2026
28 checks passed

mafredri deleted the develop-db-recovery branch March 24, 2026 20:05

github-actions Bot locked and limited conversation to collaborators Mar 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add automatic database migration recovery to scripts/develop#23466

feat: add automatic database migration recovery to scripts/develop#23466
mafredri merged 1 commit into
mainfrom
develop-db-recovery

mafredri commented Mar 23, 2026 •

edited

Loading

Uh oh!

johnstcn commented Mar 24, 2026 •

edited

Loading

Uh oh!

mafredri commented Mar 24, 2026 •

edited

Loading

Uh oh!

coder-tasks Bot commented Mar 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mafredri commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Design

Conflict detection (cache vs disk)

Common ancestor detection (tracking table vs disk)

State machine

Uh oh!

johnstcn commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deep Review: feat(scripts/develop): add automatic database migration recovery

Codepath Diagram

Issues

Key Details

Questions

Suggested Validation Plan

Uh oh!

mafredri commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coder-tasks Bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Documentation Check

Updates Needed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mafredri commented Mar 23, 2026 •

edited

Loading

johnstcn commented Mar 24, 2026 •

edited

Loading

Deep Review: `feat(scripts/develop): add automatic database migration recovery`

mafredri commented Mar 24, 2026 •

edited

Loading

coder-tasks Bot commented Mar 24, 2026 •

edited

Loading