Skip to content

feat: add automatic database migration recovery to scripts/develop#23466

Merged
mafredri merged 1 commit into
mainfrom
develop-db-recovery
Mar 24, 2026
Merged

feat: add automatic database migration recovery to scripts/develop#23466
mafredri merged 1 commit into
mainfrom
develop-db-recovery

Conversation

@mafredri

@mafredri mafredri commented Mar 23, 2026

Copy link
Copy Markdown
Member

🤖 Agent on behalf of mafredri

Problem

When switching branches, the database may have migrations from the other branch. coder server calls EnsureClean() which fails because the DB version is ahead of what the current binary knows. The server exits. The developer's test data (workspaces, custom templates, provisioner configs built up over days/weeks) is at risk — the only existing fix is rm -rf .coderv2/postgres.

Design

Two data stores work together:

Cache file (migration_cache.json) — a JSON map of {version: filename} for every migration that has been applied. Lives next to the postgres data directory. Its only job is answering "did any previously-applied migration disappear from disk?" without starting postgres. Read on every startup, written after the server is healthy.

DB tracking table (_develop.applied_migrations) — stores {version, filename, down_sql} for every applied migration. Lives in a _develop schema in the database. Contains the actual rollback SQL. Only accessed when recovery is needed (temp postgres started) or during the post-health-check capture step.

Conflict detection (cache vs disk)

The cache records what was on disk the last time the server ran successfully. On startup, each cache entry is checked against the migration directory:

  • File still exists on disk with same name — no conflict for this version.
  • File missing from disk — conflict. This migration was applied but the current branch doesn't have it.
  • New files on disk not in cache — not a conflict. These are forward migrations the server will apply via Up().
  • No cache file at all — not a conflict (first run, or after --db-reset). The server starts normally and the cache is written after the health check.

The cache only detects that something changed. It doesn't attempt recovery — it exits with an error telling the developer to choose --db-rollback or --db-reset.

Common ancestor detection (tracking table vs disk)

When --db-rollback is used, the tracking table determines what to undo. The common ancestor is the highest version where the tracking table and disk agree:

Tracking table (from branch A run):
  445: 000445_chat_runtime.down.sql     ← on disk ✓
  446: 000446_alpha_feature.down.sql    ← MISSING from disk
  447: 000447_alpha_indexes.down.sql    ← MISSING from disk
  448: 000448_alpha_cleanup.down.sql    ← MISSING from disk

schema_migrations: version=448

Common ancestor: 445
Rollback: 448 → 447 → 446 (using stored down SQL)
Result: schema_migrations set to 445, server applies branch B's migrations via Up()

Rollback only proceeds if the missing versions form a contiguous block from the top. Gaps (e.g. 448 and 446 missing but 447 present) indicate version number reuse across branches — not safely rollable, points at --db-reset.

Untracked migrations (schema_migrations version > max tracked version) mean someone ran coder server outside develop.sh. No down SQL stored — points at --db-reset.

State machine

START
  │
  ├─ --db-reset ──→ wipe ──→ server starts fresh
  │
  ▼
read cache, compare disk
  │
  ├─ no cache ──────→ server starts ──→ health check ──→ capture + write cache
  ├─ cache clean ───→ server starts ──→ health check ──→ update cache
  └─ cache conflict ─→ EXIT: "use --db-rollback or --db-reset"

--db-rollback (from conflict):
  │
  start temp postgres ──→ rollbackMigrations()
  │
  ├─ dirty DB? ──────────────────→ EXIT: "use --db-reset"
  ├─ untracked migrations? ──────→ EXIT: "use --db-reset"
  ├─ non-contiguous rollbacks? ──→ EXIT: "use --db-reset"
  │
  └─ apply down SQL per version (each in its own txn)
     ──→ capture + write cache ──→ server starts

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

@mafredri mafredri force-pushed the develop-db-recovery branch 9 times, most recently from 8c19c94 to 145ead6 Compare March 24, 2026 09:28

johnstcn commented Mar 24, 2026

Copy link
Copy Markdown
Member

Deep Review: feat(scripts/develop): add automatic database migration recovery

Codepath Diagram

flowchart TD
    develop["develop() main.go:358"] --> buildBinary["buildBinary()
    ⚠️ full build runs BEFORE recovery check"]
    buildBinary --> recoverDB["recoverDB(ctx, logger, cfg)"]

    recoverDB --> pgCheck{"CODER_PG_CONNECTION_URL
    set?"}

    %% ── Built-in PG path ──
    pgCheck -- "empty" --> handleBuiltin["handleBuiltinPG()"]
    handleBuiltin --> dataExists{"dataDir
    exists?"}
    dataExists -- "no" --> freshInstall["return nil
    (fresh install) ✓"]
    dataExists -- "yes" --> builtinReset{"cfg.dbReset?"}
    builtinReset -- "yes" --> removeDir["os.RemoveAll(pgDir)
    return nil ✓"]
    builtinReset -- "no" --> cacheMismatch1["cacheShowsMismatch()"]

    cacheMismatch1 --> mismatch1{"mismatch?"}
    mismatch1 -- "no" --> ok1["return nil ✓"]
    mismatch1 -- "yes" --> rollbackFlag1{"cfg.dbRollback?"}
    rollbackFlag1 -- "no" --> err1["❌ ERROR: use
    --db-rollback or --db-reset
    ⚠️ P0: triggers on first-ever run"]
    rollbackFlag1 -- "yes" --> startPG["startTempPostgres()
    cleanStalePIDFile()
    ⚠️ P4: ephemeral port TOCTOU"]
    startPG --> connectBuiltin["connectDB()"]
    connectBuiltin --> sync1["syncAndRecover()"]

    %% ── External PG path ──
    pgCheck -- "set" --> handleExternal["handleExternalPG()"]
    handleExternal --> connectExt["connectDB()"]
    connectExt --> extReset{"cfg.dbReset?"}
    extReset -- "yes" --> resetSchema["resetSchema()
    ⚠️ P1: doesn't drop _develop schema
    ⚠️ P1: no interactive confirmation"]
    extReset -- "no" --> cacheMismatch2["cacheShowsMismatch()"]

    cacheMismatch2 --> mismatch2{"mismatch?"}
    mismatch2 -- "no" --> ok2["return nil ✓"]
    mismatch2 -- "yes" --> rollbackFlag2{"cfg.dbRollback?"}
    rollbackFlag2 -- "no" --> err2["❌ ERROR: use
    --db-rollback or --db-reset"]
    rollbackFlag2 -- "yes" --> sync2["syncAndRecover()"]

    %% ── syncAndRecover ──
    sync1 & sync2 --> syncEntry["syncAndRecover()"]
    syncEntry --> createDDL["CREATE SCHEMA _develop
    + tracking table"]
    createDDL --> getVersion["currentMigrationVersion()
    ⚠️ P1: ignores dirty flag"]
    getVersion --> freshDB{"version < 0?"}
    freshDB -- "yes" --> freshReturn["return nil
    (fresh DB) ✓"]
    freshDB -- "no" --> verify["verifyDBCacheConsistency()
    ⚠️ P2: one-directional check"]
    verify --> findRB["findRollbacks()
    ⚠️ P2: os.Stat errors = false negative"]
    findRB --> rbCount{"rollbacks
    found?"}
    rbCount -- "none" --> capture1["captureDownSQL()
    + writeCache()
    (first-run bootstrap)
    ⚠️ P2: writeCache swallows errors"]
    rbCount -- "some" --> contig{"contiguousFromTop()?"}
    contig -- "no" --> errContig["❌ ERROR: not contiguous,
    use --db-reset"]
    contig -- "yes" --> loop["for each rollback
    (desc version order)"]
    loop --> applyRB["applyRollback()"]

    applyRB --> tx["BEGIN TX"]
    tx --> execDown["exec rb.downSQL"]
    execDown --> truncate["TRUNCATE schema_migrations"]
    truncate --> insertVer["INSERT version-1"]
    insertVer --> deleteTrack["DELETE FROM
    _develop.applied_migrations"]
    deleteTrack --> commit["COMMIT"]
    commit --> applyFail{"success?"}
    applyFail -- "no" --> errApply["❌ ERROR:
    use --db-reset"]
    applyFail -- "yes" --> moreRB{"more
    rollbacks?"}
    moreRB -- "yes" --> loop
    moreRB -- "no" --> reread["currentMigrationVersion()
    (re-read)"]
    reread --> capture2["captureDownSQL()
    + writeCache() ✓"]

    %% ── cacheShowsMismatch detail ──
    subgraph cacheShowsMismatch
        direction TB
        readC["readCache(cachePath)"] --> cacheErr{"error?"}
        cacheErr -- "yes" --> noCache["return true,
        'no migration cache file'
        ⚠️ P0: first run trap"]
        cacheErr -- "no" --> iterCache["for each cached entry:
        os.Stat(filename)"]
        iterCache --> fileMissing{"file missing
        on disk?"}
        fileMissing -- "yes" --> mismatchTrue["return true, details"]
        fileMissing -- "no" --> allPresent["return false, '' ✓"]
    end

    style err1 fill:#ff6b6b,color:#fff
    style err2 fill:#ff6b6b,color:#fff
    style errContig fill:#ff6b6b,color:#fff
    style errApply fill:#ff6b6b,color:#fff
    style noCache fill:#ff6b6b,color:#fff
    style resetSchema fill:#ffaa44,color:#fff
    style freshInstall fill:#51cf66,color:#fff
    style ok1 fill:#51cf66,color:#fff
    style ok2 fill:#51cf66,color:#fff
    style freshReturn fill:#51cf66,color:#fff
    style capture1 fill:#51cf66,color:#fff
    style capture2 fill:#51cf66,color:#fff
    style removeDir fill:#51cf66,color:#fff
Loading

Issues

# Sev Category Issue
1 P0 Bootstrapping First run with existing DB is a hard error — every developer breaks on upgrade
2 P1 UX Recovery check runs after buildBinary() — two full build cycles on mismatch
3 P1 Safety handleExternalPG + --db-reset runs DROP SCHEMA CASCADE on arbitrary external DB with no confirmation
4 P1 Correctness resetSchema doesn't drop _develop schema — stale tracking survives --db-reset
5 P1 Correctness currentMigrationVersion ignores the dirty flag — rollback against half-applied schema
6 P1 Detection gap Cache only written during recovery — migrations applied by server between runs are invisible
7 P1 Missing feature --skip-db-recovery mentioned in PR description is not implemented
8 P2 Reliability writeCache silently swallows all errors → phantom mismatch on next run
9 P2 Correctness verifyDBCacheConsistency is one-directional (DB→cache), misses cache-side corruption
10 P2 Correctness os.Stat errors other than NotExist treated as "file exists" — false negative in findRollbacks and cacheShowsMismatch
11 P2 Correctness Password not URL-encoded in startTempPostgres connection string — breaks on special chars
12 P2 Consistency Embedded PG config missing BinaryRepositoryURL (flaky Maven mirror) and Encoding("UTF8") vs cli/server.go
13 P2 Testing Only cleanStalePIDFile tested (~3% coverage) — core algorithm has zero tests
14 P2 Safety No guard against _develop schema being created in a production DB via CODER_PG_CONNECTION_URL
15 P3 Fragility 6-char prefix parsing (name[:6]) — breaks if convention changes, doesn't match golang-migrate parsing
16 P3 Consistency Tracking table uses INTEGER but schema_migrations uses BIGINT
17 P3 Consistency Manual password file read vs. existing config.Root().PostgresPassword().Read() helper
18 P3 DX Error message alignment inconsistent; multi-layer wrapping buries actionable text
19 P3 DX No documentation, no explanation of migration_cache.json
20 P3 Code quality Dead code: len(name) < 7 can never trigger (.down.sql suffix guarantees ≥ 9)
21 P4 Robustness Ephemeral port TOCTOU race between listener.Close() and ep.Start()
22 P4 DX Logger(nil) on embedded PG silences diagnostics vs. production Logger(writer) pattern
23 P4 Safety PID reuse can cause false-positive stale PID detection

Key Details

#1 (P0) — First-run bootstrap breaks every existing dev. When no migration_cache.json exists, cacheShowsMismatch returns true. Both handleBuiltinPG and handleExternalPG then error out demanding --db-rollback. Nothing is actually wrong with the DB. Fix: when no cache exists but the DB is healthy, auto-bootstrap the cache silently.

#2 (P1) — Two-build-cycle penalty. recoverDB() is called after buildBinary() (main.go line 358). A developer hits the mismatch error after waiting for a full make -j build, then must re-run with --db-rollback triggering another full build. Either move recoverDB() before buildBinary(), or prompt interactively.

#3 (P1) — Unguarded DROP SCHEMA CASCADE on external databases. handleExternalPG + --db-reset executes DROP SCHEMA public CASCADE on whatever CODER_PG_CONNECTION_URL points at — no confirmation, no safety check. If that URL points at a shared staging database (or worse, accidentally production), it's silently wiped. The built-in PG path is fine (os.RemoveAll on a local directory), but the external path needs an interactive confirmation prompt (e.g., "You are about to DROP all schemas in $HOST/$DBNAME. Type the database name to confirm:"), or at minimum refuse to reset an external DB without an explicit --force flag.

#4 (P1) — resetSchema doesn't drop _develop. After --db-reset, _develop.applied_migrations survives with stale entries. Next syncAndRecover sees a tracking table that disagrees with the freshly-empty schema_migrations. Add DROP SCHEMA IF EXISTS _develop CASCADE.

#5 (P1) — dirty flag ignored. currentMigrationVersion reads schema_migrations but ignores the dirty column. If the DB is dirty (half-applied migration), executing stored down_sql against an inconsistent schema will fail unpredictably or make things worse. Should SELECT version, dirty and refuse to recover when dirty, advising --db-reset.

#6 (P1) — Cache goes stale between runs. The cache is only written inside syncAndRecover(). Migrations applied by the server during normal operation are never captured. Scenario: developer bootstraps (cache captures v1–448), switches to feature branch (server applies v449), switches back (v449's file gone), runs develop — cache only knows v1–448, reports no mismatch. Server then fails at EnsureClean(). This is exactly the scenario the feature is meant to prevent.

#13 (P2) — ~3% test coverage. 47 lines of tests for 587 lines of schema-modifying code. The only tested function (cleanStalePIDFile) is a minor utility. Pure functions like contiguousFromTop, cacheShowsMismatch, readCache/writeCache are trivially unit-testable. The P0 first-run bug would have been caught by TestCacheShowsMismatch_NoCacheFile.


Questions

  1. Is there a plan to update the cache after normal server startup (not just during recovery)? Without this, finding # 6 means the feature only catches mismatches from the initial snapshot.
  2. The PR description mentions --skip-db-recovery but it's not implemented. Intentional or oversight?
  3. Why cache-based detection over always querying the DB? For external PG, a direct query is cheap. For built-in PG, the cache avoids starting temp PG, but staleness seems like a worse trade-off.

Suggested Validation Plan

# 1. Verify first-run experience (P0 regression)
rm -rf .coderv2/postgres/migration_cache.json
go run ./scripts/develop/  # Should NOT error for existing devs

# 2. Full branch-switch scenario
go run ./scripts/develop/ --db-rollback   # bootstrap
git checkout feature-branch-with-new-migration
go run ./scripts/develop/                 # server applies migration
git checkout main                         # migration file gone
go run ./scripts/develop/                 # MUST detect mismatch

# 3. --db-reset drops _develop schema
CODER_PG_CONNECTION_URL=... go run ./scripts/develop/ --db-reset
psql $CODER_PG_CONNECTION_URL -c "SELECT * FROM _develop.applied_migrations"  # should fail

# 4. Dirty DB handling
psql -c "UPDATE schema_migrations SET dirty = true"
go run ./scripts/develop/ --db-rollback   # should refuse, advise --db-reset

# 5. Mutual exclusion
go run ./scripts/develop/ --db-rollback --db-reset  # should error

🤖 This review was generated by Coder Agents.

@mafredri mafredri force-pushed the develop-db-recovery branch 3 times, most recently from febf4ce to 196e2b0 Compare March 24, 2026 11:11

mafredri commented Mar 24, 2026

Copy link
Copy Markdown
Member Author

🤖 Agent on behalf of mafredri

Thanks for the thorough review. Addressed the findings:

Already fixed before this review (from earlier review rounds):

  • (1, P0) first-run bootstrap — handled via firstAdoption silent path
  • (4, P1) resetSchema not dropping _develop — already drops both schemas
  • (5, P1) dirty flag ignored — reads dirty, refuses if true, points at --db-reset
  • (8, P2) writeCache error swallowing — returns and propagates errors
  • (9, P2) one-directional consistency check — checks both directions
  • (11, P2) password encoding — uses url.QueryEscape

Accepted and fixed in this push:

  • (2, P1) moved recoverDB() before buildBinary() — no wasted build cycle
  • (3, P1) external PG --db-reset now requires CODER_DEV_DB_RESET=1 env confirmation
  • (6, P1) cache staleness — added updateMigrationTracking() after server health check, captures new migrations and updates cache on every run
  • (10, P2) os.Stat error handling — any error (not just NotExist) now triggers mismatch/rollback
  • (16, P3) INTEGERBIGINT to match schema_migrations

Accepted, not fixing now:

  • (7) --skip-db-recovery removed by design (exit-on-mismatch replaces it) — updated PR description
  • (12) missing BinaryRepositoryURL — only matters if binaries not cached (rare), tracking as follow-up
  • (13) test coverage — fair point, will add in follow-up
  • (17) manual password read vs config helper — config helper is in cli/config which pulls in more deps; manual read is intentional

Pushed back on:

  • (14) _develop on production — this is a dev tool run via develop.sh, not coder server. Developer explicitly set the PG URL. Adding a guard would be checking whether the developer meant to do what they explicitly told the tool to do
  • (15) 6-char prefix — 448 migrations strong, enforced by the project's own naming convention. A regex adds complexity for a hypothetical future change
  • (20) dead code len(name) < 7 — it guards name[:6] from panicking on short filenames that happen to end in .down.sql. Not dead, defensive

Watching the CI go green from the Finnish archipelago. Hyvää yötä! 🌊

@mafredri mafredri force-pushed the develop-db-recovery branch from 196e2b0 to 664ef62 Compare March 24, 2026 11:35
@mafredri mafredri marked this pull request as ready for review March 24, 2026 13:02
@coder-tasks

coder-tasks Bot commented Mar 24, 2026

Copy link
Copy Markdown
Contributor

Documentation Check

Updates Needed

  • docs/about/contributing/CONTRIBUTING.md - The Troubleshooting section (and/or the "Running Coder on development mode" section) should document the new --db-rollback / --db-reset flags and the migration-mismatch scenario (e.g., switching git branches leaves the DB at a version the current branch doesn't know about). Developers who hit this error will look here first. A short note pointing to these flags is enough — the error message itself explains the choice.

Automated review via Coder Tasks

@mafredri mafredri force-pushed the develop-db-recovery branch 3 times, most recently from 1009162 to 540243f Compare March 24, 2026 17:28
@mafredri mafredri changed the title feat(scripts/develop): add automatic database migration recovery feat: add automatic database migration recovery to scripts/develop Mar 24, 2026
@mafredri mafredri force-pushed the develop-db-recovery branch 2 times, most recently from 2dd6a24 to 105a427 Compare March 24, 2026 18:55
When developers switch branches, the database may have migrations
from the other branch that don't exist in the current binary.
This causes coder server to fail at startup, leaving developers
stuck.

The develop script now detects this before starting the server:

1. Connects to postgres (starts temp embedded instance for
   built-in postgres, or uses CODER_PG_CONNECTION_URL).
2. Compares DB version against the source's latest migration.
3. If DB is ahead, searches git history for the missing down
   SQL files and applies them in a transaction.
4. If git recovery fails (ambiguous versions across branches,
   missing files), falls back to resetting the public schema.

Also adds --reset-db and --skip-db-recovery flags.
@mafredri mafredri force-pushed the develop-db-recovery branch from 105a427 to 4d96462 Compare March 24, 2026 19:31
@mafredri mafredri merged commit 78b18e7 into main Mar 24, 2026
28 checks passed
@mafredri mafredri deleted the develop-db-recovery branch March 24, 2026 20:05
@github-actions github-actions Bot locked and limited conversation to collaborators Mar 24, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants