feat: add automatic database migration recovery to scripts/develop#23466
Conversation
8c19c94 to
145ead6
Compare
Deep Review:
|
| # | Sev | Category | Issue |
|---|---|---|---|
| 1 | P0 | Bootstrapping | First run with existing DB is a hard error — every developer breaks on upgrade |
| 2 | P1 | UX | Recovery check runs after buildBinary() — two full build cycles on mismatch |
| 3 | P1 | Safety | handleExternalPG + --db-reset runs DROP SCHEMA CASCADE on arbitrary external DB with no confirmation |
| 4 | P1 | Correctness | resetSchema doesn't drop _develop schema — stale tracking survives --db-reset |
| 5 | P1 | Correctness | currentMigrationVersion ignores the dirty flag — rollback against half-applied schema |
| 6 | P1 | Detection gap | Cache only written during recovery — migrations applied by server between runs are invisible |
| 7 | P1 | Missing feature | --skip-db-recovery mentioned in PR description is not implemented |
| 8 | P2 | Reliability | writeCache silently swallows all errors → phantom mismatch on next run |
| 9 | P2 | Correctness | verifyDBCacheConsistency is one-directional (DB→cache), misses cache-side corruption |
| 10 | P2 | Correctness | os.Stat errors other than NotExist treated as "file exists" — false negative in findRollbacks and cacheShowsMismatch |
| 11 | P2 | Correctness | Password not URL-encoded in startTempPostgres connection string — breaks on special chars |
| 12 | P2 | Consistency | Embedded PG config missing BinaryRepositoryURL (flaky Maven mirror) and Encoding("UTF8") vs cli/server.go |
| 13 | P2 | Testing | Only cleanStalePIDFile tested (~3% coverage) — core algorithm has zero tests |
| 14 | P2 | Safety | No guard against _develop schema being created in a production DB via CODER_PG_CONNECTION_URL |
| 15 | P3 | Fragility | 6-char prefix parsing (name[:6]) — breaks if convention changes, doesn't match golang-migrate parsing |
| 16 | P3 | Consistency | Tracking table uses INTEGER but schema_migrations uses BIGINT |
| 17 | P3 | Consistency | Manual password file read vs. existing config.Root().PostgresPassword().Read() helper |
| 18 | P3 | DX | Error message alignment inconsistent; multi-layer wrapping buries actionable text |
| 19 | P3 | DX | No documentation, no explanation of migration_cache.json |
| 20 | P3 | Code quality | Dead code: len(name) < 7 can never trigger (.down.sql suffix guarantees ≥ 9) |
| 21 | P4 | Robustness | Ephemeral port TOCTOU race between listener.Close() and ep.Start() |
| 22 | P4 | DX | Logger(nil) on embedded PG silences diagnostics vs. production Logger(writer) pattern |
| 23 | P4 | Safety | PID reuse can cause false-positive stale PID detection |
Key Details
#1 (P0) — First-run bootstrap breaks every existing dev. When no migration_cache.json exists, cacheShowsMismatch returns true. Both handleBuiltinPG and handleExternalPG then error out demanding --db-rollback. Nothing is actually wrong with the DB. Fix: when no cache exists but the DB is healthy, auto-bootstrap the cache silently.
#2 (P1) — Two-build-cycle penalty. recoverDB() is called after buildBinary() (main.go line 358). A developer hits the mismatch error after waiting for a full make -j build, then must re-run with --db-rollback triggering another full build. Either move recoverDB() before buildBinary(), or prompt interactively.
#3 (P1) — Unguarded DROP SCHEMA CASCADE on external databases. handleExternalPG + --db-reset executes DROP SCHEMA public CASCADE on whatever CODER_PG_CONNECTION_URL points at — no confirmation, no safety check. If that URL points at a shared staging database (or worse, accidentally production), it's silently wiped. The built-in PG path is fine (os.RemoveAll on a local directory), but the external path needs an interactive confirmation prompt (e.g., "You are about to DROP all schemas in $HOST/$DBNAME. Type the database name to confirm:"), or at minimum refuse to reset an external DB without an explicit --force flag.
#4 (P1) — resetSchema doesn't drop _develop. After --db-reset, _develop.applied_migrations survives with stale entries. Next syncAndRecover sees a tracking table that disagrees with the freshly-empty schema_migrations. Add DROP SCHEMA IF EXISTS _develop CASCADE.
#5 (P1) — dirty flag ignored. currentMigrationVersion reads schema_migrations but ignores the dirty column. If the DB is dirty (half-applied migration), executing stored down_sql against an inconsistent schema will fail unpredictably or make things worse. Should SELECT version, dirty and refuse to recover when dirty, advising --db-reset.
#6 (P1) — Cache goes stale between runs. The cache is only written inside syncAndRecover(). Migrations applied by the server during normal operation are never captured. Scenario: developer bootstraps (cache captures v1–448), switches to feature branch (server applies v449), switches back (v449's file gone), runs develop — cache only knows v1–448, reports no mismatch. Server then fails at EnsureClean(). This is exactly the scenario the feature is meant to prevent.
#13 (P2) — ~3% test coverage. 47 lines of tests for 587 lines of schema-modifying code. The only tested function (cleanStalePIDFile) is a minor utility. Pure functions like contiguousFromTop, cacheShowsMismatch, readCache/writeCache are trivially unit-testable. The P0 first-run bug would have been caught by TestCacheShowsMismatch_NoCacheFile.
Questions
- Is there a plan to update the cache after normal server startup (not just during recovery)? Without this, finding # 6 means the feature only catches mismatches from the initial snapshot.
- The PR description mentions
--skip-db-recoverybut it's not implemented. Intentional or oversight? - Why cache-based detection over always querying the DB? For external PG, a direct query is cheap. For built-in PG, the cache avoids starting temp PG, but staleness seems like a worse trade-off.
Suggested Validation Plan
# 1. Verify first-run experience (P0 regression)
rm -rf .coderv2/postgres/migration_cache.json
go run ./scripts/develop/ # Should NOT error for existing devs
# 2. Full branch-switch scenario
go run ./scripts/develop/ --db-rollback # bootstrap
git checkout feature-branch-with-new-migration
go run ./scripts/develop/ # server applies migration
git checkout main # migration file gone
go run ./scripts/develop/ # MUST detect mismatch
# 3. --db-reset drops _develop schema
CODER_PG_CONNECTION_URL=... go run ./scripts/develop/ --db-reset
psql $CODER_PG_CONNECTION_URL -c "SELECT * FROM _develop.applied_migrations" # should fail
# 4. Dirty DB handling
psql -c "UPDATE schema_migrations SET dirty = true"
go run ./scripts/develop/ --db-rollback # should refuse, advise --db-reset
# 5. Mutual exclusion
go run ./scripts/develop/ --db-rollback --db-reset # should error🤖 This review was generated by Coder Agents.
febf4ce to
196e2b0
Compare
|
🤖 Agent on behalf of mafredri Thanks for the thorough review. Addressed the findings: Already fixed before this review (from earlier review rounds):
Accepted and fixed in this push:
Accepted, not fixing now:
Pushed back on:
|
196e2b0 to
664ef62
Compare
Documentation CheckUpdates Needed
Automated review via Coder Tasks |
1009162 to
540243f
Compare
2dd6a24 to
105a427
Compare
When developers switch branches, the database may have migrations from the other branch that don't exist in the current binary. This causes coder server to fail at startup, leaving developers stuck. The develop script now detects this before starting the server: 1. Connects to postgres (starts temp embedded instance for built-in postgres, or uses CODER_PG_CONNECTION_URL). 2. Compares DB version against the source's latest migration. 3. If DB is ahead, searches git history for the missing down SQL files and applies them in a transaction. 4. If git recovery fails (ambiguous versions across branches, missing files), falls back to resetting the public schema. Also adds --reset-db and --skip-db-recovery flags.
105a427 to
4d96462
Compare
🤖 Agent on behalf of mafredri
Problem
When switching branches, the database may have migrations from the other branch.
coder servercallsEnsureClean()which fails because the DB version is ahead of what the current binary knows. The server exits. The developer's test data (workspaces, custom templates, provisioner configs built up over days/weeks) is at risk — the only existing fix isrm -rf .coderv2/postgres.Design
Two data stores work together:
Cache file (
migration_cache.json) — a JSON map of{version: filename}for every migration that has been applied. Lives next to the postgres data directory. Its only job is answering "did any previously-applied migration disappear from disk?" without starting postgres. Read on every startup, written after the server is healthy.DB tracking table (
_develop.applied_migrations) — stores{version, filename, down_sql}for every applied migration. Lives in a_developschema in the database. Contains the actual rollback SQL. Only accessed when recovery is needed (temp postgres started) or during the post-health-check capture step.Conflict detection (cache vs disk)
The cache records what was on disk the last time the server ran successfully. On startup, each cache entry is checked against the migration directory:
Up().--db-reset). The server starts normally and the cache is written after the health check.The cache only detects that something changed. It doesn't attempt recovery — it exits with an error telling the developer to choose
--db-rollbackor--db-reset.Common ancestor detection (tracking table vs disk)
When
--db-rollbackis used, the tracking table determines what to undo. The common ancestor is the highest version where the tracking table and disk agree:Rollback only proceeds if the missing versions form a contiguous block from the top. Gaps (e.g. 448 and 446 missing but 447 present) indicate version number reuse across branches — not safely rollable, points at
--db-reset.Untracked migrations (schema_migrations version > max tracked version) mean someone ran
coder serveroutsidedevelop.sh. No down SQL stored — points at--db-reset.State machine