feat: support runtime recovery_target_* on post-bootstrap clusters #10662

Open

viveksk6 wants to merge 1 commit into cloudnative-pg:main from viveksk6:feature/runtime-recovery-target

Conversation

@viveksk6

Summary

Adds a runtime spec.recoveryTarget field that takes effect on a
post-bootstrap CNPG cluster, allowing operators to apply
recovery_target_time (or _lsn / _xid / _name) to an
already-running standby and have it pause at exactly that point —
without re-bootstrapping the cluster.

Closes #10634
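
The proposed API surface can be pictured with the following Go sketch. This is an illustration of the shape described above only, not the actual CNPG type definitions; the struct field set and the `actionOrDefault` helper are assumptions based on the PR text (field names mirror recovery_target_time / _lsn / _xid / _name, and the action defaults to pause).

```go
package main

import "fmt"

// RecoveryTarget mirrors the existing CNPG bootstrap recovery target
// (field set abridged for illustration).
type RecoveryTarget struct {
	TargetTime string `json:"targetTime,omitempty"`
	TargetLSN  string `json:"targetLSN,omitempty"`
	TargetXID  string `json:"targetXID,omitempty"`
	TargetName string `json:"targetName,omitempty"`
}

// ClusterSpec sketch: the PR adds recoveryTarget and recoveryTargetAction
// at the top level of the spec.
type ClusterSpec struct {
	RecoveryTarget       *RecoveryTarget `json:"recoveryTarget,omitempty"`
	RecoveryTargetAction string          `json:"recoveryTargetAction,omitempty"` // pause | promote | shutdown
}

// actionOrDefault applies the documented default of "pause" when no
// explicit action is set.
func actionOrDefault(s ClusterSpec) string {
	if s.RecoveryTargetAction == "" {
		return "pause"
	}
	return s.RecoveryTargetAction
}

func main() {
	spec := ClusterSpec{
		RecoveryTarget: &RecoveryTarget{TargetTime: "2026-05-11 08:00:00+00"},
	}
	fmt.Println(actionOrDefault(spec)) // prints: pause
}
```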

Motivation

See the linked issue for full motivation. TL;DR: today there is no
in-cluster option for applying a recovery target to a running standby.
spec.replica.minApplyDelay doesn't bound the replay position at
promotion; bootstrap.recovery.recoveryTarget requires a full
re-bootstrap; and ALTER SYSTEM SET recovery_target_time is blocked by
CNPG's hardening of postgresql.auto.conf.

Design

  • New spec.recoveryTarget (reuses existing RecoveryTarget struct)
    and spec.recoveryTargetAction (defaults to pause).
  • Operator renders recovery_target_* GUCs into custom.conf (which
    CNPG already owns and writes freely), gated on IsReplica().
  • A change to cnpg.config_sha256 triggers a Postgres reload, which
    marks pg_settings.pending_restart=true (these GUCs are
    PGC_POSTMASTER).
  • Existing isInstanceNeedingRollout machinery handles the controlled
    in-place restart.
  • After promotion (replica.enabled=false), IsReplica() returns
    false and the GUCs are dropped from the rendered config, leaving
    no leftover state on the new primary.
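
The gating and reload trigger described above can be sketched as follows. Function names and the map-based config are illustrative assumptions, not the actual CNPG internals; the point is that the GUCs are rendered only while IsReplica() holds, and that any change to the rendered text changes the digest that drives the reload.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// renderRecoveryGUCs emits recovery_target_* settings only while the
// cluster is a replica, so they disappear from the rendered config
// after promotion.
func renderRecoveryGUCs(isReplica bool, target map[string]string, action string) string {
	if !isReplica || len(target) == 0 {
		return "" // primaries get no recovery_target_* lines
	}
	keys := make([]string, 0, len(target))
	for k := range target {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output keeps the digest stable
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s = '%s'\n", k, target[k])
	}
	fmt.Fprintf(&b, "recovery_target_action = '%s'\n", action)
	return b.String()
}

// configSHA256 mimics the cnpg.config_sha256 idea: a digest of the
// rendered config; a changed digest signals instances to reload.
func configSHA256(conf string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(conf)))
}

func main() {
	target := map[string]string{"recovery_target_time": "2026-05-11 08:00:00+00"}
	standby := renderRecoveryGUCs(true, target, "pause")
	primary := renderRecoveryGUCs(false, target, "pause")
	fmt.Print(standby)
	fmt.Println(configSHA256(standby) != configSHA256(primary)) // digest change triggers the reload
}
```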

End-to-end validation

Tested on a live cluster:

| Step | Result |
|------|--------|
| Patch spec.recoveryTarget.targetTime | Operator triggers an in-place restart (~12s) |
| recovery_target_* GUCs in custom.conf | Present while standby, gated on IsReplica() |
| Replay reaches target | Paused before the first commit past targetTime |
| pg_get_wal_replay_pause_state() | paused |
| Promote via replica.enabled=false | New timeline at the paused LSN |
| Post-target transactions in new primary | None (a post-target CREATE TABLE was excluded) |
| GUCs after promotion | Dropped from custom.conf (gated by IsReplica()) |
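
Checks such as "paused before the first commit past targetTime" ultimately reduce to comparing pg_lsn values like the ones returned by pg_last_wal_replay_lsn(). A minimal sketch of that comparison, assuming a hypothetical parseLSN helper (illustration only, not part of CNPG or the validation above):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts a PostgreSQL pg_lsn string such as "0/16B3D50"
// into a 64-bit WAL position (high word / low word, both hex).
func parseLSN(s string) (uint64, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("malformed LSN %q", s)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return hi<<32 | lo, nil
}

func main() {
	paused, _ := parseLSN("0/16B3D50") // where replay stopped
	target, _ := parseLSN("0/16C0000") // LSN of the recovery target
	fmt.Println(paused <= target)      // prints: true
}
```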

Deferred (happy to add in follow-ups)

  • Webhook validation (one-of mutual exclusion on
    targetTime/targetLSN/etc.).
  • Populate status.recoveryTargetStatus from instance manager probes.
  • E2E test under tests/e2e/.

Checklist

  • feat: Conventional Commits title
  • DCO Signed-off-by: on every commit
  • make generate + make manifests regenerated artifacts
  • go build ./..., go vet, existing unit tests pass
  • Webhook validation (follow-up)
  • Status reporting (follow-up)
  • E2E test (follow-up)

Open to feedback on field placement and API shape before adding the
follow-ups.

@viveksk6 viveksk6 requested review from a team, NiccoloFei, jsilvela and litaocdl as code owners May 11, 2026 08:30
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 11, 2026
@cnpg-bot cnpg-bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.25 release-1.28 release-1.29 labels May 11, 2026
@github-actions
Contributor

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the backport-requested ◀️ label or add the do not backport label
  • To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

@dosubot dosubot Bot added do not backport This PR must not be backported - it will be in the next minor release enhancement 🪄 New feature or request labels May 11, 2026
@viveksk6 viveksk6 force-pushed the feature/runtime-recovery-target branch from 87279cf to 76f7e20 Compare May 11, 2026 08:42
CloudNativePG today only honors a recovery target at bootstrap, via
spec.bootstrap.recovery.recoveryTarget. For controlled cutover at a
fixed wall-clock time, operators need a way to apply a recovery
target to an already-running standby cluster without a full
re-bootstrap.
This change adds:
  - spec.recoveryTarget (reuses the existing RecoveryTarget struct)
  - spec.recoveryTargetAction: pause | promote | shutdown
When set, the operator renders the corresponding recovery_target_*
GUCs into custom.conf (operator-managed, bypassing the read-only
postgresql.auto.conf hardening). The existing cnpg.config_sha256
mechanism marks pg_settings.pending_restart=true, and the rollout
machinery performs a controlled in-place restart so PostgreSQL
applies the recovery target at server start (PGC_POSTMASTER).
The GUCs are gated on cluster.IsReplica(), so they are silently
ignored on primaries and automatically dropped from the rendered
config after promotion.
Validated end-to-end on a real cluster: standby paused before the
first commit past targetTime, promotion via replica.enabled=false
preserved that exact state on a new timeline, and post-target
transactions were excluded from the new primary's data.
Closes cloudnative-pg#10634

Signed-off-by: viveksk6 <viveksk69@gmail.com>
@viveksk6 viveksk6 force-pushed the feature/runtime-recovery-target branch from 76f7e20 to 6a45105 Compare May 13, 2026 10:29

Labels

  • backport-requested ◀️ This pull request should be backported to all supported releases
  • do not backport This PR must not be backported - it will be in the next minor release
  • enhancement 🪄 New feature or request
  • release-1.25, release-1.28, release-1.29
  • size:L This PR changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Feature]: Allow setting a recovery target on a post-bootstrap Cluster
