feat: support runtime recovery_target_* on post-bootstrap clusters#10662
Open
viveksk6 wants to merge 1 commit into
CloudNativePG today only honors a recovery target at bootstrap, via spec.bootstrap.recovery.recoveryTarget. For controlled cutover at a fixed wall-clock time, operators need a way to apply a recovery target to an already-running standby cluster without a full re-bootstrap.

This change adds:
- spec.recoveryTarget (reuses the existing RecoveryTarget struct)
- spec.recoveryTargetAction: pause | promote | shutdown

When set, the operator renders the corresponding recovery_target_* GUCs into custom.conf (operator-managed, bypassing the read-only postgresql.auto.conf hardening). The existing cnpg.config_sha256 mechanism marks pg_settings.pending_restart=true, and the rollout machinery performs a controlled in-place restart so PostgreSQL applies the recovery target at server start (PGC_POSTMASTER).

The GUCs are gated on cluster.IsReplica(), so they are silently ignored on primaries and automatically dropped from the rendered config after promotion.

Validated end-to-end on a real cluster: standby paused before the first commit past targetTime, promotion via replica.enabled=false preserved that exact state on a new timeline, and post-target transactions were excluded from the new primary's data.

Closes cloudnative-pg#10634

Signed-off-by: viveksk6 <viveksk69@gmail.com>
Summary
Adds a runtime spec.recoveryTarget field that takes effect on a post-bootstrap CNPG cluster, allowing operators to apply recovery_target_time (or _lsn / _xid / _name) to an already-running standby and have it pause at exactly that point, without re-bootstrapping the cluster.
Closes #10634
Motivation
See the linked issue for full motivation. TL;DR: today no in-cluster option exists for applying a recovery target to a running standby:
- spec.replica.minApplyDelay doesn't bound apply at promotion;
- bootstrap.recovery.recoveryTarget requires re-bootstrap;
- ALTER SYSTEM SET recovery_target_time is blocked by CNPG hardening of postgresql.auto.conf.
Design
- Adds spec.recoveryTarget (reuses existing RecoveryTarget struct) and spec.recoveryTargetAction (defaults to pause).
- The operator renders the recovery_target_* GUCs into custom.conf (which CNPG already owns and writes freely), gated on IsReplica().
- cnpg.config_sha256 changes → Postgres reload marks pg_settings.pending_restart=true (these GUCs are PGC_POSTMASTER).
- The existing isInstanceNeedingRollout machinery handles the controlled in-place restart.
- After promotion (replica.enabled=false), IsReplica() returns false and the GUCs are dropped from the rendered config, so no state is left over on the new primary.
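The rendering and gating steps above can be sketched in Go roughly as follows. This is an illustration only, not the code in this PR: the RecoveryTarget struct here is a simplified stand-in, and renderRecoveryGUCs, formatConf, and configHash are hypothetical helper names. The only PostgreSQL facts it relies on are the recovery_target_* setting names and the recovery_target_action values pause, promote, and shutdown.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// RecoveryTarget is a simplified, hypothetical stand-in for the existing
// RecoveryTarget struct; only four target kinds are modelled here.
type RecoveryTarget struct {
	TargetTime string
	TargetLSN  string
	TargetXID  string
	TargetName string
}

// renderRecoveryGUCs returns the recovery_target_* settings for a replica
// cluster. It returns an empty map on a primary, which is what makes the
// settings disappear from the rendered config after promotion.
func renderRecoveryGUCs(isReplica bool, target *RecoveryTarget, action string) map[string]string {
	gucs := map[string]string{}
	if !isReplica || target == nil {
		return gucs
	}
	// PostgreSQL accepts at most one recovery target, so use the first set field.
	switch {
	case target.TargetTime != "":
		gucs["recovery_target_time"] = target.TargetTime
	case target.TargetLSN != "":
		gucs["recovery_target_lsn"] = target.TargetLSN
	case target.TargetXID != "":
		gucs["recovery_target_xid"] = target.TargetXID
	case target.TargetName != "":
		gucs["recovery_target_name"] = target.TargetName
	default:
		return gucs
	}
	if action == "" {
		action = "pause" // default described in the design
	}
	gucs["recovery_target_action"] = action
	return gucs
}

// formatConf renders settings as sorted "key = 'value'" lines so that the
// output, and therefore its hash, is deterministic.
func formatConf(gucs map[string]string) string {
	keys := make([]string, 0, len(gucs))
	for k := range gucs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s = '%s'\n", k, strings.ReplaceAll(gucs[k], "'", "''"))
	}
	return b.String()
}

// configHash is an assumed stand-in for the config hash; the operator's
// real hashing inputs are not described in this PR text.
func configHash(conf string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(conf)))
}

func main() {
	target := &RecoveryTarget{TargetTime: "2025-01-01 00:00:00+00"}
	replicaConf := formatConf(renderRecoveryGUCs(true, target, ""))
	primaryConf := formatConf(renderRecoveryGUCs(false, target, ""))
	fmt.Print(replicaConf)
	fmt.Println("hash changed:", configHash(replicaConf) != configHash(primaryConf))
	fmt.Println("primary config empty:", primaryConf == "")
}
```

Keeping the gate inside the render function means promotion needs no separate cleanup step: the next render simply produces a config without the recovery settings, and the changed hash signals that the config differs.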
End-to-end validation
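Tested on a live cluster:
- Set spec.recoveryTarget.targetTime on a running standby; the recovery_target_* GUCs appeared in custom.conf.
- pg_get_wal_replay_pause_state() reported paused, with the standby stopped before the first commit past targetTime.
- Promotion via replica.enabled=false preserved that state on a new timeline, and post-target transactions were excluded from the new primary's data.
Deferred (happy to add in follow-ups)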
- […] targetTime / targetLSN / etc.).
- status.recoveryTargetStatus from instance manager probes.
- […] tests/e2e/.
Checklist
- feat: Conventional Commits title
- Signed-off-by: on every commit
- make generate + make manifests regenerated artifacts
- go build ./..., go vet, existing unit tests pass

Open to feedback on field placement and API shape before adding the follow-ups.