fix(lifecycle): archive remaining WAL files after shutdown #10345

Open
aglees wants to merge 2 commits into cloudnative-pg:main from aglees:dev/10341-data-loss-hibernate-cluster

Conversation

@aglees aglees commented Mar 20, 2026

During pod termination (hibernation, rolling update, node drain), PostgreSQL's built-in archiver may be killed before it can archive the final WAL segment. Add a call to ArchiveAllReadyWALs() after each TryShuttingDownSmartFast() in both shutdown paths (context cancellation and SIGTERM signal) to sweep any remaining .ready WAL files via the plugin sidecar, which outlives the postgres container as a native sidecar.

A fresh context.Background() is used for the WAL sweep because the parent context may already be cancelled in the context-cancellation path, which would cause gRPC calls to the sidecar to fail immediately.

Errors are logged but do not prevent the instance manager from exiting, consistent with existing shutdown error handling.

Closes #10341

@aglees aglees requested a review from a team as a code owner March 20, 2026 11:33
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Mar 20, 2026
@cnpg-bot cnpg-bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.25 release-1.27 release-1.28 labels Mar 20, 2026
@github-actions
Contributor

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

@dosubot dosubot bot added the bug 🐛 Something isn't working label Mar 20, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Gleeson <Andrew_Gleeson@external.mckinsey.com>
Development

Successfully merging this pull request may close these issues.

[Bug]: Data loss when restoring from WAL archives after hibernating a CNPG cluster