
Fix WAL data loss on instance hibernation/shutdown (#10341) #10352

Open
abhayclasher wants to merge 1 commit into cloudnative-pg:main from abhayclasher:fix/wal-data-loss-10341

@abhayclasher abhayclasher commented Mar 22, 2026

While looking at the WAL archiving logic, I noticed that when a CNPG instance shuts down normally, any final WAL segments sitting in .ready state never get pushed to S3. The archiver only runs during switchover/demotion, not during regular shutdown. This is a problem especially during cluster hibernation — you lose the last batch of transactions.

The fix is straightforward: call ArchiveAllReadyWALs() after the PostgreSQL shutdown completes, the same way we do for switchover. This ensures those final segments don't sit orphaned on local storage.

Fixes #10341

@abhayclasher abhayclasher requested a review from a team as a code owner March 22, 2026 03:56
@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Mar 22, 2026
@cnpg-bot cnpg-bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.25 release-1.27 release-1.28 labels Mar 22, 2026
@github-actions
Contributor

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

@dosubot dosubot bot added the bug 🐛 Something isn't working label Mar 22, 2026
When a CNPG instance is shut down (e.g. during hibernation), the last
WAL segments might remain on local storage and not be archived to
the object store. This is because ArchiveAllReadyWALs() was only
triggered during switchover/demotion.

This commit ensures that ArchiveAllReadyWALs() is also called after
a successful smart/fast shutdown in the lifecycle loop, preventing
data loss of the final WAL segments.

References: cloudnative-pg#10341
Signed-off-by: Abhay Kumar <abhaypro.cloud@gmail.com>
@abhayclasher abhayclasher force-pushed the fix/wal-data-loss-10341 branch from 66a4282 to 0c4f837 Compare March 22, 2026 04:09

aglees commented Mar 23, 2026

👋 I have a feeling that the difference from #10345 is important to the efficacy of the solution.

I think that on the <-ctx.Done() path (line 122), ctx is already cancelled. This means that this PR propagates the cancelled context through to the gRPC calls to the plugin sidecar, and those calls will fail immediately with context canceled.

Because of the select block in lifecycle.go, this might work when the select happens to pick the SIGTERM case, but that choice is non-deterministic. For our hibernation scenario specifically, it would silently fail to archive WALs roughly half the time, with no indication in the logs beyond a context-cancelled error that could easily be overlooked.

WDYT?


abhayclasher2 commented Mar 24, 2026

Hey @aglees — that's a solid catch. I'm at work right now so haven't had a chance to dig in properly, but I'll update this from my main account tonight with a proper fix (using a fresh context instead of the cancelled one). Will push the changes then. Thanks for flagging it!


Labels

backport-requested ◀️ This pull request should be backported to all supported releases bug 🐛 Something isn't working release-1.25 release-1.27 release-1.28 size:XS This PR changes 0-9 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Bug]: Data loss when restoring from WAL archives after hibernating a CNPG cluster

4 participants