Commit a1b8b3b

Fix unconditional WAL receiver shutdown during stream-archive transition
Commit b4f584f (affecting v15~, later backpatched down to 13 as of 3635a0a) introduced an unconditional WAL receiver shutdown when switching from streaming to archive WAL sources. This causes problems during a timeline switch, when a WAL receiver enters WALRCV_WAITING state but remains alive, waiting for instructions.

The unconditional shutdown can break some monitoring scenarios, as the WAL receiver gets repeatedly terminated and re-spawned, causing pg_stat_wal_receiver.status to show "streaming" instead of "waiting". This masks the fact that the WAL receiver is waiting for a new TLI and a new LSN to be able to continue streaming.

This commit makes the shutdown conditional, while InstallXLogFileSegmentActive is always reset to prevent the regression fixed by b4f584f. The WAL receiver is only terminated when it is actively streaming (WALRCV_STREAMING, WALRCV_STARTING, or WALRCV_RESTARTING). When in WALRCV_WAITING state, only the InstallXLogFileSegmentActive flag is reset, to allow archive restoration without killing the process.

WALRCV_STOPPED and WALRCV_STOPPING are not reachable states in this code path. For the latter, the startup process is the one in charge of setting WALRCV_STOPPING via ShutdownWalRcv(), waiting for the WAL receiver to reach a WALRCV_STOPPED state after switching walRcvState, so WaitForWALToBecomeAvailable() cannot be reached while a WAL receiver is in a WALRCV_STOPPING state.

A regression test is added to check that a WAL receiver is not stopped on a timeline jump; it fails when the fix of this commit is reverted.

Reported-by: Ryan Bird <ryanzxg@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
Backpatch-through: 13

1 parent faba259 commit a1b8b3b

2 files changed, +31 -6 lines changed

src/backend/access/transam/xlog.c

Lines changed: 22 additions & 5 deletions
@@ -946,6 +946,7 @@ static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+static void ResetInstallXLogFileSegmentActive(void);
 static void XLogShutdownWalRcv(void);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
@@ -12837,8 +12838,18 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Before we leave XLOG_FROM_STREAM state, make sure that
 					 * walreceiver is not active, so that it won't overwrite
 					 * WAL that we restore from archive.
+					 * If walreceiver is actively streaming (or attempting to
+					 * connect), we must shut it down. However, if it's
+					 * already in WAITING state (e.g., due to timeline
+					 * divergence), we only need to reset the install flag to
+					 * allow archive restoration.
 					 */
-					XLogShutdownWalRcv();
+					if (WalRcvStreaming())
+						XLogShutdownWalRcv();
+					else
+					{
+						ResetInstallXLogFileSegmentActive();
+					}
 
 					/*
 					 * Before we sleep, re-scan for possible new timelines if
@@ -13191,17 +13202,23 @@ StartupRequestWalReceiverRestart(void)
 	}
 }
 
-/* Thin wrapper around ShutdownWalRcv(). */
+/* Disable WAL file recycling and preallocation. */
 static void
-XLogShutdownWalRcv(void)
+ResetInstallXLogFileSegmentActive(void)
 {
-	ShutdownWalRcv();
-
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	LWLockRelease(ControlFileLock);
 }
 
+/* Thin wrapper around ShutdownWalRcv(). */
+static void
+XLogShutdownWalRcv(void)
+{
+	ShutdownWalRcv();
+	ResetInstallXLogFileSegmentActive();
+}
+
 /*
  * Determine what log level should be used to report a corrupt WAL record
  * in the current WAL page, previously read by XLogPageRead().
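
To illustrate the conditional shutdown introduced in xlog.c, here is a minimal, self-contained C sketch. It is not PostgreSQL code: ModelState, model_walrcv_streaming() and model_leave_stream_source() are hypothetical names, the enum is a simplified stand-in for WalRcvState, and both the ControlFileLock locking and the real process shutdown are left out. It only models which states lead to termination, and that the install flag is reset on both paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for WalRcvState; state names follow the commit message. */
typedef enum
{
	WALRCV_STOPPED,
	WALRCV_STARTING,
	WALRCV_STREAMING,
	WALRCV_WAITING,
	WALRCV_RESTARTING,
	WALRCV_STOPPING
} WalRcvState;

/* Hypothetical model of the state touched when leaving XLOG_FROM_STREAM. */
typedef struct
{
	WalRcvState state;
	bool		install_segment_active; /* models InstallXLogFileSegmentActive */
	int			shutdown_count;			/* times the receiver was terminated */
} ModelState;

/* True for the three states the commit message treats as actively streaming. */
bool
model_walrcv_streaming(WalRcvState s)
{
	return s == WALRCV_STREAMING ||
		s == WALRCV_STARTING ||
		s == WALRCV_RESTARTING;
}

/*
 * Models the patched transition: terminate the receiver only when it is
 * actively streaming, and reset the install flag in every case.
 */
void
model_leave_stream_source(ModelState *m)
{
	assert(m);

	if (model_walrcv_streaming(m->state))
	{
		/* Stands in for ShutdownWalRcv(); the real call waits for the stop. */
		m->shutdown_count++;
		m->state = WALRCV_STOPPED;
	}

	/* Stands in for ResetInstallXLogFileSegmentActive(), done on both paths. */
	m->install_segment_active = false;
}
```

In this model, calling model_leave_stream_source() on a receiver in WALRCV_WAITING leaves its state and shutdown_count untouched and only clears install_segment_active, whereas a receiver in WALRCV_STREAMING is also counted as shut down.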

src/test/recovery/t/004_timeline_switch.pl

Lines changed: 9 additions & 1 deletion
@@ -7,7 +7,7 @@
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 $ENV{PGDATABASE} = 'postgres';
 
@@ -71,6 +71,14 @@
   $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result, qq(2000), 'check content of standby 2');
 
+# Check the logs, WAL receiver should not have been stopped while
+# transitioning to its new timeline.  There is no need to rely on an
+# offset in this check of the server logs: a new log file is used on
+# node restart when primary_conninfo is updated above.
+ok( !$node_standby_2->log_contains(
+		"FATAL: .* terminating walreceiver process due to administrator command"
+	),
+	'WAL receiver should not be stopped across timeline jumps');
 
 # Ensure that a standby is able to follow a primary on a newer timeline
 # when WAL archiving is enabled.
