Commit 2adf303

michaelpq and Xuneng Zhou committed
Fix timing-dependent failure in recovery test 004_timeline_switch

The test introduced by 17b2d5e verifies that a WAL receiver survives across a timeline jump by searching the server logs for termination messages. However, it called restart() before the timeline switch, which kills the WAL receiver and may log the exact message being checked, hence failing the test. As TAP tests reuse the same log file across restarts, rotate_logfile() is now called before the restart so that the log matching check is not impacted by log entries generated by a previous shutdown.

Recent changes to file handle inheritance altered I/O timing enough to make this fail consistently while testing another patch.

While at it, this adds an extra check based on a PID comparison. This check may lead to false positives, as the WAL receiver could have processed a timeline jump before the initial PID is grabbed, but it should be good enough in most cases.

Like 17b2d5e, backpatch down to v13.

Author: Bryan Green <dbryan.green@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/9d00b597-d64a-4f1e-802e-90f9dc394c70@gmail.com
Backpatch-through: 13
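
The stale-log problem described above can be reproduced without a server. The following is a minimal, self-contained sketch, not the code from the test itself: it writes a fake log to a temporary file and uses a byte offset as a stand-in for what rotate_logfile() achieves in the TAP framework. The helper name slurp_from, the log lines, and the pattern being matched are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Handle;

# Assumed pattern, for illustration only; the real test has its own check.
my $needle = qr/terminating walreceiver process/;

# Hypothetical helper: return the log contents starting at byte $offset.
sub slurp_from
{
	my ($path, $offset) = @_;
	open(my $fh, '<', $path) or die "could not open $path: $!";
	seek($fh, $offset, 0) or die "could not seek in $path: $!";
	local $/;    # slurp mode
	my $contents = <$fh>;
	close($fh);
	return defined $contents ? $contents : '';
}

my ($out, $logfile) = tempfile(UNLINK => 1);
$out->autoflush(1);

# The shutdown half of a restart logs the very message the check looks for.
print $out
  "FATAL:  terminating walreceiver process due to administrator command\n";

# Remember where the old entries end (stand-in for rotating the log file).
my $offset = -s $logfile;

# Entries written after the restart; no termination message here.
print $out "LOG:  started streaming WAL from primary on timeline 1\n";

# Scanning the whole file sees the stale entry; scanning from the
# offset only sees what was logged after the restart.
my $whole_log_matches = slurp_from($logfile, 0) =~ $needle ? 1 : 0;
my $new_log_matches = slurp_from($logfile, $offset) =~ $needle ? 1 : 0;

print "whole log matches: $whole_log_matches\n";
print "log since restart matches: $new_log_matches\n";
```

The first scan matches because of the entry written during shutdown, which is the situation that made the original test fail. The second scan does not match, which is the behavior the commit obtains by rotating the log file before the restart.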
1 parent 7609b34 commit 2adf303

1 file changed, +20 -1 lines changed

src/test/recovery/t/004_timeline_switch.pl

Lines changed: 20 additions & 1 deletion
@@ -4,7 +4,7 @@
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 4;
+use Test::More tests => 5;
 
 $ENV{PGDATABASE} = 'postgres';
 
@@ -55,8 +55,19 @@
 	'postgresql.conf', qq(
 primary_conninfo='$connstr_1'
 ));
+
+# Rotate logfile before restarting, for the log checks done below.
+$node_standby_2->rotate_logfile;
 $node_standby_2->restart;
 
+# Wait for walreceiver to reconnect after the restart.  We want to
+# verify that after reconnection, the walreceiver stays alive during
+# the timeline switch.
+$node_standby_2->poll_query_until('postgres',
+	"SELECT EXISTS(SELECT 1 FROM pg_stat_wal_receiver)");
+my $wr_pid_before_switch = $node_standby_2->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_wal_receiver");
+
 # Insert some data in standby 1 and check its presence in standby 2
 # to ensure that the timeline switch has been done.
 $node_standby_1->safe_psql('postgres',
@@ -77,6 +88,14 @@
 	),
 	'WAL receiver should not be stopped across timeline jumps');
 
+# Verify that the walreceiver process stayed alive across the timeline
+# switch, check its PID.
+my $wr_pid_after_switch = $node_standby_2->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_wal_receiver");
+
+is($wr_pid_before_switch, $wr_pid_after_switch,
+	'WAL receiver PID matches across timeline jumps');
+
 # Ensure that a standby is able to follow a master on a newer timeline
 # when WAL archiving is enabled.
 