Skip to content

Commit e3714dc

Browse files
committed
Fix regression with slot invalidation checks
This commit reverts 818fefd, that has been introduced to address a an instability in some of the TAP tests due to the presence of random standby snapshot WAL records, when slots are invalidated by InvalidatePossiblyObsoleteSlot(). Anyway, this commit had also the consequence of introducing a behavior regression. After 818fefd, the code may determine that a slot needs to be invalidated while it may not require one: the slot may have moved from a conflicting state to a non-conflicting state between the moment when the mutex is released and the moment when we recheck the slot, in InvalidatePossiblyObsoleteSlot(). Hence, the invalidations may be more aggressive than they actually have to. 105b2cb has tackled the test instability in a way that should be hopefully sufficient for the buildfarm, even for slow members: - In v18, the test relies on an injection point that bypasses the creation of the random records generated for standby snapshots, eliminating the random factor that impacted the test. This option was not available when 818fefd was discussed. - In v16 and v17, the problem was bypassed by disallowing a slot to become active in some of the scenarios tested. While on it, this commit adds a comment to document that it is fine for a recheck to use xmin and LSN values stored in the slot, without storing and reusing them across multiple checks. Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/f492465f-657e-49af-8317-987460cb68b0.mengjuan.cmj@alibaba-inc.com Backpatch-through: 16
1 parent cdc04a6 commit e3714dc

File tree

1 file changed

+14
-33
lines changed

1 file changed

+14
-33
lines changed

src/backend/replication/slot.c

Lines changed: 14 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -1350,11 +1350,6 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
13501350
{
13511351
int last_signaled_pid = 0;
13521352
bool released_lock = false;
1353-
bool terminated = false;
1354-
TransactionId initial_effective_xmin = InvalidTransactionId;
1355-
TransactionId initial_catalog_effective_xmin = InvalidTransactionId;
1356-
XLogRecPtr initial_restart_lsn = InvalidXLogRecPtr;
1357-
ReplicationSlotInvalidationCause conflict_prev PG_USED_FOR_ASSERTS_ONLY = RS_INVAL_NONE;
13581353

13591354
for (;;)
13601355
{
@@ -1389,24 +1384,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
13891384
*/
13901385
if (s->data.invalidated == RS_INVAL_NONE)
13911386
{
1392-
/*
1393-
* The slot's mutex will be released soon, and it is possible that
1394-
* those values change since the process holding the slot has been
1395-
* terminated (if any), so record them here to ensure that we
1396-
* would report the correct conflict cause.
1397-
*/
1398-
if (!terminated)
1399-
{
1400-
initial_restart_lsn = s->data.restart_lsn;
1401-
initial_effective_xmin = s->effective_xmin;
1402-
initial_catalog_effective_xmin = s->effective_catalog_xmin;
1403-
}
1404-
14051387
switch (cause)
14061388
{
14071389
case RS_INVAL_WAL_REMOVED:
1408-
if (initial_restart_lsn != InvalidXLogRecPtr &&
1409-
initial_restart_lsn < oldestLSN)
1390+
if (s->data.restart_lsn != InvalidXLogRecPtr &&
1391+
s->data.restart_lsn < oldestLSN)
14101392
conflict = cause;
14111393
break;
14121394
case RS_INVAL_HORIZON:
@@ -1415,12 +1397,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
14151397
/* invalid DB oid signals a shared relation */
14161398
if (dboid != InvalidOid && dboid != s->data.database)
14171399
break;
1418-
if (TransactionIdIsValid(initial_effective_xmin) &&
1419-
TransactionIdPrecedesOrEquals(initial_effective_xmin,
1400+
if (TransactionIdIsValid(s->effective_xmin) &&
1401+
TransactionIdPrecedesOrEquals(s->effective_xmin,
14201402
snapshotConflictHorizon))
14211403
conflict = cause;
1422-
else if (TransactionIdIsValid(initial_catalog_effective_xmin) &&
1423-
TransactionIdPrecedesOrEquals(initial_catalog_effective_xmin,
1404+
else if (TransactionIdIsValid(s->effective_catalog_xmin) &&
1405+
TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
14241406
snapshotConflictHorizon))
14251407
conflict = cause;
14261408
break;
@@ -1433,13 +1415,6 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
14331415
}
14341416
}
14351417

1436-
/*
1437-
* The conflict cause recorded previously should not change while the
1438-
* process owning the slot (if any) has been terminated.
1439-
*/
1440-
Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
1441-
conflict_prev != conflict));
1442-
14431418
/* if there's no conflict, we're done */
14441419
if (conflict == RS_INVAL_NONE)
14451420
{
@@ -1514,8 +1489,6 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
15141489
(void) kill(active_pid, SIGTERM);
15151490

15161491
last_signaled_pid = active_pid;
1517-
terminated = true;
1518-
conflict_prev = conflict;
15191492
}
15201493

15211494
/* Wait until the slot is released. */
@@ -1526,6 +1499,14 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
15261499
* Re-acquire lock and start over; we expect to invalidate the
15271500
* slot next time (unless another process acquires the slot in the
15281501
* meantime).
1502+
*
1503+
* Note: It is possible for a slot to advance its restart_lsn or
1504+
* xmin values sufficiently between when we release the mutex and
1505+
* when we recheck, moving from a conflicting state to a non
1506+
* conflicting state. This is intentional and safe: if the slot
1507+
* has caught up while we're busy here, the resources we were
1508+
* concerned about (WAL segments or tuples) have not yet been
1509+
* removed, and there's no reason to invalidate the slot.
15291510
*/
15301511
LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
15311512
continue;

0 commit comments

Comments
 (0)