
Commit d3ceb20

Committed by Amit Kapila
Prevent invalidation of newly created replication slots.
A race condition could cause a newly created replication slot to become invalidated between WAL reservation and a checkpoint. Previously, if the required WAL had been removed, we retried the reservation process. However, the slot could still be invalidated before the retry if the WAL was not yet removed but the checkpoint had already advanced the redo pointer beyond the slot's intended restart LSN and computed the minimum LSN that needs to be preserved for the slots.

The fix is to acquire an exclusive lock on ReplicationSlotAllocationLock during WAL reservation, serializing WAL reservation with the checkpoint's computation of the minimum restart_lsn. This ensures that, if WAL reservation occurs first, the checkpoint waits until restart_lsn is updated before removing WAL. If the checkpoint runs first, subsequent WAL reservations pick a position at or after the latest checkpoint's redo pointer.

We can't use the same fix for branch 17 and prior because commit 2090edc changed the code to compute the minimum restart_lsn among slots at the beginning of a checkpoint (or restart point). The fix for 17 and prior branches is under discussion and will be committed separately.

Reported-by: suyu.cmj <mengjuan.cmj@alibaba-inc.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/5e045179-236f-4f8f-84f1-0f2566ba784c.mengjuan.cmj@alibaba-inc.com
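To make the serialization concrete, the following is a minimal illustrative sketch of the interlock the message describes, written against the backend primitives that appear in the diff below (LWLockAcquire, ReplicationSlotAllocationLock, GetRedoRecPtr, ReplicationSlotsComputeRequiredLSN). It is not the committed code: the helper names reserve_wal_for_slot and compute_min_slot_lsn are hypothetical, and the real logic lives in ReplicationSlotReserveWal and CheckPointReplicationSlots.

/*
 * Illustrative sketch only (assumes PostgreSQL backend headers such as
 * replication/slot.h and storage/lwlock.h); the actual change is shown
 * in the diff further down.
 */

/* Slot-creation side: publish restart_lsn while holding the lock exclusively. */
static void
reserve_wal_for_slot(ReplicationSlot *slot)     /* hypothetical helper */
{
    LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);

    /*
     * LW_EXCLUSIVE conflicts with the LW_SHARED acquisition in the
     * checkpoint path, so the two sides cannot interleave: either the
     * checkpoint already advanced the redo pointer (and GetRedoRecPtr()
     * returns a position at or after it), or the checkpoint waits on this
     * lock and later sees the restart_lsn published below.
     */
    SpinLockAcquire(&slot->mutex);
    slot->data.restart_lsn = GetRedoRecPtr();   /* physical-slot case */
    SpinLockRelease(&slot->mutex);

    ReplicationSlotsComputeRequiredLSN();       /* prevent WAL removal */

    LWLockRelease(ReplicationSlotAllocationLock);
}

/* Checkpoint side: compute the minimum required LSN under the same lock. */
static XLogRecPtr
compute_min_slot_lsn(void)                      /* hypothetical helper */
{
    XLogRecPtr  min_lsn = InvalidXLogRecPtr;

    LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
    /* ... scan the in-use slots and take the minimum restart_lsn ... */
    LWLockRelease(ReplicationSlotAllocationLock);

    return min_lsn;
}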
1 parent 18b3493 commit d3ceb20

File tree

1 file changed (+54, -46)

src/backend/replication/slot.c

Lines changed: 54 additions & 46 deletions
@@ -1530,63 +1530,65 @@ void
 ReplicationSlotReserveWal(void)
 {
     ReplicationSlot *slot = MyReplicationSlot;
+    XLogSegNo   segno;
+    XLogRecPtr  restart_lsn;

     Assert(slot != NULL);
     Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
     Assert(slot->last_saved_restart_lsn == InvalidXLogRecPtr);

     /*
-     * The replication slot mechanism is used to prevent removal of required
-     * WAL. As there is no interlock between this routine and checkpoints, WAL
-     * segments could concurrently be removed when a now stale return value of
-     * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
-     * this happens we'll just retry.
+     * The replication slot mechanism is used to prevent the removal of
+     * required WAL.
+     *
+     * Acquire an exclusive lock to prevent the checkpoint process from
+     * concurrently computing the minimum slot LSN (see
+     * CheckPointReplicationSlots). This ensures that the WAL reserved for
+     * replication cannot be removed during a checkpoint.
+     *
+     * The mechanism is reliable because if WAL reservation occurs first, the
+     * checkpoint must wait for the restart_lsn update before determining the
+     * minimum non-removable LSN. On the other hand, if the checkpoint happens
+     * first, subsequent WAL reservations will select positions at or beyond
+     * the redo pointer of that checkpoint.
      */
-    while (true)
-    {
-        XLogSegNo   segno;
-        XLogRecPtr  restart_lsn;
+    LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);

-        /*
-         * For logical slots log a standby snapshot and start logical decoding
-         * at exactly that position. That allows the slot to start up more
-         * quickly. But on a standby we cannot do WAL writes, so just use the
-         * replay pointer; effectively, an attempt to create a logical slot on
-         * standby will cause it to wait for an xl_running_xact record to be
-         * logged independently on the primary, so that a snapshot can be
-         * built using the record.
-         *
-         * None of this is needed (or indeed helpful) for physical slots as
-         * they'll start replay at the last logged checkpoint anyway. Instead
-         * return the location of the last redo LSN. While that slightly
-         * increases the chance that we have to retry, it's where a base
-         * backup has to start replay at.
-         */
-        if (SlotIsPhysical(slot))
-            restart_lsn = GetRedoRecPtr();
-        else if (RecoveryInProgress())
-            restart_lsn = GetXLogReplayRecPtr(NULL);
-        else
-            restart_lsn = GetXLogInsertRecPtr();
+    /*
+     * For logical slots log a standby snapshot and start logical decoding at
+     * exactly that position. That allows the slot to start up more quickly.
+     * But on a standby we cannot do WAL writes, so just use the replay
+     * pointer; effectively, an attempt to create a logical slot on standby
+     * will cause it to wait for an xl_running_xact record to be logged
+     * independently on the primary, so that a snapshot can be built using the
+     * record.
+     *
+     * None of this is needed (or indeed helpful) for physical slots as
+     * they'll start replay at the last logged checkpoint anyway. Instead,
+     * return the location of the last redo LSN, where a base backup has to
+     * start replay at.
+     */
+    if (SlotIsPhysical(slot))
+        restart_lsn = GetRedoRecPtr();
+    else if (RecoveryInProgress())
+        restart_lsn = GetXLogReplayRecPtr(NULL);
+    else
+        restart_lsn = GetXLogInsertRecPtr();

-        SpinLockAcquire(&slot->mutex);
-        slot->data.restart_lsn = restart_lsn;
-        SpinLockRelease(&slot->mutex);
+    SpinLockAcquire(&slot->mutex);
+    slot->data.restart_lsn = restart_lsn;
+    SpinLockRelease(&slot->mutex);

-        /* prevent WAL removal as fast as possible */
-        ReplicationSlotsComputeRequiredLSN();
+    /* prevent WAL removal as fast as possible */
+    ReplicationSlotsComputeRequiredLSN();

-        /*
-         * If all required WAL is still there, great, otherwise retry. The
-         * slot should prevent further removal of WAL, unless there's a
-         * concurrent ReplicationSlotsComputeRequiredLSN() after we've written
-         * the new restart_lsn above, so normally we should never need to loop
-         * more than twice.
-         */
-        XLByteToSeg(slot->data.restart_lsn, segno, wal_segment_size);
-        if (XLogGetLastRemovedSegno() < segno)
-            break;
-    }
+    /* Checkpoint shouldn't remove the required WAL. */
+    XLByteToSeg(slot->data.restart_lsn, segno, wal_segment_size);
+    if (XLogGetLastRemovedSegno() >= segno)
+        elog(ERROR, "WAL required by replication slot %s has been removed concurrently",
+             NameStr(slot->data.name));
+
+    LWLockRelease(ReplicationSlotAllocationLock);

     if (!RecoveryInProgress() && SlotIsLogical(slot))
     {
@@ -2094,6 +2096,12 @@ CheckPointReplicationSlots(bool is_shutdown)
      * acquiring a slot we cannot take the control lock - but that's OK,
      * because holding ReplicationSlotAllocationLock is strictly stronger, and
      * enough to guarantee that nobody can change the in_use bits on us.
+     *
+     * Additionally, acquiring the Allocation lock is necessary to serialize
+     * the slot flush process with concurrent slot WAL reservation. This
+     * ensures that the WAL position being reserved is either flushed to disk
+     * or is beyond or equal to the redo pointer of the current checkpoint
+     * (See ReplicationSlotReserveWal for details).
      */
     LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
