Commit 20a48c1

WAL-log inplace update before revealing it to other sessions.
A buffer lock won't stop a reader having already checked tuple
visibility.  If a vac_update_datfrozenxid() and then a crash happened
during inplace update of a relfrozenxid value, datfrozenxid could
overtake relfrozenxid.  That could lead to "could not access status of
transaction" errors.  Back-patch to v14 - v17.

This is a back-patch of commits:

- 8e7e672 (main change, on master, before v18 branched)
- 8180136 (defect fix, on master, before v18 branched)

It reverses commit bc6bad8, my revert of the original back-patch.  In
v14, this also back-patches the assertion removal from commit 7fcf2fa.

Discussion: https://postgr.es/m/20240620012908.92.nmisch@google.com

Backpatch-through: 14-17
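For orientation, here is a minimal, self-contained toy model of the race the
commit message describes.  It is not PostgreSQL source, and every name in it
is invented; it only shows why logging the inplace update before other
sessions can see the buffer keeps recovery from restoring datfrozenxid
without the relfrozenxid it was derived from.

    /* toy_ordering.c -- hypothetical illustration only, not PostgreSQL code */
    #include <stdio.h>
    #include <stdbool.h>

    static unsigned wal_rel, wal_dat;   /* values that reached WAL */
    static unsigned buf_rel;            /* shared-buffer value readers see */

    static void run(bool wal_first)
    {
        wal_rel = wal_dat = buf_rel = 100;

        /* R: inplace update raising relfrozenxid to 200 */
        if (wal_first)
            wal_rel = 200;      /* fixed order: XLogInsert() first ... */
        buf_rel = 200;          /* ... then memcpy() into the shared buffer */

        /* D: reads the visible buffer value, raises datfrozenxid, logs it */
        wal_dat = buf_rel;

        /* [crash]: in the buggy order, R's XLogInsert() never happened.
         * Recovery replays WAL only, so the unlogged buffer change is lost. */
        printf("recovered: relfrozenxid=%u datfrozenxid=%u%s\n",
               wal_rel, wal_dat,
               wal_dat > wal_rel ? "  <- datfrozenxid overtook relfrozenxid" : "");
    }

    int main(void)
    {
        run(false);             /* old order: corruption after crash */
        run(true);              /* new order: consistent after crash */
        return 0;
    }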
1 parent 05d605b

3 files changed, +59 −21 lines changed
src/backend/access/heap/README.tuplock

Lines changed: 1 addition & 3 deletions
@@ -198,9 +198,7 @@ Inplace updates create an exception to the rule that tuple data won't change
 under a reader holding a pin.  A reader of a heap_fetch() result tuple may
 witness a torn read.  Current inplace-updated fields are aligned and are no
 wider than four bytes, and current readers don't need consistency across
-fields.  Hence, they get by with just fetching each field once.  XXX such a
-caller may also read a value that has not reached WAL; see
-systable_inplace_update_finish().
+fields.  Hence, they get by with just fetching each field once.
 
 During logical decoding, caches reflect an inplace update no later than the
 next XLOG_XACT_INVALIDATIONS.  That record witnesses the end of a command.
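The surviving README text leans on aligned, at-most-4-byte fields being
readable with a single fetch.  As a hedged illustration (invented types, not
PostgreSQL source), a reader copies the field into a local exactly once, so
every later use sees one consistent value even if the tuple is being
inplace-updated concurrently:

    #include <stdint.h>

    typedef struct
    {
        uint32_t    relfrozenxid;   /* inplace-updated: aligned, 4 bytes */
    } FakeCatalogTuple;             /* hypothetical stand-in for a heap tuple */

    static int
    frozen_before(const volatile FakeCatalogTuple *tup, uint32_t limit)
    {
        uint32_t    xid = tup->relfrozenxid;    /* fetch the field once */

        /* all uses below see the same value; real xid comparison also
         * handles wraparound, which this toy ignores */
        return xid < limit;
    }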

src/backend/access/heap/heapam.c

Lines changed: 54 additions & 14 deletions
@@ -6397,13 +6397,18 @@ heap_inplace_update_and_unlock(Relation relation,
 	HeapTupleHeader htup = oldtup->t_data;
 	uint32		oldlen;
 	uint32		newlen;
+	char	   *dst;
+	char	   *src;
 
 	Assert(ItemPointerEquals(&oldtup->t_self, &tuple->t_self));
 	oldlen = oldtup->t_len - htup->t_hoff;
 	newlen = tuple->t_len - tuple->t_data->t_hoff;
 	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
+	dst = (char *) htup + htup->t_hoff;
+	src = (char *) tuple->t_data + tuple->t_data->t_hoff;
+
 	/*
 	 * Unlink relcache init files as needed.  If unlinking, acquire
 	 * RelCacheInitLock until after associated invalidations.  By doing this
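The hunk above hoists the dst/src pointer math out of the old memcpy() so the
WAL path added below can reuse it.  A standalone sketch of the arithmetic
follows (toy wrapper, not PostgreSQL source; the real patch copies only the
used parts of the page, up to pd_lower and from pd_upper onward, while this
simplified version copies the whole block):

    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ 8192             /* PostgreSQL's default block size */

    /* Build, in stack_copy, the image the block will have after the inplace
     * change; dst must point into block, src holds the new field bytes. */
    static void
    build_post_mutation_image(const char *block, const char *dst,
                              const char *src, size_t newlen, char *stack_copy)
    {
        uintptr_t   dst_off = (uintptr_t) (dst - block);

        memcpy(stack_copy, block, BLCKSZ);          /* snapshot the page */
        memcpy(stack_copy + dst_off, src, newlen);  /* apply the change */
        /* stack_copy now matches what the buffer will hold after
         * memcpy(dst, src, newlen), so it can back a full-page image
         * written before the buffer itself is mutated */
    }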
@@ -6414,15 +6419,15 @@ heap_inplace_update_and_unlock(Relation relation,
 	 */
 	PreInplace_Inval();
 
-	/* NO EREPORT(ERROR) from here till changes are logged */
-	START_CRIT_SECTION();
-
-	memcpy((char *) htup + htup->t_hoff,
-		   (char *) tuple->t_data + tuple->t_data->t_hoff,
-		   newlen);
-
 	/*----------
-	 * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
+	 * NO EREPORT(ERROR) from here till changes are complete
+	 *
+	 * Our buffer lock won't stop a reader having already pinned and checked
+	 * visibility for this tuple.  Hence, we write WAL first, then mutate the
+	 * buffer.  Like in MarkBufferDirtyHint() or RecordTransactionCommit(),
+	 * checkpoint delay makes that acceptable.  With the usual order of
+	 * changes, a crash after memcpy() and before XLogInsert() could allow
+	 * datfrozenxid to overtake relfrozenxid:
 	 *
 	 * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
 	 * ["R" is a VACUUM tbl]
@@ -6432,31 +6437,65 @@ heap_inplace_update_and_unlock(Relation relation,
 	 * D: raise pg_database.datfrozenxid, XLogInsert(), finish
 	 * [crash]
 	 * [recovery restores datfrozenxid w/o relfrozenxid]
-	 */
-
-	MarkBufferDirty(buffer);
+	 *
+	 * Mimic MarkBufferDirtyHint() subroutine XLogSaveBufferForHint().
+	 * Specifically, use DELAY_CHKPT_START, and copy the buffer to the stack.
+	 * The stack copy facilitates a FPI of the post-mutation block before we
+	 * accept other sessions seeing it.  DELAY_CHKPT_START allows us to
+	 * XLogInsert() before MarkBufferDirty().  Since XLogSaveBufferForHint()
+	 * can operate under BUFFER_LOCK_SHARED, it can't avoid DELAY_CHKPT_START.
+	 * This function, however, likely could avoid it with the following order
+	 * of operations: MarkBufferDirty(), XLogInsert(), memcpy().  Opt to use
+	 * DELAY_CHKPT_START here, too, as a way to have fewer distinct code
+	 * patterns to analyze.  Inplace update isn't so frequent that it should
+	 * pursue the small optimization of skipping DELAY_CHKPT_START.
+	 */
+	Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+	START_CRIT_SECTION();
+	MyProc->delayChkptFlags |= DELAY_CHKPT_START;
 
 	/* XLOG stuff */
 	if (RelationNeedsWAL(relation))
 	{
 		xl_heap_inplace xlrec;
+		PGAlignedBlock copied_buffer;
+		char	   *origdata = (char *) BufferGetBlock(buffer);
+		Page		page = BufferGetPage(buffer);
+		uint16		lower = ((PageHeader) page)->pd_lower;
+		uint16		upper = ((PageHeader) page)->pd_upper;
+		uintptr_t	dst_offset_in_block;
+		RelFileNode rnode;
+		ForkNumber	forkno;
+		BlockNumber blkno;
 		XLogRecPtr	recptr;
 
 		xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);
 
-		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-		XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);
+		/* register block matching what buffer will look like after changes */
+		memcpy(copied_buffer.data, origdata, lower);
+		memcpy(copied_buffer.data + upper, origdata + upper, BLCKSZ - upper);
+		dst_offset_in_block = dst - origdata;
+		memcpy(copied_buffer.data + dst_offset_in_block, src, newlen);
+		BufferGetTag(buffer, &rnode, &forkno, &blkno);
+		Assert(forkno == MAIN_FORKNUM);
+		XLogRegisterBlock(0, &rnode, forkno, blkno, copied_buffer.data,
+						  REGBUF_STANDARD);
+		XLogRegisterBufData(0, src, newlen);
 
 		/* inplace updates aren't decoded atm, don't log the origin */
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
 
-		PageSetLSN(BufferGetPage(buffer), recptr);
+		PageSetLSN(page, recptr);
 	}
 
+	memcpy(dst, src, newlen);
+
+	MarkBufferDirty(buffer);
+
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 	/*
@@ -6465,6 +6504,7 @@ heap_inplace_update_and_unlock(Relation relation,
 	 */
 	AtInplace_Inval();
 
+	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
 	END_CRIT_SECTION();
 	UnlockTuple(relation, &tuple->t_self, InplaceUpdateTupleLock);
 
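Reading the hunks together, the resulting control flow in
heap_inplace_update_and_unlock() is roughly the following (a condensed
restatement of the diff above, not compilable on its own; error handling and
invalidation details are omitted):

    Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
    START_CRIT_SECTION();
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;   /* hold back redo point */

    if (RelationNeedsWAL(relation))
    {
        /* stack copy = current page with the inplace change applied ... */
        /* ... registered via XLogRegisterBlock() for a possible FPI */
        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
        PageSetLSN(page, recptr);
    }

    memcpy(dst, src, newlen);   /* only now can other sessions see new data */
    MarkBufferDirty(buffer);
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    AtInplace_Inval();
    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    END_CRIT_SECTION();
    UnlockTuple(relation, &tuple->t_self, InplaceUpdateTupleLock);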
src/include/storage/proc.h

Lines changed: 4 additions & 4 deletions
@@ -107,10 +107,10 @@ struct XidCache
  * is inserted prior to the new redo point, the corresponding data changes will
  * also be flushed to disk before the checkpoint can complete.  (In the
  * extremely common case where the data being modified is in shared buffers
- * and we acquire an exclusive content lock on the relevant buffers before
- * writing WAL, this mechanism is not needed, because phase 2 will block
- * until we release the content lock and then flush the modified data to
- * disk.)
+ * and we acquire an exclusive content lock and MarkBufferDirty() on the
+ * relevant buffers before writing WAL, this mechanism is not needed, because
+ * phase 2 will block until we release the content lock and then flush the
+ * modified data to disk.  See transam/README and SyncOneBuffer().)
  *
  * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
  * to phase 3.  This is useful if we are performing a WAL-logged operation that
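To see why DELAY_CHKPT_START makes the WAL-before-MarkBufferDirty() order
safe, a toy model of the barrier may help (invented names and a busy-wait
standing in for PostgreSQL's procarray machinery; not actual source):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_BACKENDS 64     /* arbitrary toy limit */

    static atomic_bool delay_chkpt_start[MAX_BACKENDS];

    /* Backend side: the pattern heap_inplace_update_and_unlock() follows. */
    static void
    backend_logged_mutation(int me)
    {
        atomic_store(&delay_chkpt_start[me], true);
        /* XLogInsert() here: a checkpoint can't establish a redo point
         * after this record yet before our buffer change is flushed */
        /* memcpy() + MarkBufferDirty() here */
        atomic_store(&delay_chkpt_start[me], false);
    }

    /* Checkpointer side: after choosing a candidate redo point, wait out
     * every backend still inside the flagged window, so their dirty
     * buffers are included in this checkpoint's flush. */
    static void
    checkpointer_wait_for_delayed(void)
    {
        for (int i = 0; i < MAX_BACKENDS; i++)
            while (atomic_load(&delay_chkpt_start[i]))
                ;               /* real code sleeps and retries */
    }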
