MVStore: Parallel batch read/write #4331
Move PageSerializationManager from FileStore inner class to a standalone
class in org.h2.mvstore. Core serialization state (WriteBuffer, ToC, page
numbering, position encoding, checksum patching) is self-contained.
FileStore side-effects (cachePage, accountForRemovedPage,
accountForWrittenPage, cacheToC, countNewPage) are now injected through a
Callback interface, wired by the new factory method
FileStore.createPageSerializationManager().
Pure structural extraction -- no behavior change. Enables future parallel
serialization by allowing independent PSM instances per worker thread.
Files changed:
NEW PageSerializationManager.java -- extracted class + Callback interface
MOD FileStore.java -- inner class removed, factory method added
MOD Page.java -- import path, write() signature unqualified
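The extraction pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual H2 code: the class name `SerializerSketch`, the reduced two-method `Callback`, and the position encoding are all assumptions made for the example.

```java
// Hypothetical, simplified stand-in for the extracted serialization manager.
// The real class carries more state (WriteBuffer, ToC, checksum patching).
final class SerializerSketch {

    // Side effects that previously reached into the enclosing store directly.
    // Only two of the five callbacks named in the commit are modelled here.
    interface Callback {
        void cachePage(long pagePos, byte[] data);
        void accountForWrittenPage(int length);
    }

    private final int chunkId;
    private final Callback callback;
    private int offset;

    SerializerSketch(int chunkId, Callback callback) {
        this.chunkId = chunkId;
        this.callback = callback;
    }

    // "Serializes" a page: assigns a position, then reports the side effects
    // through the injected callback instead of touching the store.
    long write(byte[] data) {
        long pos = ((long) chunkId << 32) | offset; // illustrative encoding only
        offset += data.length;
        callback.cachePage(pos, data);
        callback.accountForWrittenPage(data.length);
        return pos;
    }
}
```

Because the serializer only talks to the interface, each worker thread could in principle own its own instance, which is the property the later parallel steps rely on.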
Buffer a saved chunk's on-disk content into memory only after a second
page is read from that chunk, not on first access. This avoids wasting
I/O and memory on scattered single-page lookups (e.g. secondary index
probes) while still accelerating sequential/range access patterns where
multiple pages from the same chunk are read.
Mechanism:
- First page read from a chunk: sets volatile hint flag, does normal
per-page I/O (no regression vs. baseline)
- Second page read: reads the entire chunk (if within threshold) into
Chunk.readBuffer; all subsequent reads slice from memory
- resolveChunkBuffer() checks readBuffer -> buffer -> hint-gated
full-chunk read
New fields on Chunk:
- volatile ByteBuffer readBuffer -- cached on-disk content
- volatile boolean readBufferHint -- set on first read, triggers
buffering on second read
- invalidateReadBuffer() -- clears both on block relocation
Chunk.readBufferForPage() restructured:
- New resolveChunkBuffer() implements the two-hit buffering policy
- PAGE_LARGE length pre-read satisfied from cached buffer when
available
- All page slicing unified through the resolved chunk buffer
Chunk.readToC() also checks readBuffer before per-region I/O.
FileStore gains a configurable threshold:
- chunkReadCacheMaxBytes field (default 4 MB, 0 to disable)
- getChunkReadCacheMaxBytes() / setChunkReadCacheMaxBytes(int)
FileStore.readPage() is unchanged -- buffering is fully transparent.
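A minimal sketch of the two-hit policy, assuming a byte array stands in for the chunk's on-disk region. `TwoHitChunkSketch`, its counters, and the `readPage(offset, length)` signature are hypothetical; only the field names `readBuffer` and `readBufferHint` and the "0 disables" threshold come from the description above.

```java
import java.nio.ByteBuffer;

// Simplified model of the two-hit buffering policy; not the actual Chunk class.
final class TwoHitChunkSketch {
    private final byte[] onDisk;   // stands in for the chunk's file region
    private final int maxBytes;    // threshold; 0 disables buffering
    private volatile ByteBuffer readBuffer;
    private volatile boolean readBufferHint;
    int fullChunkReads;            // counters exist only for illustration
    int perPageReads;

    TwoHitChunkSketch(byte[] onDisk, int maxBytes) {
        this.onDisk = onDisk;
        this.maxBytes = maxBytes;
    }

    ByteBuffer readPage(int offset, int length) {
        ByteBuffer cached = readBuffer;
        if (cached == null && readBufferHint && onDisk.length <= maxBytes) {
            // Second hit on this chunk: read the whole chunk once.
            fullChunkReads++;
            cached = ByteBuffer.wrap(onDisk.clone()).asReadOnlyBuffer();
            readBuffer = cached;
        }
        if (cached != null) {
            // Serve the page as a slice of the in-memory copy.
            ByteBuffer view = cached.duplicate();
            view.position(offset);
            view.limit(offset + length);
            return view.slice();
        }
        // First hit (or buffering disabled): set the hint, do per-page I/O.
        readBufferHint = true;
        perPageReads++;
        return ByteBuffer.wrap(onDisk, offset, length).slice();
    }

    // Clears both fields, e.g. when the chunk's blocks are relocated.
    void invalidateReadBuffer() {
        readBuffer = null;
        readBufferHint = false;
    }
}
```

With a threshold of 0 the size check never passes for a non-empty chunk, so every read stays on the per-page path.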
Files changed:
MOD Chunk.java -- readBuffer/readBufferHint fields,
invalidateReadBuffer(), resolveChunkBuffer(), modified
readBufferForPage() and readToC()
MOD FileStore.java -- chunkReadCacheMaxBytes field + getter/setter
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
When a cursor descends from a NonLeaf to a new leaf page, submit a
best-effort background prefetch for the next sibling child page. This
overlaps the I/O for leaf N+1 with processing of leaf N, halving
effective I/O latency for sequential range scans.
The prefetch uses getChildPagePos() to obtain the sibling's position
without loading it, then submits to ForkJoinPool.commonPool(). If the
page is already cached, the prefetch is a no-op. If the background read
fails, it is silently ignored -- the page will be demand-loaded normally.
Works in both forward and reverse cursor directions.
Cursor.hasNext():
- Track whether we actually descended through NonLeaf nodes
- After descent, call prefetchNextSibling() with the parent's position
- prefetchNextSibling() computes next/prev sibling index from the
parent's children[] array and calls MVMap.prefetchPage()
New plumbing (thin wrappers at each layer):
- MVMap.prefetchPage(pos) -> MVStore.prefetchPage(map, pos)
-> FileStore.prefetchPage(map, pos)
- FileStore.prefetchPage() checks cache, skips unsaved pages, submits
readPage() to ForkJoinPool.commonPool()
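The best-effort behaviour of such a prefetch can be sketched as below. This is an assumption-laden illustration: `PrefetchSketch`, the `ConcurrentHashMap` cache, the `LongFunction` reader, and the convention that position 0 means "unsaved" are all hypothetical stand-ins for the real page cache and `readPage()`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.function.LongFunction;

// Hypothetical model of a best-effort background prefetch.
final class PrefetchSketch {
    private final Map<Long, byte[]> cache = new ConcurrentHashMap<>();
    private final LongFunction<byte[]> reader; // stands in for readPage()

    PrefetchSketch(LongFunction<byte[]> reader) {
        this.reader = reader;
    }

    // Returns the submitted task, or null when the prefetch is skipped.
    ForkJoinTask<?> prefetchPage(long pos) {
        if (pos == 0 || cache.containsKey(pos)) {
            return null; // unsaved page (assumed encoding) or already cached
        }
        return ForkJoinPool.commonPool().submit(() -> {
            try {
                cache.computeIfAbsent(pos, reader::apply);
            } catch (RuntimeException e) {
                // Best effort: a failed prefetch is ignored; the page will be
                // demand-loaded by the foreground reader later.
            }
        });
    }

    boolean isCached(long pos) {
        return cache.containsKey(pos);
    }
}
```

Returning the task is a choice made here so that a caller or test can wait for completion; a fire-and-forget variant would simply discard it.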
Files changed:
MOD Cursor.java -- descended flag, prefetchNextSibling() method
MOD MVMap.java -- prefetchPage(pos) wrapper
MOD MVStore.java -- prefetchPage(map, pos) wrapper
MOD FileStore.java -- prefetchPage() impl + ForkJoinPool import
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Replace the single-sibling prefetchNextSibling() with prefetchAhead()
which submits up to PREFETCH_WINDOW (4) upcoming sibling child pages
when the cursor crosses a leaf boundary. Prefetch is directional --
only siblings ahead in the scan direction are submitted.
This is a cursor-scoped alternative to the global NonLeaf readPage()
hook approach, which regressed point lookups by prefetching children
indiscriminately on every NonLeaf load. By keeping prefetch in the
cursor, only scan workloads pay the cost, and the scan direction
constrains which children are prefetched.
Cursor.hasNext():
- After descent to a new leaf, call prefetchAhead() with the
parent's position and scan direction
- prefetchAhead() iterates the parent's children[] in scan
direction, submitting up to PREFETCH_WINDOW pages via
MVMap.prefetchPage()
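The directional window selection can be illustrated with a small helper. This is a sketch under stated assumptions: `siblingsAhead` and its parameters are hypothetical, the parent's children are reduced to an array of positions, and a position of 0 is assumed to mean "not yet saved".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing a directional prefetch window.
final class PrefetchWindowSketch {
    static final int PREFETCH_WINDOW = 4;

    // Collects up to PREFETCH_WINDOW saved sibling positions that lie ahead
    // of currentIndex in the scan direction.
    static List<Long> siblingsAhead(long[] childPositions, int currentIndex,
                                    boolean reverse) {
        List<Long> result = new ArrayList<>();
        int step = reverse ? -1 : 1;
        for (int i = currentIndex + step;
                i >= 0 && i < childPositions.length
                        && result.size() < PREFETCH_WINDOW;
                i += step) {
            long pos = childPositions[i];
            if (pos != 0) { // skip unsaved children (assumed encoding)
                result.add(pos);
            }
        }
        return result;
    }
}
```

Siblings behind the cursor are never selected, which is what keeps point lookups from paying for the prefetch.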
FileStore.java: reverts to Step 3 state (no readPage hook, no
IN_PREFETCH ThreadLocal, no nonLeafPrefetchWindow field, no
prefetchChildren method).
Files changed:
MOD Cursor.java -- PREFETCH_WINDOW constant, prefetchAhead()
replaces prefetchNextSibling()
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Add SerializedPageRecord to PageSerializationManager that captures
buffer-level layout information for each serialized page: buffer offset,
page length, type, composed pagePos, and ToC element. The Page reference
is attached in onPageSerialized().
This is pure bookkeeping -- no behavior change. The record list provides
the data needed by the upcoming rebasePositions() method (Step 5b) to
adjust all page positions when merging per-map local buffers into a
global buffer at a different base offset.
Changes to getPagePosition():
- Appends a new SerializedPageRecord after computing the position
Changes to onPageSerialized():
- Sets the Page reference on the most recent record
New types:
- SerializedPageRecord (public static final inner class)
New methods:
- getSerializedPages() -- returns the record list
Files changed:
MOD PageSerializationManager.java
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Add rebasePositions(int baseOffset) to PageSerializationManager
that adjusts all page positions when a local serialization buffer
is merged into a global chunk buffer at a non-zero offset.
The method performs two passes:
Pass 1 — for each serialized page:
- Recomposes the ToC element with offset + baseOffset
- Recomposes pagePos from the rebased ToC element
- Updates the ToC list entry
- Patches the check value in the buffer (check incorporates offset)
- CAS-updates Page.pos via new Page.rebasePos()
Pass 2 — for each NonLeaf page:
- Parses the page header to locate the child-pointer region
- Replaces child pointer longs that reference pages within this
buffer (looked up via old->new position map)
- Leaves child pointers to pages from other chunks untouched
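The second pass can be sketched as a lookup-and-replace over a run of 8-byte child pointers. The helper below is hypothetical: it assumes the child-pointer region has already been located (start offset and count) and that pointers are stored as consecutive longs, which simplifies away the header parsing described above.

```java
import java.nio.ByteBuffer;
import java.util.Map;

// Hypothetical sketch of patching child pointers after a rebase.
final class ChildPointerPatchSketch {

    // Rewrites `count` consecutive long child pointers starting at `start`.
    // Pointers absent from the map (pages in other chunks) stay untouched.
    // Returns the number of pointers that were replaced.
    static int patch(ByteBuffer buf, int start, int count,
                     Map<Long, Long> oldToNew) {
        int patched = 0;
        for (int i = 0; i < count; i++) {
            int at = start + i * Long.BYTES;
            Long replacement = oldToNew.get(buf.getLong(at));
            if (replacement != null) {
                buf.putLong(at, replacement);
                patched++;
            }
        }
        return patched;
    }
}
```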
SerializedPageRecord gains a mapId field (needed to recompose
ToC elements with the rebased offset).
Page.java gains package-private rebasePos(long expected, long new)
that atomically updates Page.pos via posUpdater CAS, throwing
MVStoreException on mismatch (indicates a rebase logic bug).
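A rough sketch of the offset rebase and the CAS update follows. The bit layout used here (chunk id, 32-bit offset, then 6 low bits for length code and type) is an assumption chosen for illustration, and `IllegalStateException` stands in for `MVStoreException`; `RebaseSketch` and `PageSketch` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Hypothetical sketch of rebasing a page position and CAS-updating it.
final class RebaseSketch {
    private static final long OFFSET_MASK = 0xFFFFFFFFL << 6;

    // Assumed layout: [chunkId][offset:32][lengthCode:5][type:1]
    static long compose(int chunkId, int offset, int lengthCode, int type) {
        return ((long) chunkId << 38) | ((long) offset << 6)
                | ((long) lengthCode << 1) | type;
    }

    static int offsetOf(long pos) {
        return (int) (pos >>> 6);
    }

    // Shifts only the offset field by baseOffset; other fields are preserved.
    // Overflow of the 32-bit offset is not handled in this sketch.
    static long rebase(long pos, int baseOffset) {
        long offset = (pos >>> 6) & 0xFFFFFFFFL;
        return (pos & ~OFFSET_MASK) | ((offset + baseOffset) << 6);
    }

    static final class PageSketch {
        private static final AtomicLongFieldUpdater<PageSketch> POS =
                AtomicLongFieldUpdater.newUpdater(PageSketch.class, "pos");
        volatile long pos;

        PageSketch(long pos) {
            this.pos = pos;
        }

        // Fails loudly on mismatch, since that would indicate a rebase bug.
        void rebasePos(long expected, long updated) {
            if (!POS.compareAndSet(this, expected, updated)) {
                throw new IllegalStateException(
                        "unexpected page position " + pos);
            }
        }
    }
}
```

Using a compare-and-set rather than a plain write means a page whose position changed unexpectedly is detected instead of being silently overwritten.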
Files changed:
MOD PageSerializationManager.java -- mapId in record,
rebasePositions(), rebaseChildPointers()
MOD Page.java -- rebasePos() method, import/signature fixes
for top-level PSM
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Refactor serializeToBuffer() so that each changed map tree is
serialized through its own PageSerializationManager (with sequential
pageNoBase to avoid page-number collisions), while all PSMs still
write into the shared global WriteBuffer. This means page positions
are correct from the start — no local buffers and no rebase are
needed.
The layout map continues to get its own PSM (Phase 2), and
writeMergedToC() concatenates all per-map ToC entries followed
by the layout ToC into a single cached tocArray.
This is the correctness gate: execution is still single-threaded
and the buffer layout is identical to the original code, so any
regression indicates a bug in the PSM-per-map split. Local buffers
and rebase (needed for Step 5d parallelism) will be introduced as
a separate validated step.
New types:
FileStore.MapSerializationResult -- holds per-map PSM + root info
New/changed methods:
FileStore.writeMergedToC() -- merges ToC from all PSMs
FileStore.createPageSerializationManager(chunk, buff, pageNoBase)
PSM constructor with pageNoBase, PSM.getPageCount()
Files changed:
MOD FileStore.java -- serializeToBuffer rewritten,
MapSerializationResult, writeMergedToC
MOD PageSerializationManager.java -- pageNoBase field and
constructor, getPageCount()
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Serialize each MVMap's B-tree into its own local WriteBuffer and
PageSerializationManager, then copy into the global chunk buffer
at the map's assigned base offset. A three-pass rebase corrects
all position references from local to global coordinates:
Pass 1 – Patch page checksums, CAS-update each Page.pos field,
and rewrite ToC entries in the global buffer.
Pass 2 – Fix NonLeaf child-pointer slots in the on-disk buffer
so they reference rebased positions.
Pass 3 – Sync in-memory PageReference.pos values via
syncChildRefsAfterRebase(), preventing stale local
offsets from being used when cached child pages are
evicted and later re-read from disk.
Pass 3 fixes a subtle corruption bug: after rebase the on-disk
data was correct, but NonLeaf.children[].pos still held the
pre-rebase local offset. If memory pressure evicted a child page
(nulling the strong reference), getChildPage() would read at the
old local offset—landing in the chunk header/layout area and
producing "expected page length 4..384, got 1869575226" errors
(the garbage value decodes to ASCII "olg*" from layout strings).
The layout map is serialised last, directly into the global buffer
at the pre-reserved slot. A merged table-of-contents is built
from all per-map PSMs.
Serialization is still sequential in this step; the per-map split
establishes the precondition for parallel serialization in step 5d.
Files changed:
FileStore.java – phase 2 copy-then-rebase loop
PageSerializationManager.java – rebasePositions(offset, buffer)
with explicit ByteBuffer param;
pass 3 child-ref sync
Page.java – syncChildRefsAfterRebase() on
NonLeaf; no-op on Leaf
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Parallelize the B-tree serialization of independent MVMap trees
during chunk writes. Each map's writeUnsavedRecursive() now runs
concurrently via the common ForkJoinPool when two or more maps
have unsaved pages; single-map commits remain sequential.
The key enablers:
countUnsavedPages() — new abstract method on Page, implemented
in Leaf (trivial), NonLeaf (recursive), and IncompleteNonLeaf
(recursive, skips self when incomplete). Called sequentially in
Phase 1a to pre-compute per-map page counts, which are summed
into cumulative pageNoBase values so that each parallel worker
writes pages with globally-unique page numbers.
Deferred callbacks — PageSerializationManager gains a
deferCallbacks flag and applyDeferredCallbacks() method. When
deferred, onPageSerialized() records side-effect data (cache
insert, chunk accounting, removed-page tracking) in the
SerializedPageRecord instead of firing callbacks immediately.
Callbacks are replayed sequentially in Phase 2 after rebase,
ensuring:
(a) Thread safety — cachePage(), accountForWrittenPage(), and
accountForRemovedPage() are not thread-safe.
(b) Correct cache keys — pages are cached with their final
global positions rather than pre-rebase local offsets.
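The deferral idea can be sketched as follows. This is a simplified model: `DeferredCallbackSketch` is hypothetical, a single `LongConsumer` stands in for the store's cache and accounting callbacks, and the rebase is reduced to adding a base offset to a local position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

// Hypothetical sketch of deferring side effects until after the rebase.
final class DeferredCallbackSketch {
    private final boolean deferCallbacks;
    private final LongConsumer cachePage; // stands in for non-thread-safe callbacks
    private final List<Long> deferred = new ArrayList<>();

    DeferredCallbackSketch(boolean deferCallbacks, LongConsumer cachePage) {
        this.deferCallbacks = deferCallbacks;
        this.cachePage = cachePage;
    }

    // Called by the (possibly parallel) serializer for each page.
    void onPageSerialized(long localPos) {
        if (deferCallbacks) {
            deferred.add(localPos);      // record only; no shared state touched
        } else {
            cachePage.accept(localPos);  // original immediate behaviour
        }
    }

    // Called sequentially after the rebase, so the callback sees final positions.
    void applyDeferredCallbacks(long baseOffset) {
        for (long localPos : deferred) {
            cachePage.accept(localPos + baseOffset);
        }
        deferred.clear();
    }
}
```

Each worker owns its own instance and its own list, so no synchronization is needed until the sequential replay.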
Serialization phases are now:
Phase 1a — Sequential: partition changed maps, count unsaved
pages, compute cumulative pageNoBase values.
Phase 1b — Parallel: each map serializes into its own
WriteBuffer + deferred PSM.
Phase 2 — Sequential: copy local buffers → global, rebase
positions, replay deferred callbacks, record roots.
Phase 3 — Sequential: layout map serialization.
Phase 4 — Sequential: merged ToC from all PSMs.
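The phase structure can be modelled in miniature. In this sketch, which is an assumption rather than the actual FileStore code, "serializing" a map just means producing its page numbers into a local array; the point is the sequential prefix sum, the parallel per-map work into private buffers, and the sequential merge.

```java
import java.util.stream.IntStream;

// Hypothetical miniature of the Phase 1a / 1b / 2 structure.
final class ParallelSerializeSketch {

    // Phase 1a (sequential): cumulative page-number bases from per-map counts.
    static int[] pageNoBases(int[] pageCounts) {
        int[] bases = new int[pageCounts.length];
        int sum = 0;
        for (int i = 0; i < pageCounts.length; i++) {
            bases[i] = sum;
            sum += pageCounts[i];
        }
        return bases;
    }

    // Phase 1b: each map fills its own local buffer; parallel only when
    // two or more maps are involved.
    static int[][] serializeAll(int[] pageCounts) {
        int[] bases = pageNoBases(pageCounts);
        int[][] local = new int[pageCounts.length][];
        IntStream maps = IntStream.range(0, pageCounts.length);
        if (pageCounts.length >= 2) {
            maps = maps.parallel();
        }
        maps.forEach(m -> {
            int[] pages = new int[pageCounts[m]];
            for (int p = 0; p < pages.length; p++) {
                pages[p] = bases[m] + p; // globally unique page number
            }
            local[m] = pages;            // each worker writes a distinct slot
        });
        return local;
    }

    // Phase 2 (sequential): copy local buffers into one global buffer.
    static int[] merge(int[][] local) {
        int total = 0;
        for (int[] a : local) {
            total += a.length;
        }
        int[] global = new int[total];
        int at = 0;
        for (int[] a : local) {
            System.arraycopy(a, 0, global, at, a.length);
            at += a.length;
        }
        return global;
    }
}
```

Because the bases are fixed before any worker starts, the merged result is the same regardless of the order in which workers finish.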
Files changed:
Page.java – countUnsavedPages() abstract +
Leaf/NonLeaf/IncompleteNonLeaf
PageSerializationManager.java – deferCallbacks flag, deferred
data fields on SerializedPageRecord,
applyDeferredCallbacks()
FileStore.java – Phase 1a/1b split with
IntStream.parallel(), deferred
PSM factory, applyDeferredCallbacks
call in Phase 2
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Hi @manticore-projects, great job!
Thank you for your warm feedback. I have made the benchmark repository public: https://github.com/manticore-projects/H2Benchmark
Then we can improve the benchmarks together. Especially I would like to ask for some help or guidance on the |
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
I have rebased everything onto the latest H2 origin/master and also back-ported it to JDK 11.
I have had a close look at the CI tests:
The good news is: all tests succeed with this PR and there is no corruption anywhere, not even in my large DBs!
…ache

Implement the three improvements suggested during code review:

## Cursor prefetch with B-tree upper-bound analysis (Cursor.java)

Add `maybePrefetchSiblings()`, called on every internal-node descent. Rather than a fixed sliding window, it uses the B-tree's own separator keys to determine which sibling subtrees actually fall within the cursor's `to` bound, stopping as soon as a separator key crosses the boundary. This avoids issuing any I/O for subtrees outside the requested range, making short-range scans pay only for what they need.

Two short-circuit guards eliminate overhead for warm workloads:
- If the store has no backing file, return immediately (pure in-memory).
- If the first candidate sibling is already in the page cache, skip ForkJoin submission entirely: the working set is hot and no I/O is needed. This reduces the cost of repeated cursor iteration over cached data to a single cache probe per internal node.

A `PREFETCH_MIN_SIBLINGS = 2` floor suppresses prefetch at the very tail of a subtree where scheduling overhead would outweigh any benefit.

## Disable chunk read cache by default (FileStore.java)

Change `chunkReadCacheMaxBytes` default from 4 MB to 0. The chunk buffer heuristic buffers entire 1–4 MB raw chunk blobs on a second page miss from the same chunk. In practice, chunks are large collections of randomly-scattered pages: a cursor traversal rarely hits the same chunk twice, so the second-hit rate is low. The 4 MB is better spent on the page cache, which is keyed by page position and reuses actually-hot decoded Page objects across all access patterns. Callers that explicitly set `setChunkReadCacheMaxBytes()` are unaffected.

## Add `isPageCached()` helper (FileStore.java)

Package-private probe into `CacheLongKeyLIRS` without triggering a load. Used by the warm-cache short-circuit above.

## Document the memory tradeoff (MVStore.java)

Extend the `Builder.cacheSize()` Javadoc to explain why the chunk read cache was disabled and where that memory should go instead.

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
[Benchmark tables omitted: read throughput and mixed workload throughput (ops/s, higher = better); write / bulk, scan, compaction, and join latency (ms, lower = better)]
JUring wins with 10 green vs 7 red, with the wins concentrated exactly where they matter most: mixed read workloads (+6.1%/+6.8%) and scans. NIO is positive on reads and scans but still dragged down by write paths. The biggest problem right now is the volatility of the WRITE tests.
Add background page prefetching to Cursor, triggered during B-tree
descent for bounded range scans. Four iterations of benchmarking shaped
the final design; the key constraints discovered empirically are
documented below.
## What was added
### Cursor.maybePrefetchSiblings() (Cursor.java)
Called once per internal-node level during hasNext() descent. It collects
the sibling child positions that the cursor will visit after the current
subtree and submits them for background I/O.
Three early-return guards and a depth limit control when prefetch is suppressed:
if (to == null) return;
The most important guard. Unbounded full-table scans saturate I/O on
their own; adding background reads doubles traffic and causes the
foreground thread to race its own prefetch tasks. Discovered after
coldCacheScan regressed +92% without it.
if (sibEnd - sibStart < PREFETCH_MIN_SIBLINGS) return; // = 8
Suppresses prefetch at the tail of nearly-exhausted subtrees where
scheduling overhead exceeds latency hidden. Threshold of 8 chosen
empirically; below this the ForkJoin submission cost dominates.
if (fs.isPageCached(firstPos)) return;
Warm-path short-circuit: if the first candidate sibling is already
in the page cache the working set is hot; skip entirely to avoid
unnecessary ForkJoin overhead on repeated warm iterations.
descentDepth < MAX_PREFETCH_DEPTH // = 1
Limits prefetch to the top two levels of each descent. Without this,
prefetch fires at every internal node on every descent, re-queuing
the same pages redundantly and causing foreground/background racing.
Upper-bound analysis uses the B-tree's own separator keys to skip
siblings whose entire key range lies outside [from, to], so short
range scans pay only for what they need.
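The upper-bound analysis can be sketched for a forward scan as below. This is an illustration under simplifying assumptions: keys are plain ints, child `i` (for `i >= 1`) is assumed to hold keys greater than or equal to `separatorKeys[i - 1]`, and the method name `siblingsWithinBound` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of separator-key based sibling selection (forward scan).
final class BoundedPrefetchSketch {

    // separatorKeys has one entry fewer than childPositions. A sibling is
    // only worth prefetching if its smallest possible key does not exceed
    // the scan's upper bound `to`.
    static List<Long> siblingsWithinBound(int[] separatorKeys,
                                          long[] childPositions,
                                          int currentIndex, int to) {
        List<Long> result = new ArrayList<>();
        for (int i = currentIndex + 1; i < childPositions.length; i++) {
            if (separatorKeys[i - 1] > to) {
                break; // this subtree and all later ones lie beyond the bound
            }
            result.add(childPositions[i]);
        }
        return result;
    }
}
```

Since separator keys are sorted, the loop can stop at the first sibling that starts past the bound, so a short range scan submits few or no prefetches.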
### FileStore changes (FileStore.java)
isPageCached(): package-private cache probe without triggering a load,
used by the warm-path short-circuit above.
prefetchPages(): redesigned submission path:
- NIO: one ForkJoin task reads all positions sequentially. The
previous per-page task submission (N round-trips) dominated cost
for short prefetch lists and caused measurable write-path
regression.
- io_uring: one ForkJoin task submits in windows of MAX_PREFETCH_BATCH
(= 32) SQEs. Windowing prevents ring saturation under concurrent
read workloads; demand reads can interleave between windows.
chunkReadCacheMaxBytes default: 4 MB → 0. Chunks are large collections
of randomly-scattered pages; the hit rate is low and the memory is
better spent on the page cache.
## Benchmark summary (baseline NIO vs final, single-threaded)
randomCacheMissReads         +13.1% NIO    +9.7% JUring  ✓
compositeIndexRangeScan       +5.1% NIO   -16.2% JUring  (JUring open)
secondaryIndexLookup          +2.9% NIO    -6.9% JUring
mixed (read-heavy)            +0.9% NIO    +6.8% JUring  ✓
coldCacheScan                 -4.9% NIO    -2.6% JUring  ✓ (was +92%)
indexCreationAndCompaction    -3.1% NIO    -1.6% JUring  ✓ (was +25%)
joinAggregateGroupBy          -9.8% NIO    -2.1% JUring  ✓
bulkInsert                   +17.6% NIO    -0.2% JUring  (NIO write open)
testUpdate                    -8.9% NIO    -1.0% JUring  (write contention)
JUring benefits most from prefetch due to true kernel-level batching.
NIO write-path regressions (bulkInsert, testUpdate) reflect background
task I/O contention during write-heavy operations; further work needed.
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
|
@andreitokar Please do you have any idea how to "stabilize" the The first and second iteration render this benchmark useless. |
…iority gate
Prior to this change, prefetchPages() submitted background read tasks to
ForkJoinPool.commonPool(). The write-path parallel serialization
(serializeToBuffer Phase 1) also runs on commonPool via IntStream.parallel().
Under write-heavy workloads, prefetch I/O tasks occupied common-pool threads
and starved write serialization, causing significant regressions:
bulkInsert +18% slower (ms/op)
highChurnCompaction +30% slower (ms/op)
benchmarkDBCreation +38% slower (ms/op)
mixed:testUpdate -9% throughput loss
Changes:
1. Dedicated prefetchExecutor (FileStore)
Replace all ForkJoinPool.commonPool().execute() calls in prefetchPage()
and prefetchPages() with a single-threaded ThreadPoolExecutor:
- Thread.MIN_PRIORITY daemon thread ("H2-prefetch")
- Bounded ArrayBlockingQueue(4) + DiscardOldestPolicy: stale prefetch
requests are silently dropped under write pressure rather than queuing
- Shut down in close() before layout map is closed
2. Write-priority gate (FileStore.isWritePipelineBusy)
New isWritePipelineBusy() checks serializationExecutor and
bufferSaveExecutor queue depths. If either has pending work, prefetchPages()
returns immediately, ceding full I/O bandwidth to the write path.
3. Mid-task abort
NIO sequential prefetch loop and io_uring batch windows both re-check
isWritePipelineBusy() between pages/windows and abort early if writes arrive
after the task has already started.
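The executor configuration and the abort check can be sketched with standard `java.util.concurrent` classes. The factory method `newPrefetchExecutor`, the `prefetchAll` loop, and the `BooleanSupplier` standing in for `isWritePipelineBusy()` are hypothetical; the pool size, queue capacity, rejection policy, thread name, and priority follow the description above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Hypothetical sketch of a dedicated low-priority prefetch executor.
final class PrefetchExecutorSketch {

    // One daemon thread at minimum priority, a queue of four requests, and
    // oldest-first discarding when the queue is full.
    static ThreadPoolExecutor newPrefetchExecutor() {
        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(4),
                r -> {
                    Thread t = new Thread(r, "H2-prefetch");
                    t.setDaemon(true);
                    t.setPriority(Thread.MIN_PRIORITY);
                    return t;
                },
                new ThreadPoolExecutor.DiscardOldestPolicy());
    }

    // Sequential prefetch loop that re-checks the write gate before each page
    // and stops early once writes are pending. Returns the pages read.
    static int prefetchAll(long[] positions, BooleanSupplier writePipelineBusy,
                           LongConsumer readPage) {
        int read = 0;
        for (long pos : positions) {
            if (writePipelineBusy.getAsBoolean()) {
                break; // cede I/O bandwidth to the write path
            }
            readPage.accept(pos);
            read++;
        }
        return read;
    }
}
```

Discarding the oldest request fits prefetching because older requests are the most likely to be stale by the time the worker reaches them.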
Results vs baseline (JMH 1.37, JDK 25, single-threaded, G1GC):
indexCreationAndCompaction +25% (NIO), +15% (juring) 🟢
coldCacheScan +7% (NIO) ✅
compositeIndexRangeScan +6% (NIO) ✅
randomCacheMissReads +4% (NIO), +5% (juring) ✅
bulkInsert +1% (NIO) [regression resolved]
highChurnCompaction -1% (NIO) [regression resolved]
benchmarkDBCreation +1% (NIO) [regression resolved]
Known open issues (pre-existing, not introduced here):
joinChurnCompaction -15% on both backends (under investigation)
Signed-off-by: manticore-projects <andreas@manticore-projects.com>
We are getting there: READ now has all the advantages and WRITE is on par.
[Benchmark charts omitted: Write Benchmarks; Read / Scan Benchmarks]
How can we move this forward please? |
Greetings! This is a feeler PR for speeding up MVStore I/O.
I put a lot of time and effort into this, so please go easy on me. Thank you.
Summary:
Parallel Reads and Parallel Writes for MVStore, speeding up such workloads:
[Benchmark tables omitted: Throughput (ops/s, higher = better); Latency (ms, lower = better)]
My ultimate goal was to speed up MVStore using a JUring-based FileChannel. JUring is a Panama wrapper around libUring; please see https://github.com/manticore-projects/JUring/tree/filechannel and the DB-like benchmarks.
Technically it works, but there is no benefit yet from the JUringFileChannel.
However, there seem to be benefits when comparing NIO baseline vs. NIO parallel read/write, and I wonder if this may be interesting.
Full benchmarks attached.
You should be able to run them yourself with different workload factors. Beware: at 100% workload, the benchmarks take more than 1 hour and need more than 10 GB of file space. Also, please use EXT4 or XFS (but avoid compression, encryption, BTRFS, etc.). Last but not least, we try to flush the Linux cache between runs, which depends on sudo; please follow the warnings.
Any questions, concerns or recommendations are most welcome. Just let me know how to make this more useful, please.
human_baseline.txt
human.txt