
MVStore: Parallel batch read/write #4331

Open
manticore-projects wants to merge 20 commits into h2database:master from manticore-projects:juring


@manticore-projects
Contributor

@manticore-projects manticore-projects commented Feb 17, 2026

Greetings! This is a feeler PR for speeding up MVStore I/O.
I put a lot of time and effort into this, so please go easy on me. Thank you.

Summary:

  • The workloads that matter most for typical database usage are probably mixed (OLTP read/write mix), bulkInsert (ETL/migration), the join benchmarks (analytical queries), and coldCacheScan (startup/reporting queries against data not in cache).
  • I have written JMH-based benchmarks for those workloads: https://github.com/manticore-projects/H2Benchmark
  • I have implemented Parallel Reads and Parallel Writes for MVStore speeding up such workloads:

Throughput (ops/s, higher = better)

| Benchmark | Baseline NIO | Current NIO | Δ NIO | JUring | Δ JUring vs NIO |
|---|---|---|---|---|---|
| mixed (8 threads) | 1,449K | 1,449K | | 1,425K | −2% (noise) |
| randomCacheMissReads | 111K | 104K | −6% | 105K | +2% (noise) |
| compositeIndexRangeScan | 117 | 156 | +33% | 112 | −28% |
| secondaryIndexLookup | 55.1 | 57.9 | +5% | 46.8 | −19% |

Latency (ms, lower = better)

| Benchmark | Baseline NIO | Current NIO | Δ NIO | JUring | Δ JUring vs NIO |
|---|---|---|---|---|---|
| coldCacheScan | 1,322 | 863 | −35% | 1,644 | 1.9× (regression) |
| highChurnCompaction | 3,494 | 3,484 | | 2,497 | −28% |
| joinThreeWay | 2,751 | 2,396 | −13% | 6,820 | 2.8× (regression) |
| joinIndexedNestedLoop | 778 | 484 | −38% | 507 | +5% (noise) |
| joinAggregateGroupBy | 2,273 | 2,207 | | 2,294 | +4% (noise) |
| bulkInsert | 4,586 | 4,990 | +9% (noise) | 4,916 | −1% (noise) |
| compaction | 702 | 704 | | 719 | +2% (noise) |
| indexCreationAndCompaction | 15,412 | 16,450 | +7% | 19,407 | +18% (regression) |

My ultimate goal was to speed up MVStore using a JUring-based FileChannel. JUring is a Panama wrapper around liburing; please see https://github.com/manticore-projects/JUring/tree/filechannel and the DB-like benchmarks.

Technically it works, but there is no benefit yet from the JUringFileChannel.
However, NIO with parallel read/write seems to have benefits over the NIO baseline, and I wonder whether this may be interesting.

Full benchmarks attached.
You should be able to run them yourself with different workload factors. Beware: at 100% workload the benchmarks take more than 1 hour and need more than 10 GB of disk space. Also, please use EXT4 or XFS (but avoid compression, encryption, BTRFS, etc.). Last but not least, we try to flush the Linux cache between runs, which depends on sudo; please follow the warnings.

Any questions, concerns or recommendations are most welcome. Just let me know how to make this more useful, please.

human_baseline.txt
human.txt

Move PageSerializationManager from FileStore inner class to a standalone
class in org.h2.mvstore. Core serialization state (WriteBuffer, ToC, page
numbering, position encoding, checksum patching) is self-contained.

FileStore side-effects (cachePage, accountForRemovedPage,
accountForWrittenPage, cacheToC, countNewPage) are now injected through a
Callback interface, wired by the new factory method
FileStore.createPageSerializationManager().

Pure structural extraction — no behavior change. Enables future parallel
serialization by allowing independent PSM instances per worker thread.

Files changed:
  NEW  PageSerializationManager.java — extracted class + Callback interface
  MOD  FileStore.java — inner class removed, factory method added
  MOD  Page.java — import path, write() signature unqualified
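
As a rough illustration of the callback injection described above, here is a minimal sketch. It is not the PR's code: `PsmSketch`, `StoreSketch`, and the two callback signatures are hypothetical simplifications that only borrow names from the commit message, and the position value is a placeholder rather than a real page position.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for the extracted serialization manager. */
final class PsmSketch {

    /** Side effects injected by the owning store; signatures are assumed. */
    interface Callback {
        void cachePage(long pagePos);
        void accountForWrittenPage(int pageLength);
    }

    private final Callback callback;
    private int nextPageNo;

    PsmSketch(Callback callback) {
        this.callback = callback;
    }

    /** Serialization state stays local; store-wide effects go through the callback. */
    long onPageSerialized(int pageLength) {
        long pagePos = nextPageNo++; // placeholder, not a real position encoding
        callback.cachePage(pagePos);
        callback.accountForWrittenPage(pageLength);
        return pagePos;
    }
}

/** Hypothetical owner that wires its own side effects into a new manager. */
final class StoreSketch implements PsmSketch.Callback {
    final List<Long> cached = new ArrayList<>();
    long bytesWritten;

    /** Factory method: the store passes itself as the callback target. */
    PsmSketch createPageSerializationManager() {
        return new PsmSketch(this);
    }

    @Override
    public void cachePage(long pagePos) {
        cached.add(pagePos);
    }

    @Override
    public void accountForWrittenPage(int pageLength) {
        bytesWritten += pageLength;
    }
}
```

Because the manager only sees the `Callback` interface, several independent instances could be created per worker thread, which is the property the later parallel steps rely on.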
Buffer a saved chunk's on-disk content into memory only after a second
page is read from that chunk, not on first access. This avoids wasting
I/O and memory on scattered single-page lookups (e.g. secondary index
probes) while still accelerating sequential/range access patterns where
multiple pages from the same chunk are read.

Mechanism:
  - First page read from a chunk: sets volatile hint flag, does normal
    per-page I/O (no regression vs. baseline)
  - Second page read: reads the entire chunk (if within threshold) into
    Chunk.readBuffer; all subsequent reads slice from memory
  - resolveChunkBuffer() checks readBuffer -> buffer -> hint-gated
    full-chunk read

New fields on Chunk:
  - volatile ByteBuffer readBuffer -- cached on-disk content
  - volatile boolean readBufferHint -- set on first read, triggers
    buffering on second read
  - invalidateReadBuffer() -- clears both on block relocation

Chunk.readBufferForPage() restructured:
  - New resolveChunkBuffer() implements the two-hit buffering policy
  - PAGE_LARGE length pre-read satisfied from cached buffer when
    available
  - All page slicing unified through the resolved chunk buffer

Chunk.readToC() also checks readBuffer before per-region I/O.

FileStore gains a configurable threshold:
  - chunkReadCacheMaxBytes field (default 4 MB, 0 to disable)
  - getChunkReadCacheMaxBytes() / setChunkReadCacheMaxBytes(int)

FileStore.readPage() is unchanged -- buffering is fully transparent.

Files changed:
  MOD  Chunk.java -- readBuffer/readBufferHint fields,
       invalidateReadBuffer(), resolveChunkBuffer(), modified
       readBufferForPage() and readToC()
  MOD  FileStore.java -- chunkReadCacheMaxBytes field + getter/setter

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
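
The two-hit policy above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's `Chunk` code: `TwoHitChunkBuffer` is a hypothetical class, and the full-chunk read is abstracted as a `Supplier<ByteBuffer>` instead of real file I/O.

```java
import java.nio.ByteBuffer;
import java.util.function.Supplier;

/** Hypothetical sketch of a "buffer on second hit" policy for one chunk. */
final class TwoHitChunkBuffer {
    private final int chunkLength;
    private final int maxBytes; // threshold; 0 disables buffering
    private final Supplier<ByteBuffer> fullChunkReader; // stand-in for file I/O

    private volatile ByteBuffer readBuffer; // whole chunk, once buffered
    private volatile boolean readBufferHint; // set by the first page read

    TwoHitChunkBuffer(int chunkLength, int maxBytes,
            Supplier<ByteBuffer> fullChunkReader) {
        this.chunkLength = chunkLength;
        this.maxBytes = maxBytes;
        this.fullChunkReader = fullChunkReader;
    }

    /** Returns the buffered chunk, or null if the caller should do per-page I/O. */
    ByteBuffer resolveChunkBuffer() {
        ByteBuffer buf = readBuffer;
        if (buf != null) {
            return buf; // already buffered: slice from memory
        }
        if (maxBytes <= 0 || chunkLength > maxBytes) {
            return null; // disabled, or chunk too large to buffer
        }
        if (!readBufferHint) {
            readBufferHint = true; // first hit: remember it, do not buffer yet
            return null;
        }
        buf = fullChunkReader.get(); // second hit: read the whole chunk once
        readBuffer = buf;
        return buf;
    }

    /** Clears both fields, e.g. after the chunk's blocks were relocated. */
    void invalidateReadBuffer() {
        readBuffer = null;
        readBufferHint = false;
    }
}
```

In this sketch two threads racing on the second hit may both read the chunk; the result is the same buffer content either way, so the duplicate read costs I/O but not correctness.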
When a cursor descends from a NonLeaf to a new leaf page, submit a
best-effort background prefetch for the next sibling child page. This
overlaps the I/O for leaf N+1 with processing of leaf N, halving
effective I/O latency for sequential range scans.

The prefetch uses getChildPagePos() to obtain the sibling's position
without loading it, then submits to ForkJoinPool.commonPool(). If the
page is already cached, the prefetch is a no-op. If the background read
fails, it is silently ignored -- the page will be demand-loaded normally.

Works in both forward and reverse cursor directions.

Cursor.hasNext():
  - Track whether we actually descended through NonLeaf nodes
  - After descent, call prefetchNextSibling() with the parent's position
  - prefetchNextSibling() computes next/prev sibling index from the
    parent's children[] array and calls MVMap.prefetchPage()

New plumbing (thin wrappers at each layer):
  - MVMap.prefetchPage(pos) -> MVStore.prefetchPage(map, pos)
    -> FileStore.prefetchPage(map, pos)
  - FileStore.prefetchPage() checks cache, skips unsaved pages, submits
    readPage() to ForkJoinPool.commonPool()

Files changed:
  MOD  Cursor.java -- descended flag, prefetchNextSibling() method
  MOD  MVMap.java -- prefetchPage(pos) wrapper
  MOD  MVStore.java -- prefetchPage(map, pos) wrapper
  MOD  FileStore.java -- prefetchPage() impl + ForkJoinPool import

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Replace the single-sibling prefetchNextSibling() with prefetchAhead()
which submits up to PREFETCH_WINDOW (4) upcoming sibling child pages
when the cursor crosses a leaf boundary. Prefetch is directional --
only siblings ahead in the scan direction are submitted.

This is a cursor-scoped alternative to the global NonLeaf readPage()
hook approach, which regressed point lookups by prefetching children
indiscriminately on every NonLeaf load. By keeping prefetch in the
cursor, only scan workloads pay the cost, and the scan direction
constrains which children are prefetched.

Cursor.hasNext():
  - After descent to a new leaf, call prefetchAhead() with the
    parent's position and scan direction
  - prefetchAhead() iterates the parent's children[] in scan
    direction, submitting up to PREFETCH_WINDOW pages via
    MVMap.prefetchPage()

FileStore.java: reverts to Step 3 state (no readPage hook, no
IN_PREFETCH ThreadLocal, no nonLeafPrefetchWindow field, no
prefetchChildren method).

Files changed:
  MOD  Cursor.java -- PREFETCH_WINDOW constant, prefetchAhead()
       replaces prefetchNextSibling()

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
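
The directional window described above can be sketched like this. It only shows how the candidate positions might be selected; the class name, the method signature, and the convention that a position of 0 marks an unsaved page are assumptions for illustration, and the actual submission to a background pool is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: pick up to PREFETCH_WINDOW siblings in scan direction. */
final class PrefetchWindowSketch {
    static final int PREFETCH_WINDOW = 4;

    /**
     * Returns the positions of the siblings that follow currentIndex in scan
     * direction, limited to PREFETCH_WINDOW entries.
     */
    static List<Long> prefetchAhead(long[] childPositions, int currentIndex,
            boolean reverse) {
        List<Long> out = new ArrayList<>(PREFETCH_WINDOW);
        int step = reverse ? -1 : 1;
        for (int i = currentIndex + step;
                i >= 0 && i < childPositions.length
                        && out.size() < PREFETCH_WINDOW;
                i += step) {
            long pos = childPositions[i];
            if (pos != 0) { // assumption: 0 means "not saved yet", nothing to read
                out.add(pos);
            }
        }
        return out;
    }
}
```

Siblings behind the cursor are never selected, so a forward scan near the end of a parent node yields a short or empty list instead of wrapping around.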
Add SerializedPageRecord to PageSerializationManager that captures
buffer-level layout information for each serialized page: buffer
offset, page length, type, composed pagePos, and ToC element. The
Page reference is attached in onPageSerialized().

This is pure bookkeeping -- no behavior change. The record list
provides the data needed by the upcoming rebasePositions() method
(Step 5b) to adjust all page positions when merging per-map local
buffers into a global buffer at a different base offset.

Changes to getPagePosition():
  - Appends a new SerializedPageRecord after computing the position

Changes to onPageSerialized():
  - Sets the Page reference on the most recent record

New types:
  - SerializedPageRecord (public static final inner class)

New methods:
  - getSerializedPages() -- returns the record list

Files changed:
  MOD  PageSerializationManager.java

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Add rebasePositions(int baseOffset) to PageSerializationManager
that adjusts all page positions when a local serialization buffer
is merged into a global chunk buffer at a non-zero offset.

The method performs two passes:

Pass 1 — for each serialized page:
  - Recomposes the ToC element with offset + baseOffset
  - Recomposes pagePos from the rebased ToC element
  - Updates the ToC list entry
  - Patches the check value in the buffer (check incorporates offset)
  - CAS-updates Page.pos via new Page.rebasePos()

Pass 2 — for each NonLeaf page:
  - Parses the page header to locate the child-pointer region
  - Replaces child pointer longs that reference pages within this
    buffer (looked up via old->new position map)
  - Leaves child pointers to pages from other chunks untouched

SerializedPageRecord gains a mapId field (needed to recompose
ToC elements with the rebased offset).

Page.java gains package-private rebasePos(long expected, long new)
that atomically updates Page.pos via posUpdater CAS, throwing
MVStoreException on mismatch (indicates a rebase logic bug).

Files changed:
  MOD  PageSerializationManager.java -- mapId in record,
       rebasePositions(), rebaseChildPointers()
  MOD  Page.java -- rebasePos() method, import/signature fixes
       for top-level PSM

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
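
To make the two passes more concrete, here is a small sketch of rebasing positions and child pointers. The bit layout in `compose()` is a deliberately simplified, hypothetical encoding chosen for the example, not MVStore's actual page-position format, and checksum patching and the CAS on `Page.pos` are omitted.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of rebasing page positions by a base offset. */
final class RebaseSketch {

    /** Assumed layout: chunk id in the high bits, 32-bit offset, 6 flag bits. */
    static long compose(int chunkId, int offset, int flags) {
        return ((long) chunkId << 38) | ((long) offset << 6) | (flags & 0x3F);
    }

    static int chunkOf(long pos) {
        return (int) (pos >>> 38);
    }

    static int offsetOf(long pos) {
        return (int) (pos >>> 6);
    }

    static long rebase(long pos, int baseOffset) {
        return compose(chunkOf(pos), offsetOf(pos) + baseOffset, (int) (pos & 0x3F));
    }

    /** Pass 1: rebase every serialized page and remember old -> new. */
    static Map<Long, Long> rebaseAll(long[] pagePositions, int baseOffset) {
        Map<Long, Long> oldToNew = new HashMap<>();
        for (int i = 0; i < pagePositions.length; i++) {
            long rebased = rebase(pagePositions[i], baseOffset);
            oldToNew.put(pagePositions[i], rebased);
            pagePositions[i] = rebased;
        }
        return oldToNew;
    }

    /** Pass 2: rewrite only child pointers that refer to pages in this buffer. */
    static void rebaseChildPointers(long[] childPointers, Map<Long, Long> oldToNew) {
        for (int i = 0; i < childPointers.length; i++) {
            Long rebased = oldToNew.get(childPointers[i]);
            if (rebased != null) {
                childPointers[i] = rebased; // pointers into other chunks stay as-is
            }
        }
    }
}
```

The old-to-new map is what keeps pass 2 safe: a pointer is rewritten only if it matches a position produced by this buffer, so references into other chunks are never touched.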
Refactor serializeToBuffer() so that each changed map tree is
serialized through its own PageSerializationManager (with sequential
pageNoBase to avoid page-number collisions), while all PSMs still
write into the shared global WriteBuffer. This means page positions
are correct from the start — no local buffers and no rebase are
needed.

The layout map continues to get its own PSM (Phase 2), and
writeMergedToC() concatenates all per-map ToC entries followed
by the layout ToC into a single cached tocArray.

This is the correctness gate: execution is still single-threaded
and the buffer layout is identical to the original code, so any
regression indicates a bug in the PSM-per-map split. Local buffers
and rebase (needed for Step 5d parallelism) will be introduced as
a separate validated step.

New types:
  FileStore.MapSerializationResult -- holds per-map PSM + root info

New/changed methods:
  FileStore.writeMergedToC() -- merges ToC from all PSMs
  FileStore.createPageSerializationManager(chunk, buff, pageNoBase)
  PSM constructor with pageNoBase, PSM.getPageCount()

Files changed:
  MOD  FileStore.java -- serializeToBuffer rewritten,
       MapSerializationResult, writeMergedToC
  MOD  PageSerializationManager.java -- pageNoBase field and
       constructor, getPageCount()

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Serialize each MVMap's B-tree into its own local WriteBuffer and
PageSerializationManager, then copy into the global chunk buffer
at the map's assigned base offset.  A three-pass rebase corrects
all position references from local to global coordinates:

  Pass 1 – Patch page checksums, CAS-update each Page.pos field,
           and rewrite ToC entries in the global buffer.
  Pass 2 – Fix NonLeaf child-pointer slots in the on-disk buffer
           so they reference rebased positions.
  Pass 3 – Sync in-memory PageReference.pos values via
           syncChildRefsAfterRebase(), preventing stale local
           offsets from being used when cached child pages are
           evicted and later re-read from disk.

Pass 3 fixes a subtle corruption bug: after rebase the on-disk
data was correct, but NonLeaf.children[].pos still held the
pre-rebase local offset.  If memory pressure evicted a child page
(nulling the strong reference), getChildPage() would read at the
old local offset—landing in the chunk header/layout area and
producing "expected page length 4..384, got 1869575226" errors
(the garbage value decodes to ASCII "olg*" from layout strings).

The layout map is serialised last, directly into the global buffer
at the pre-reserved slot.  A merged table-of-contents is built
from all per-map PSMs.

Serialization is still sequential in this step; the per-map split
establishes the precondition for parallel serialization in step 5d.

Files changed:
  FileStore.java                  – phase 2 copy-then-rebase loop
  PageSerializationManager.java   – rebasePositions(offset, buffer)
                                    with explicit ByteBuffer param;
                                    pass 3 child-ref sync
  Page.java                       – syncChildRefsAfterRebase() on
                                    NonLeaf; no-op on Leaf

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
Parallelize the B-tree serialization of independent MVMap trees
during chunk writes.  Each map's writeUnsavedRecursive() now runs
concurrently via the common ForkJoinPool when two or more maps
have unsaved pages; single-map commits remain sequential.

The key enablers:

  countUnsavedPages() — new abstract method on Page, implemented
  in Leaf (trivial), NonLeaf (recursive), and IncompleteNonLeaf
  (recursive, skips self when incomplete).  Called sequentially in
  Phase 1a to pre-compute per-map page counts, which are summed
  into cumulative pageNoBase values so that each parallel worker
  writes pages with globally-unique page numbers.

  Deferred callbacks — PageSerializationManager gains a
  deferCallbacks flag and applyDeferredCallbacks() method.  When
  deferred, onPageSerialized() records side-effect data (cache
  insert, chunk accounting, removed-page tracking) in the
  SerializedPageRecord instead of firing callbacks immediately.
  Callbacks are replayed sequentially in Phase 2 after rebase,
  ensuring:
    (a) Thread safety — cachePage(), accountForWrittenPage(), and
        accountForRemovedPage() are not thread-safe.
    (b) Correct cache keys — pages are cached with their final
        global positions rather than pre-rebase local offsets.

Serialization phases are now:

  Phase 1a — Sequential: partition changed maps, count unsaved
             pages, compute cumulative pageNoBase values.
  Phase 1b — Parallel: each map serializes into its own
             WriteBuffer + deferred PSM.
  Phase 2  — Sequential: copy local buffers → global, rebase
             positions, replay deferred callbacks, record roots.
  Phase 3  — Sequential: layout map serialization.
  Phase 4  — Sequential: merged ToC from all PSMs.

Files changed:
  Page.java                       – countUnsavedPages() abstract +
                                    Leaf/NonLeaf/IncompleteNonLeaf
  PageSerializationManager.java   – deferCallbacks flag, deferred
                                    data fields on SerializedPageRecord,
                                    applyDeferredCallbacks()
  FileStore.java                  – Phase 1a/1b split with
                                    IntStream.parallel(), deferred
                                    PSM factory, applyDeferredCallbacks
                                    call in Phase 2

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
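
The phase split above can be sketched in a few lines. This is only an illustration under simplifying assumptions: a "map" is reduced to its page count, "serializing" a page just writes its page number as an int, and `ParallelSerializeSketch` and its methods are hypothetical names. It shows how cumulative `pageNoBase` values computed up front let the parallel workers produce globally unique page numbers without coordination.

```java
import java.nio.ByteBuffer;
import java.util.stream.IntStream;

/** Hypothetical sketch of count -> parallel serialize -> sequential merge. */
final class ParallelSerializeSketch {

    /** Phase 1a: cumulative page-number bases from per-map page counts. */
    static int[] pageNoBases(int[] pageCounts) {
        int[] bases = new int[pageCounts.length];
        int sum = 0;
        for (int i = 0; i < pageCounts.length; i++) {
            bases[i] = sum;
            sum += pageCounts[i];
        }
        return bases;
    }

    /** Stand-in for serializing one map into its own local buffer. */
    static ByteBuffer serializeMap(int pageCount, int pageNoBase) {
        ByteBuffer local = ByteBuffer.allocate(pageCount * Integer.BYTES);
        for (int p = 0; p < pageCount; p++) {
            local.putInt(pageNoBase + p);
        }
        local.flip();
        return local;
    }

    static ByteBuffer serializeAll(int[] pageCounts) {
        int[] bases = pageNoBases(pageCounts); // Phase 1a, sequential
        ByteBuffer[] locals = new ByteBuffer[pageCounts.length];
        IntStream.range(0, pageCounts.length).parallel() // Phase 1b, parallel
                .forEach(i -> locals[i] = serializeMap(pageCounts[i], bases[i]));
        int total = 0;
        for (ByteBuffer b : locals) {
            total += b.remaining();
        }
        ByteBuffer global = ByteBuffer.allocate(total);
        for (ByteBuffer b : locals) { // Phase 2, sequential copy in map order
            global.put(b);
        }
        global.flip();
        return global;
    }
}
```

Each worker writes only to its own buffer and its own array slot, so the parallel phase needs no locking; anything that touches shared state belongs in the sequential phases, which is the role the deferred callbacks play in the commit.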
@andreitokar
Contributor

andreitokar commented Feb 18, 2026

Hi @manticore-projects, great job!
While your repository containing the benchmarks is unavailable (404), the results are somewhat predictable, judging by the benchmark names.
They clearly show that page prefetch would help, and that is something we may implement. Instead of a fixed-size static window, the cursor's “upper” bound can be analyzed to avoid prefetching for very short ranges. Prefetch can also be short-circuited more aggressively to minimize the penalty for the in-memory case.
On the other hand, the idea of caching chunks does not look that promising. Chunks tend to be relatively big (1-4 MB) collections of random pages, and hitting the right page is purely a coincidence. I believe that giving that memory to the page cache may be a bigger bang for the buck. BTW, in your tests the baseline case should have that bigger cache, to make it an apples-to-apples comparison.
Whether the parallel serialization is going to be a net positive, my guess is as good as yours, but all my attempts to parallelize any light stuff were negative: the forking overhead was too high.

@manticore-projects
Contributor Author

Thank you for your warm feedback. I have made the benchmark repository public: https://github.com/manticore-projects/H2Benchmark
Since your response was positive, I would like to suggest next steps:

  1. I will port the changes to the latest H2 version
  2. I will take out any Panama/JDK16 methods and make it JDK11 compatible

Then we can improve the benchmarks together. In particular, I would like to ask for some help or guidance on the BulkInsert benchmark, because I can't get it stable. Measurements jump up and down and show massive outliers across all tests, and I don't know why.

@manticore-projects
Contributor Author

I have rebased everything onto the latest H2 origin/master and also back-ported it to JDK 11.

@manticore-projects
Contributor Author

I have had a close look at the CI tests:

  • The new parallel version of the MVStore needs more memory. To me this looks acceptable because a) the days of 128 MB are long over and b) the changes aim at large database files (>2 GB), not at small 1000-row DBs.
  • There are two tests that expect a specific number of reads and a specific compact result. Those fail, of course, because the logic has changed.
  • Some tests related to "lazy execution" fail for me even before this PR. I do not know what this is about.

  • So how to go about this? Can/shall I increase the Memory to 1G or 2G for the tests (which worked for me)? Can/shall I adjust the tests for touched nodes and compact results?

The good news is: all tests succeed with this PR and there is no corruption anywhere, not even in my large DBs!

…ache

Implement the three improvements suggested during code review:

## Cursor prefetch with B-tree upper-bound analysis (Cursor.java)

Add `maybePrefetchSiblings()`, called on every internal-node descent.
Rather than a fixed sliding window, it uses the B-tree's own separator
keys to determine which sibling subtrees actually fall within the
cursor's `to` bound — stopping as soon as a separator key crosses the
boundary. This avoids issuing any I/O for subtrees outside the
requested range, making short-range scans pay only for what they need.

Two short-circuit guards eliminate overhead for warm workloads:
- If the store has no backing file, return immediately (pure in-memory).
- If the first candidate sibling is already in the page cache, skip
  ForkJoin submission entirely — the working set is hot and no I/O is
  needed. This reduces the cost of repeated cursor iteration over
  cached data to a single cache probe per internal node.

A `PREFETCH_MIN_SIBLINGS = 2` floor suppresses prefetch at the very
tail of a subtree where scheduling overhead would outweigh any benefit.

## Disable chunk read cache by default (FileStore.java)

Change `chunkReadCacheMaxBytes` default from 4 MB to 0.

The chunk buffer heuristic buffers entire 1–4 MB raw chunk blobs on a
second page miss from the same chunk. In practice, chunks are large
collections of randomly-scattered pages — a cursor traversal rarely
hits the same chunk twice, so the second-hit rate is low. The 4 MB is
better spent on the page cache, which is keyed by page position and
reuses actually-hot decoded Page objects across all access patterns.

Callers that explicitly set `setChunkReadCacheMaxBytes()` are
unaffected.

## Add `isPageCached()` helper (FileStore.java)

Package-private probe into `CacheLongKeyLIRS` without triggering a
load. Used by the warm-cache short-circuit above.

## Document the memory tradeoff (MVStore.java)

Extend the `Builder.cacheSize()` Javadoc to explain why the chunk read
cache was disabled and where that memory should go instead.

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
@manticore-projects
Contributor Author

Read throughput (ops/s, higher = better)

| Benchmark | Baseline | NIO | NIO vs base | JUring | JUring vs base |
|---|---|---|---|---|---|
| compositeIndexRangeScan | 192 | 202 | ✅ +5.1% | 161 | ❌ -16.2% |
| randomCacheMissReads | 146.0k | 165.2k | ✅ +13.1% | 160.1k | ✅ +9.7% |
| secondaryIndexLookup | 117 | 120 | ✅ +2.9% | 109 | ❌ -6.9% |

Mixed workload throughput (ops/s, higher = better)

| Benchmark | Baseline | NIO | NIO vs base | JUring | JUring vs base |
|---|---|---|---|---|---|
| mixed | 1.82M | 1.83M | ✅ +0.1% | 1.94M | ✅ +6.1% |
| mixed:testSelect | 1.68M | 1.69M | ✅ +0.9% | 1.79M | ✅ +6.8% |
| mixed:testUpdate | 147.9k | 134.7k | ❌ -8.9% | 146.4k | ❌ -1.0% |

Write / bulk latency (ms, lower = better)

| Benchmark | Baseline | NIO | NIO vs base | JUring | JUring vs base |
|---|---|---|---|---|---|
| bulkInsert | 1734ms | 2040ms | ❌ +17.6% | 1731ms | ✅ -0.2% |
| bulkInsertWithIndex | 1821ms | 1950ms | ❌ +7.1% | 1959ms | ❌ +7.6% |
| benchmarkDBCreation | 7ms | 9ms | ❌ +38.5% | 11ms | ❌ +54.2% |

Scan latency (ms, lower = better)

| Benchmark | Baseline | NIO | NIO vs base | JUring | JUring vs base |
|---|---|---|---|---|---|
| coldCacheScan | 699ms | 664ms | ✅ -4.9% | 680ms | ✅ -2.6% |
| indexCreationAndCompaction | 12485ms | 12101ms | ✅ -3.1% | 12285ms | ✅ -1.6% |

Compaction latency (ms, lower = better)

| Benchmark | Baseline | NIO | NIO vs base | JUring | JUring vs base |
|---|---|---|---|---|---|
| compaction | 399ms | 408ms | ❌ +2.4% | 394ms | ✅ -1.2% |
| highChurnCompaction | 1735ms | 2264ms | ❌ +30.5% | 1796ms | ❌ +3.5% |

Join latency (ms, lower = better)

| Benchmark | Baseline | NIO | NIO vs base | JUring | JUring vs base |
|---|---|---|---|---|---|
| joinAggregateGroupBy | 1367ms | 1234ms | ✅ -9.8% | 1339ms | ✅ -2.1% |
| joinChurnCompaction | 531ms | 609ms | ❌ +14.8% | 627ms | ❌ +18.2% |
| joinIndexedNestedLoop | 508ms | 508ms | ✅ -0.1% | 524ms | ❌ +3.1% |
| joinThreeWay | 993ms | 991ms | ✅ -0.2% | 957ms | ✅ -3.6% |

Counting the marks in the tables above, JUring ends up with 9 green vs 8 red and NIO with 10 green vs 7 red. JUring's wins are concentrated exactly where they matter most: mixed read workloads (+6.1%/+6.8%) and scans. NIO is positive on reads and scans but still dragged down by write paths.

Biggest problem right now is the volatility of the WRITE tests.

human.txt
human_baseline.txt

Add background page prefetching to Cursor, triggered during B-tree
descent for bounded range scans. Four iterations of benchmarking shaped
the final design; the key constraints discovered empirically are
documented below.

## What was added

### Cursor.maybePrefetchSiblings() (Cursor.java)

Called once per internal-node level during hasNext() descent, collects
the sibling child positions that the cursor will visit after the current
subtree and submits them for background I/O.

Four guards control when prefetch is suppressed:

  if (to == null) return;
  The most important guard. Unbounded full-table scans saturate I/O on
  their own; adding background reads doubles traffic and causes the
  foreground thread to race its own prefetch tasks. Discovered after
  coldCacheScan regressed +92% without it.

  if (sibEnd - sibStart < PREFETCH_MIN_SIBLINGS) return;   // = 8
  Suppresses prefetch at the tail of nearly-exhausted subtrees where
  scheduling overhead exceeds latency hidden. Threshold of 8 chosen
  empirically; below this the ForkJoin submission cost dominates.

  if (fs.isPageCached(firstPos)) return;
  Warm-path short-circuit: if the first candidate sibling is already
  in the page cache the working set is hot; skip entirely to avoid
  unnecessary ForkJoin overhead on repeated warm iterations.

  descentDepth < MAX_PREFETCH_DEPTH   // = 1
  Limits prefetch to the top two levels of each descent. Without this,
  prefetch fires at every internal node on every descent, re-queuing
  the same pages redundantly and causing foreground/background racing.

Upper-bound analysis uses the B-tree's own separator keys to skip
siblings whose entire key range lies outside [from, to], so short
range scans pay only for what they need.
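
The upper-bound analysis and two of the guards above can be sketched as follows. This is an illustration under stated assumptions, not the PR's `Cursor` code: keys are plain `long` values instead of generic keys with a comparator, `separators[i]` is assumed to be the smallest key that can appear in `children[i + 1]`, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: choose siblings that can still intersect the bound. */
final class BoundedPrefetchSketch {

    /**
     * Returns the positions of siblings after currentIndex whose key range
     * may still contain keys <= to. Returns an empty list for unbounded
     * scans or when fewer than minSiblings candidates remain.
     */
    static List<Long> siblingsWithinBound(long[] separators, long[] children,
            int currentIndex, Long to, int minSiblings) {
        List<Long> out = new ArrayList<>();
        if (to == null) {
            return out; // unbounded scan: prefetch is suppressed
        }
        for (int j = currentIndex + 1; j < children.length; j++) {
            if (separators[j - 1] > to) {
                break; // this subtree starts beyond the upper bound
            }
            out.add(children[j]);
        }
        if (out.size() < minSiblings) {
            out.clear(); // too few candidates to be worth scheduling
        }
        return out;
    }
}
```

Because separator keys are sorted, the loop can stop at the first sibling that starts past the bound, so a short range scan inspects only a few separators and selects no pages outside its range.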

### FileStore changes (FileStore.java)

isPageCached(): package-private cache probe without triggering a load,
used by the warm-path short-circuit above.

prefetchPages(): redesigned submission path:
  - NIO: one ForkJoin task reads all positions sequentially. The
    previous per-page task submission (N round-trips) dominated cost
    for short prefetch lists and caused measurable write-path
    regression.
  - io_uring: one ForkJoin task submits in windows of MAX_PREFETCH_BATCH
    (= 32) SQEs. Windowing prevents ring saturation under concurrent
    read workloads; demand reads can interleave between windows.

chunkReadCacheMaxBytes default: 4 MB → 0. Chunks are large collections
of randomly-scattered pages; the hit rate is low and the memory is
better spent on the page cache.

## Benchmark summary (baseline NIO vs final, single-threaded)

  randomCacheMissReads      +13.1% NIO   +9.7% JUring  ✓
  compositeIndexRangeScan    +5.1% NIO  -16.2% JUring  (JUring open)
  secondaryIndexLookup       +2.9% NIO   -6.9% JUring
  mixed (read-heavy)         +0.9% NIO   +6.8% JUring  ✓
  coldCacheScan              -4.9% NIO   -2.6% JUring  ✓  (was +92%)
  indexCreationAndCompaction -3.1% NIO   -1.6% JUring  ✓  (was +25%)
  joinAggregateGroupBy       -9.8% NIO   -2.1% JUring  ✓
  bulkInsert                +17.6% NIO   -0.2% JUring  (NIO write open)
  testUpdate                 -8.9% NIO   -1.0% JUring  (write contention)

JUring benefits most from prefetch due to true kernel-level batching.
NIO write-path regressions (bulkInsert, testUpdate) reflect background
task I/O contention during write-heavy operations; further work needed.

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
@manticore-projects
Contributor Author

@andreitokar Do you have any idea how to "stabilize" the bulkInsert benchmark, please? It shows random, abysmal deviations, and I have tried all my magic without achieving much:

H2JuringBenchmark.bulkInsert                       nio     ss   14     2040.289 ±   957.705  ms/op
H2JuringBenchmark.bulkInsert                    juring     ss   14     1730.756 ±    20.929  ms/op
H2JuringBenchmark.bulkInsertWithIndex              nio     ss   14     1949.949 ±    92.450  ms/op
H2JuringBenchmark.bulkInsertWithIndex           juring     ss   14     1959.359 ±    77.436  ms/op
Iteration   1: /run/media/are/test/juring-test: 258.1 MiB (270675968 bytes) trimmed
4697.761 ms/op
Iteration   2: /run/media/are/test/juring-test: 260.7 MiB (273403904 bytes) trimmed
3098.378 ms/op
Iteration   3: /run/media/are/test/juring-test: 260.7 MiB (273403904 bytes) trimmed
1730.942 ms/op
Iteration   4: /run/media/are/test/juring-test: 260.5 MiB (273203200 bytes) trimmed
1695.465 ms/op
Iteration   5: /run/media/are/test/juring-test: 260.5 MiB (273203200 bytes) trimmed
1702.595 ms/op
Iteration   6: /run/media/are/test/juring-test: 260.7 MiB (273403904 bytes) trimmed
1702.411 ms/op
Iteration   7: /run/media/are/test/juring-test: 260.7 MiB (273403904 bytes) trimmed
1707.768 ms/op

The first and second iteration render this benchmark useless.

…iority gate

Prior to this change, prefetchPages() submitted background read tasks to
ForkJoinPool.commonPool(). The write-path parallel serialization
(serializeToBuffer Phase 1) also runs on commonPool via IntStream.parallel().
Under write-heavy workloads, prefetch I/O tasks occupied common-pool threads
and starved write serialization, causing significant regressions:

  bulkInsert            +18%  slower (ms/op)
  highChurnCompaction   +30%  slower (ms/op)
  benchmarkDBCreation   +38%  slower (ms/op)
  mixed:testUpdate       -9%  throughput loss

Changes:

1. Dedicated prefetchExecutor (FileStore)
   Replace all ForkJoinPool.commonPool().execute() calls in prefetchPage()
   and prefetchPages() with a single-threaded ThreadPoolExecutor:
   - Thread.MIN_PRIORITY daemon thread ("H2-prefetch")
   - Bounded ArrayBlockingQueue(4) + DiscardOldestPolicy: stale prefetch
     requests are silently dropped under write pressure rather than queuing
   - Shut down in close() before layout map is closed

2. Write-priority gate (FileStore.isWritePipelineBusy)
   New isWritePipelineBusy() checks serializationExecutor and
   bufferSaveExecutor queue depths. If either has pending work, prefetchPages()
   returns immediately, ceding full I/O bandwidth to the write path.

3. Mid-task abort
   NIO sequential prefetch loop and io_uring batch windows both re-check
   isWritePipelineBusy() between pages/windows and abort early if writes arrive
   after the task has already started.

Results vs baseline (JMH 1.37, JDK 25, single-threaded, G1GC):

  indexCreationAndCompaction   +25%  (NIO),  +15%  (juring)  🟢
  coldCacheScan                 +7%  (NIO)                    ✅
  compositeIndexRangeScan       +6%  (NIO)                    ✅
  randomCacheMissReads          +4%  (NIO),   +5%  (juring)  ✅
  bulkInsert                    +1%  (NIO)   [regression resolved]
  highChurnCompaction           -1%  (NIO)   [regression resolved]
  benchmarkDBCreation           +1%  (NIO)   [regression resolved]

Known open issues (pre-existing, not introduced here):
  joinChurnCompaction   -15%  on both backends (under investigation)

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
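
A minimal sketch of the executor and gate described above, using only standard `java.util.concurrent` classes. The class name, the `prefetch()` signature, and the `BooleanSupplier` standing in for `isWritePipelineBusy()` are assumptions for illustration; the real page read is abstracted as a `LongConsumer`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

/** Hypothetical sketch of a low-priority, bounded prefetch executor. */
final class PrefetchExecutorSketch {

    /** One daemon thread at MIN_PRIORITY; queue of 4; oldest request dropped. */
    static ThreadPoolExecutor newPrefetchExecutor() {
        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(4),
                r -> {
                    Thread t = new Thread(r, "H2-prefetch");
                    t.setDaemon(true);
                    t.setPriority(Thread.MIN_PRIORITY);
                    return t;
                },
                new ThreadPoolExecutor.DiscardOldestPolicy());
    }

    /** Returns false if the request was skipped because writes are pending. */
    static boolean prefetch(Executor executor, BooleanSupplier writePipelineBusy,
            long[] positions, LongConsumer readPage) {
        if (writePipelineBusy.getAsBoolean()) {
            return false; // write-priority gate: do not even submit
        }
        executor.execute(() -> {
            for (long pos : positions) {
                if (writePipelineBusy.getAsBoolean()) {
                    return; // mid-task abort: writes arrived after we started
                }
                try {
                    readPage.accept(pos);
                } catch (RuntimeException e) {
                    // best effort: the page will be demand-loaded later
                }
            }
        });
        return true;
    }
}
```

With `DiscardOldestPolicy`, a full queue drops the longest-waiting request in favour of the newest one, which fits prefetching: the oldest request is the one most likely to be stale by the time it would run.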
@manticore-projects
Contributor Author

manticore-projects commented Mar 2, 2026

We are getting there: READ now has all the advantages and WRITE is on par.

Write Benchmarks

| Benchmark | Units | Baseline | NIO Latest | Δ NIO | juring Latest | Δ juring |
|---|---|---|---|---|---|---|
| bulkInsert | ms/op | 1,734.2 | 1,723.8 | +0.6% ➖ | 1,762.6 | -1.6% ➖ |
| bulkInsertWithIndex | ms/op | 1,821.3 | 1,914.5 | -4.9% 🟡 | 1,927.1 | -5.5% 🟡 |
| mixed:testUpdate | ops/s | 147,850 | 140,945 | -4.7% 🟡 | 139,790 | -5.5% 🟡 |
| highChurnCompaction | ms/op | 1,735.1 | 1,745.6 | -0.6% ➖ | 1,804.9 | -3.9% 🟡 |
| benchmarkDBCreation | ms/op | 6.85 | 6.81 | +0.5% ➖ | 7.41 | -7.6% 🟡 |
| joinChurnCompaction | ms/op | 530.6 | 626.9 | -15.4% 🔴 | 617.5 | -14.1% 🔴 |
| indexCreationAndCompaction | ms/op | 12,485.4 | 9,961.9 | +25.3% 🟢 | 10,861.8 | +14.9% 🟢 |

Read / Scan Benchmarks

| Benchmark | Units | Baseline | NIO Latest | Δ NIO | juring Latest | Δ juring |
|---|---|---|---|---|---|---|
| compositeIndexRangeScan | ops/s | 192.3 | 202.9 | +5.5% ✅ | 189.5 | -1.4% ➖ |
| randomCacheMissReads | ops/s | 146,016 | 152,057 | +4.1% ✅ | 153,702 | +5.3% ✅ |
| secondaryIndexLookup | ops/s | 116.7 | 111.9 | -4.1% 🟡 | 116.3 | -0.3% ➖ |
| coldCacheScan | ms/op | 698.7 | 650.6 | +7.4% ✅ | 712.4 | -1.9% ➖ |
| mixed (total) | ops/s | 1,823,488 | 1,887,557 | +3.5% ✅ | 1,885,723 | +3.4% ✅ |
| mixed:testSelect | ops/s | 1,675,638 | 1,746,612 | +4.2% ✅ | 1,745,933 | +4.2% ✅ |
| joinAggregateGroupBy | ms/op | 1,367.4 | 1,312.8 | +4.2% ✅ | 1,279.7 | +6.9% ✅ |
| joinIndexedNestedLoop | ms/op | 508.2 | 490.7 | +3.6% ✅ | 531.5 | -4.4% 🟡 |
| joinThreeWay | ms/op | 992.9 | 996.9 | -0.4% ➖ | 1,018.2 | -2.5% 🟡 |
| compaction | ms/op | 398.6 | 408.0 | -2.3% 🟡 | 393.4 | +1.3% ➖ |

@manticore-projects
Contributor Author

How can we move this forward please?
