MVStore: Parallel batch read/write by manticore-projects · Pull Request #4331 · h2database/h2database

manticore-projects · 2026-02-17T12:50:52Z

Greetings! This is a feeler PR for speeding up MVStore I/O.
I put a lot of time and effort into this, so please go easy on me. Thank you.

Summary:

The workloads that matter most for typical database usage are probably mixed (OLTP read/write mix), bulkInsert (ETL/migration), the join benchmarks (analytical queries), and coldCacheScan (startup/reporting queries against data not in cache).
I have written JMH-based benchmarks for those workloads: https://github.com/manticore-projects/H2Benchmark
I have implemented parallel reads and parallel writes for MVStore to speed up such workloads:

Throughput (ops/s, higher = better)

Benchmark	Baseline NIO	Current NIO	Δ NIO	JUring	Δ JUring vs NIO
mixed (8 threads)	1,449K	1,449K	—	1,425K	−2% (noise)
randomCacheMissReads	111K	104K	−6%	105K	+2% (noise)
compositeIndexRangeScan	117	156	+33%	112	−28%
secondaryIndexLookup	55.1	57.9	+5%	46.8	−19%

Latency (ms, lower = better)

Benchmark	Baseline NIO	Current NIO	Δ NIO	JUring	Δ JUring vs NIO
coldCacheScan	1,322	863	−35%	1,644	1.9× (regression)
highChurnCompaction	3,494	3,484	—	2,497	−28%
joinThreeWay	2,751	2,396	−13%	6,820	2.8× (regression)
joinIndexedNestedLoop	778	484	−38%	507	+5% (noise)
joinAggregateGroupBy	2,273	2,207	—	2,294	+4% (noise)
bulkInsert	4,586	4,990	+9% (noise)	4,916	−1% (noise)
compaction	702	704	—	719	+2% (noise)
indexCreationAndCompaction	15,412	16,450	+7%	19,407	+18% (regression)

My ultimate goal was to speed up MVStore using a JUring-based FileChannel. JUring is a Panama wrapper around liburing; please see https://github.com/manticore-projects/JUring/tree/filechannel and its DB-like benchmarks.

Technically it works, but there is no benefit yet from the JUringFileChannel.
However, parallel read/write on plain NIO does seem to beat the NIO baseline, and I wonder whether that is interesting on its own.

Full benchmarks attached.
You should be able to run them yourself with different workload factors. Beware: at 100% workload the benchmarks take more than 1 hour and need more than 10 GB of disk space. Also, please use EXT4 or XFS (and avoid compression, encryption, BTRFS etc.). Last but not least, we try to flush the Linux cache between runs, which depends on sudo; please follow the warnings.

Any questions, concerns or recommendations are most welcome. Please just let me know how to make this more useful.

human_baseline.txt
human.txt

Move PageSerializationManager from FileStore inner class to a standalone class in org.h2.mvstore.

Core serialization state (WriteBuffer, ToC, page numbering, position encoding, checksum patching) is self-contained. FileStore side-effects (cachePage, accountForRemovedPage, accountForWrittenPage, cacheToC, countNewPage) are now injected through a Callback interface, wired by the new factory method FileStore.createPageSerializationManager().

Pure structural extraction -- no behavior change. Enables future parallel serialization by allowing independent PSM instances per worker thread.

Files changed:
- NEW PageSerializationManager.java -- extracted class + Callback interface
- MOD FileStore.java -- inner class removed, factory method added
- MOD Page.java -- import path, write() signature unqualified
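For readers who want to picture the extraction, here is a minimal sketch of the callback-injection pattern this commit describes. It is not the PR's code: `PsmSketch`, `StoreSketch`, and the two callback signatures are hypothetical simplifications, and only two of the five side effects named above are modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the extracted serialization manager.
final class PsmSketch {
    /** Store-side effects injected by the owner (signatures are assumptions). */
    interface Callback {
        void cachePage(long pagePos);
        void accountForWrittenPage(int pageLength);
    }

    private final Callback callback;
    private int pageCount;

    PsmSketch(Callback callback) {
        this.callback = callback;
    }

    /** Called once per serialized page; the manager holds no store reference. */
    void onPageSerialized(long pagePos, int pageLength) {
        pageCount++;
        callback.cachePage(pagePos);
        callback.accountForWrittenPage(pageLength);
    }

    int getPageCount() {
        return pageCount;
    }
}

// Hypothetical store that wires its own side effects into each manager it creates.
final class StoreSketch {
    final List<Long> cached = new ArrayList<>();
    long bytesWritten;

    PsmSketch createPageSerializationManager() {
        return new PsmSketch(new PsmSketch.Callback() {
            @Override
            public void cachePage(long pagePos) {
                cached.add(pagePos);
            }

            @Override
            public void accountForWrittenPage(int pageLength) {
                bytesWritten += pageLength;
            }
        });
    }
}
```

Because the manager only sees the interface, several independent instances can be created from the same store, which is the property the later parallel steps rely on.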

Buffer a saved chunk's on-disk content into memory only after a second page is read from that chunk, not on first access. This avoids wasting I/O and memory on scattered single-page lookups (e.g. secondary index probes) while still accelerating sequential/range access patterns where multiple pages from the same chunk are read.

Mechanism:
- First page read from a chunk: sets volatile hint flag, does normal per-page I/O (no regression vs. baseline)
- Second page read: reads the entire chunk (if within threshold) into Chunk.readBuffer; all subsequent reads slice from memory
- resolveChunkBuffer() checks readBuffer -> buffer -> hint-gated full-chunk read

New fields on Chunk:
- volatile ByteBuffer readBuffer -- cached on-disk content
- volatile boolean readBufferHint -- set on first read, triggers buffering on second read
- invalidateReadBuffer() -- clears both on block relocation

Chunk.readBufferForPage() restructured:
- New resolveChunkBuffer() implements the two-hit buffering policy
- PAGE_LARGE length pre-read satisfied from cached buffer when available
- All page slicing unified through the resolved chunk buffer

Chunk.readToC() also checks readBuffer before per-region I/O.

FileStore gains a configurable threshold:
- chunkReadCacheMaxBytes field (default 4 MB, 0 to disable)
- getChunkReadCacheMaxBytes() / setChunkReadCacheMaxBytes(int)

FileStore.readPage() is unchanged -- buffering is fully transparent.

Files changed:
- MOD Chunk.java -- readBuffer/readBufferHint fields, invalidateReadBuffer(), resolveChunkBuffer(), modified readBufferForPage() and readToC()
- MOD FileStore.java -- chunkReadCacheMaxBytes field + getter/setter

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
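The two-hit policy can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's Chunk code: `TwoHitChunkSketch` is a hypothetical class, a byte array stands in for the file, and the two counters exist only to make the behaviour observable.

```java
import java.nio.ByteBuffer;

// Hypothetical model of a chunk that buffers itself on the second page read.
final class TwoHitChunkSketch {
    private final byte[] onDisk;   // stands in for the chunk's on-disk bytes
    private final int maxBytes;    // threshold; 0 disables buffering
    private volatile ByteBuffer readBuffer;
    private volatile boolean readBufferHint;
    int perPageReads;              // demo counter: reads that went "to disk"
    int fullChunkReads;            // demo counter: whole-chunk loads

    TwoHitChunkSketch(byte[] onDisk, int maxBytes) {
        this.onDisk = onDisk;
        this.maxBytes = maxBytes;
    }

    ByteBuffer readPage(int offset, int length) {
        ByteBuffer buf = readBuffer;
        if (buf == null && maxBytes > 0 && onDisk.length <= maxBytes) {
            if (readBufferHint) {
                // Second hit: load the whole chunk once and keep it.
                fullChunkReads++;
                buf = ByteBuffer.wrap(onDisk.clone()).asReadOnlyBuffer();
                readBuffer = buf;
            } else {
                // First hit: only remember that this chunk was touched.
                readBufferHint = true;
            }
        }
        if (buf != null) {
            // Slice the page out of the in-memory copy.
            ByteBuffer view = buf.duplicate();
            view.limit(offset + length);
            view.position(offset);
            return view.slice();
        }
        // Normal per-page I/O path.
        perPageReads++;
        return ByteBuffer.wrap(onDisk, offset, length).slice();
    }

    void invalidateReadBuffer() {
        readBuffer = null;
        readBufferHint = false;
    }
}
```

In this sketch two threads racing on the second hit could both load the chunk; that costs a redundant read but does not return wrong data, since both copies are identical.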

When a cursor descends from a NonLeaf to a new leaf page, submit a best-effort background prefetch for the next sibling child page. This overlaps the I/O for leaf N+1 with processing of leaf N, halving effective I/O latency for sequential range scans.

The prefetch uses getChildPagePos() to obtain the sibling's position without loading it, then submits to ForkJoinPool.commonPool(). If the page is already cached, the prefetch is a no-op. If the background read fails, it is silently ignored -- the page will be demand-loaded normally. Works in both forward and reverse cursor directions.

Cursor.hasNext():
- Track whether we actually descended through NonLeaf nodes
- After descent, call prefetchNextSibling() with the parent's position
- prefetchNextSibling() computes next/prev sibling index from the parent's children[] array and calls MVMap.prefetchPage()

New plumbing (thin wrappers at each layer):
- MVMap.prefetchPage(pos) -> MVStore.prefetchPage(map, pos) -> FileStore.prefetchPage(map, pos)
- FileStore.prefetchPage() checks cache, skips unsaved pages, submits readPage() to ForkJoinPool.commonPool()

Files changed:
- MOD Cursor.java -- descended flag, prefetchNextSibling() method
- MOD MVMap.java -- prefetchPage(pos) wrapper
- MOD MVStore.java -- prefetchPage(map, pos) wrapper
- MOD FileStore.java -- prefetchPage() impl + ForkJoinPool import

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

Replace the single-sibling prefetchNextSibling() with prefetchAhead() which submits up to PREFETCH_WINDOW (4) upcoming sibling child pages when the cursor crosses a leaf boundary. Prefetch is directional -- only siblings ahead in the scan direction are submitted.

This is a cursor-scoped alternative to the global NonLeaf readPage() hook approach, which regressed point lookups by prefetching children indiscriminately on every NonLeaf load. By keeping prefetch in the cursor, only scan workloads pay the cost, and the scan direction constrains which children are prefetched.

Cursor.hasNext():
- After descent to a new leaf, call prefetchAhead() with the parent's position and scan direction
- prefetchAhead() iterates the parent's children[] in scan direction, submitting up to PREFETCH_WINDOW pages via MVMap.prefetchPage()

FileStore.java: reverts to Step 3 state (no readPage hook, no IN_PREFETCH ThreadLocal, no nonLeafPrefetchWindow field, no prefetchChildren method).

Files changed:
- MOD Cursor.java -- PREFETCH_WINDOW constant, prefetchAhead() replaces prefetchNextSibling()

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
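A minimal sketch of the directional window selection, for illustration only. `PrefetchWindowSketch` is a hypothetical helper: it only picks which sibling positions would be submitted and does no I/O, and treating position 0 as "child not yet saved" is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: selects sibling page positions ahead of the cursor.
final class PrefetchWindowSketch {
    static final int PREFETCH_WINDOW = 4;

    /**
     * Returns up to PREFETCH_WINDOW child positions that lie ahead of
     * currentIndex in scan direction; siblings behind the cursor are never
     * returned.
     */
    static List<Long> prefetchAhead(long[] childPositions, int currentIndex,
                                    boolean reverse) {
        List<Long> out = new ArrayList<>();
        int step = reverse ? -1 : 1;
        for (int i = currentIndex + step;
                i >= 0 && i < childPositions.length && out.size() < PREFETCH_WINDOW;
                i += step) {
            long pos = childPositions[i];
            if (pos != 0) { // assumption: 0 marks a child that was never saved
                out.add(pos);
            }
        }
        return out;
    }
}
```

With seven children and the cursor on child 1, a forward scan selects children 2 to 5, while a reverse scan selects only child 0.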

Add SerializedPageRecord to PageSerializationManager that captures buffer-level layout information for each serialized page: buffer offset, page length, type, composed pagePos, and ToC element. The Page reference is attached in onPageSerialized().

This is pure bookkeeping -- no behavior change. The record list provides the data needed by the upcoming rebasePositions() method (Step 5b) to adjust all page positions when merging per-map local buffers into a global buffer at a different base offset.

Changes to getPagePosition():
- Appends a new SerializedPageRecord after computing the position

Changes to onPageSerialized():
- Sets the Page reference on the most recent record

New types:
- SerializedPageRecord (public static final inner class)

New methods:
- getSerializedPages() -- returns the record list

Files changed:
- MOD PageSerializationManager.java

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

Add rebasePositions(int baseOffset) to PageSerializationManager that adjusts all page positions when a local serialization buffer is merged into a global chunk buffer at a non-zero offset.

The method performs two passes:

Pass 1 -- for each serialized page:
- Recomposes the ToC element with offset + baseOffset
- Recomposes pagePos from the rebased ToC element
- Updates the ToC list entry
- Patches the check value in the buffer (check incorporates offset)
- CAS-updates Page.pos via new Page.rebasePos()

Pass 2 -- for each NonLeaf page:
- Parses the page header to locate the child-pointer region
- Replaces child pointer longs that reference pages within this buffer (looked up via old->new position map)
- Leaves child pointers to pages from other chunks untouched

SerializedPageRecord gains a mapId field (needed to recompose ToC elements with the rebased offset).

Page.java gains package-private rebasePos(long expected, long new) that atomically updates Page.pos via posUpdater CAS, throwing MVStoreException on mismatch (indicates a rebase logic bug).

Files changed:
- MOD PageSerializationManager.java -- mapId in record, rebasePositions(), rebaseChildPointers()
- MOD Page.java -- rebasePos() method, import/signature fixes for top-level PSM

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
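The two passes can be illustrated with a small model. This is a sketch under loud assumptions: the position layout in `compose()` is a simplified, hypothetical encoding (chunk id, 32-bit offset, 1-bit type), not MVStore's real bit layout, and checksums and ToC entries are left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of rebasing page positions from local to global offsets.
final class RebaseSketch {
    // Simplified, hypothetical layout: [chunkId | 32-bit offset | 1-bit type].
    static long compose(int chunkId, int offset, int type) {
        return ((long) chunkId << 33) | ((offset & 0xFFFFFFFFL) << 1) | (type & 1);
    }

    static int chunkOf(long pos) {
        return (int) (pos >>> 33);
    }

    static int offsetOf(long pos) {
        return (int) (pos >>> 1);
    }

    static long rebase(long pos, int baseOffset) {
        return compose(chunkOf(pos), offsetOf(pos) + baseOffset, (int) (pos & 1));
    }

    /** Pass 1: move every page of the local buffer and record old -> new. */
    static Map<Long, Long> rebasePages(long[] pagePositions, int baseOffset) {
        Map<Long, Long> oldToNew = new HashMap<>();
        for (int i = 0; i < pagePositions.length; i++) {
            long old = pagePositions[i];
            long moved = rebase(old, baseOffset);
            oldToNew.put(old, moved);
            pagePositions[i] = moved;
        }
        return oldToNew;
    }

    /** Pass 2: rewrite only child pointers that refer to pages in this buffer. */
    static void rebaseChildPointers(long[] childPointers, Map<Long, Long> oldToNew) {
        for (int i = 0; i < childPointers.length; i++) {
            Long moved = oldToNew.get(childPointers[i]);
            if (moved != null) {
                childPointers[i] = moved;
            }
        }
    }
}
```

The old-to-new map is what keeps pass 2 safe: a pointer into another chunk is not in the map, so it is left untouched.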

Refactor serializeToBuffer() so that each changed map tree is serialized through its own PageSerializationManager (with sequential pageNoBase to avoid page-number collisions), while all PSMs still write into the shared global WriteBuffer. This means page positions are correct from the start -- no local buffers and no rebase are needed.

The layout map continues to get its own PSM (Phase 2), and writeMergedToC() concatenates all per-map ToC entries followed by the layout ToC into a single cached tocArray.

This is the correctness gate: execution is still single-threaded and the buffer layout is identical to the original code, so any regression indicates a bug in the PSM-per-map split. Local buffers and rebase (needed for Step 5d parallelism) will be introduced as a separate validated step.

New types:
- FileStore.MapSerializationResult -- holds per-map PSM + root info

New/changed methods:
- FileStore.writeMergedToC() -- merges ToC from all PSMs
- FileStore.createPageSerializationManager(chunk, buff, pageNoBase)
- PSM constructor with pageNoBase, PSM.getPageCount()

Files changed:
- MOD FileStore.java -- serializeToBuffer rewritten, MapSerializationResult, writeMergedToC
- MOD PageSerializationManager.java -- pageNoBase field and constructor, getPageCount()

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

Serialize each MVMap's B-tree into its own local WriteBuffer and PageSerializationManager, then copy into the global chunk buffer at the map's assigned base offset. A three-pass rebase corrects all position references from local to global coordinates:

- Pass 1 -- Patch page checksums, CAS-update each Page.pos field, and rewrite ToC entries in the global buffer.
- Pass 2 -- Fix NonLeaf child-pointer slots in the on-disk buffer so they reference rebased positions.
- Pass 3 -- Sync in-memory PageReference.pos values via syncChildRefsAfterRebase(), preventing stale local offsets from being used when cached child pages are evicted and later re-read from disk.

Pass 3 fixes a subtle corruption bug: after rebase the on-disk data was correct, but NonLeaf.children[].pos still held the pre-rebase local offset. If memory pressure evicted a child page (nulling the strong reference), getChildPage() would read at the old local offset, landing in the chunk header/layout area and producing "expected page length 4..384, got 1869575226" errors (the garbage value decodes to ASCII "olg*" from layout strings).

The layout map is serialised last, directly into the global buffer at the pre-reserved slot. A merged table-of-contents is built from all per-map PSMs.

Serialization is still sequential in this step; the per-map split establishes the precondition for parallel serialization in step 5d.

Files changed:
- FileStore.java -- phase 2 copy-then-rebase loop
- PageSerializationManager.java -- rebasePositions(offset, buffer) with explicit ByteBuffer param; pass 3 child-ref sync
- Page.java -- syncChildRefsAfterRebase() on NonLeaf; no-op on Leaf

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

Parallelize the B-tree serialization of independent MVMap trees during chunk writes. Each map's writeUnsavedRecursive() now runs concurrently via the common ForkJoinPool when two or more maps have unsaved pages; single-map commits remain sequential.

The key enablers:

countUnsavedPages() -- new abstract method on Page, implemented in Leaf (trivial), NonLeaf (recursive), and IncompleteNonLeaf (recursive, skips self when incomplete). Called sequentially in Phase 1a to pre-compute per-map page counts, which are summed into cumulative pageNoBase values so that each parallel worker writes pages with globally-unique page numbers.

Deferred callbacks -- PageSerializationManager gains a deferCallbacks flag and applyDeferredCallbacks() method. When deferred, onPageSerialized() records side-effect data (cache insert, chunk accounting, removed-page tracking) in the SerializedPageRecord instead of firing callbacks immediately. Callbacks are replayed sequentially in Phase 2 after rebase, ensuring:
(a) Thread safety -- cachePage(), accountForWrittenPage(), and accountForRemovedPage() are not thread-safe.
(b) Correct cache keys -- pages are cached with their final global positions rather than pre-rebase local offsets.

Serialization phases are now:
- Phase 1a -- Sequential: partition changed maps, count unsaved pages, compute cumulative pageNoBase values.
- Phase 1b -- Parallel: each map serializes into its own WriteBuffer + deferred PSM.
- Phase 2 -- Sequential: copy local buffers → global, rebase positions, replay deferred callbacks, record roots.
- Phase 3 -- Sequential: layout map serialization.
- Phase 4 -- Sequential: merged ToC from all PSMs.

Files changed:
- Page.java -- countUnsavedPages() abstract + Leaf/NonLeaf/IncompleteNonLeaf
- PageSerializationManager.java -- deferCallbacks flag, deferred data fields on SerializedPageRecord, applyDeferredCallbacks()
- FileStore.java -- Phase 1a/1b split with IntStream.parallel(), deferred PSM factory, applyDeferredCallbacks call in Phase 2

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
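The phase split can be modelled in a few lines. This is a rough sketch, not the PR's FileStore code: `ParallelSerializeSketch` is hypothetical, "serializing a map" is reduced to assigning page numbers, and the deferred callbacks are reduced to appending to a list during the sequential replay.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical model of Phase 1a (counting), 1b (parallel), and 2 (replay).
final class ParallelSerializeSketch {
    /** Phase 1a: exclusive prefix sum of per-map unsaved page counts. */
    static int[] pageNoBases(int[] unsavedPageCounts) {
        int[] bases = new int[unsavedPageCounts.length];
        int next = 0;
        for (int i = 0; i < unsavedPageCounts.length; i++) {
            bases[i] = next;
            next += unsavedPageCounts[i];
        }
        return bases;
    }

    /** Returns the page numbers in the order the sequential replay sees them. */
    static List<Integer> serialize(int[] unsavedPageCounts) {
        int[] bases = pageNoBases(unsavedPageCounts);
        int n = unsavedPageCounts.length;
        int[][] local = new int[n][];

        // Phase 1b: parallel only when two or more maps have work.
        IntStream range = IntStream.range(0, n);
        if (n >= 2) {
            range = range.parallel();
        }
        range.forEach(m -> {
            int[] pageNos = new int[unsavedPageCounts[m]];
            for (int p = 0; p < pageNos.length; p++) {
                pageNos[p] = bases[m] + p; // globally unique by construction
            }
            local[m] = pageNos; // each worker writes only its own slot
        });

        // Phase 2: sequential replay of what the workers recorded.
        List<Integer> replayOrder = new ArrayList<>();
        for (int[] pageNos : local) {
            for (int no : pageNos) {
                replayOrder.add(no);
            }
        }
        return replayOrder;
    }
}
```

Because every worker owns a disjoint page-number range and its own output slot, the workers need no locking, and the non-thread-safe bookkeeping happens only in the sequential replay.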

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

andreitokar · 2026-02-18T20:48:22Z

Hi @manticore-projects, great job!
While your repository containing the benchmarks is unavailable (404), the results are somewhat predictable, judging by the benchmark names.
They clearly show that page prefetch would help, and that is something we may implement. Instead of a fixed-size static window, the cursor's “upper” bound can be analyzed, to avoid prefetching for very short ranges. Prefetch can also be short-circuited more aggressively to minimize the penalty for the in-memory case.
On the other hand, the idea of caching chunks does not look that promising. Chunks tend to be relatively big (1-4 MB) collections of random pages, and hitting the right page is purely a coincidence. I believe that giving that memory to the page cache may be a bigger bang for the buck. BTW, in your tests, the baseline case should have that bigger cache, to be an apples-to-apples comparison.
Whether the parallel serialization is going to be a net positive, my guess is as good as yours, but all my attempts to parallelize any light stuff were negative; the forking overhead was too high.

manticore-projects · 2026-02-19T17:35:51Z

Thank you for your warm feedback. I have made the benchmark repository public: https://github.com/manticore-projects/H2Benchmark
Since your response was positive, I would like to suggest next steps:

I will port the changes to the latest H2 version
I will take out any Panama/JDK16 methods and make it JDK11 compatible

Then we can improve the benchmarks together. In particular, I would like to ask for some help or guidance on the bulkInsert benchmark, because I can't get it stable. The measurement jumps up and down and shows massive outliers across all tests, and I don't know why.

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

manticore-projects · 2026-02-27T17:00:01Z

I have re-based everything onto latest H2 Git Origin/Master and also back-ported to JDK 11.

manticore-projects · 2026-03-01T12:00:27Z

I have had a close look at the CI tests:

a) The new parallel version of MVStore needs more memory (which looks acceptable to me because, first, the days of 128 MB are long over and, second, the changes aim at large database files >2 GB and not at small 1000-row DBs).
b) There are two tests where the number of reads and the result of compact are expected. Those fail of course (because the logic has changed).
c) There are some tests related to "lazy execution" that fail for me even before this PR. I do not know what this is about.
So how to go about this? Can/shall I increase the memory to 1G or 2G for the tests (which worked for me)? Can/shall I adjust the tests for touched nodes and compact results?

The good news is: all tests succeed with this PR and there is no corruption anywhere, not even in my large DBs!

…ache

Implement the three improvements suggested during code review:

## Cursor prefetch with B-tree upper-bound analysis (Cursor.java)

Add `maybePrefetchSiblings()`, called on every internal-node descent. Rather than a fixed sliding window, it uses the B-tree's own separator keys to determine which sibling subtrees actually fall within the cursor's `to` bound, stopping as soon as a separator key crosses the boundary. This avoids issuing any I/O for subtrees outside the requested range, making short-range scans pay only for what they need.

Two short-circuit guards eliminate overhead for warm workloads:
- If the store has no backing file, return immediately (pure in-memory).
- If the first candidate sibling is already in the page cache, skip ForkJoin submission entirely: the working set is hot and no I/O is needed. This reduces the cost of repeated cursor iteration over cached data to a single cache probe per internal node.

A `PREFETCH_MIN_SIBLINGS = 2` floor suppresses prefetch at the very tail of a subtree where scheduling overhead would outweigh any benefit.

## Disable chunk read cache by default (FileStore.java)

Change `chunkReadCacheMaxBytes` default from 4 MB to 0. The chunk buffer heuristic buffers entire 1–4 MB raw chunk blobs on a second page miss from the same chunk. In practice, chunks are large collections of randomly-scattered pages: a cursor traversal rarely hits the same chunk twice, so the second-hit rate is low. The 4 MB is better spent on the page cache, which is keyed by page position and reuses actually-hot decoded Page objects across all access patterns. Callers that explicitly set `setChunkReadCacheMaxBytes()` are unaffected.

## Add `isPageCached()` helper (FileStore.java)

Package-private probe into `CacheLongKeyLIRS` without triggering a load. Used by the warm-cache short-circuit above.

## Document the memory tradeoff (MVStore.java)

Extend the `Builder.cacheSize()` Javadoc to explain why the chunk read cache was disabled and where that memory should go instead.

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
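The upper-bound analysis can be pictured with the following sketch. It is not the PR's `maybePrefetchSiblings()`: `UpperBoundPrefetchSketch` is hypothetical, keys are plain `long` values, and it assumes the usual B-tree convention that `separatorKeys[i - 1]` is the lowest key reachable through child `i`.

```java
// Hypothetical model of bounding prefetch by the cursor's upper key.
final class UpperBoundPrefetchSketch {
    static final int PREFETCH_MIN_SIBLINGS = 2;

    /**
     * Counts siblings after currentChild whose key range can still intersect
     * the scan, stopping at the first separator beyond the bound. A null
     * bound means the scan has no upper limit. Returns 0 when fewer than
     * PREFETCH_MIN_SIBLINGS qualify.
     */
    static int siblingsToPrefetch(long[] separatorKeys, int currentChild, Long to) {
        int childCount = separatorKeys.length + 1;
        int count = 0;
        for (int i = currentChild + 1; i < childCount; i++) {
            long lowestKeyInChild = separatorKeys[i - 1];
            if (to != null && lowestKeyInChild > to) {
                break; // this sibling and all later ones lie past the bound
            }
            count++;
        }
        return count >= PREFETCH_MIN_SIBLINGS ? count : 0;
    }
}
```

With separators 10, 20, 30, 40 and the cursor in the first child, a scan bounded at 25 would prefetch two siblings, while a scan bounded at 5 or 15 would prefetch none.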

manticore-projects · 2026-03-01T21:02:26Z

Read throughput (ops/s, higher = better)

Benchmark	Baseline	NIO	vs base	JUring	vs base
compositeIndexRangeScan	192	202	✅ +5.1%	161	❌ -16.2%
randomCacheMissReads	146.0k	165.2k	✅ +13.1%	160.1k	✅ +9.7%
secondaryIndexLookup	117	120	✅ +2.9%	109	❌ -6.9%

Mixed workload throughput (ops/s, higher = better)

Benchmark	Baseline	NIO	vs base	JUring	vs base
mixed	1.82M	1.83M	✅ +0.1%	1.94M	✅ +6.1%
mixed:testSelect	1.68M	1.69M	✅ +0.9%	1.79M	✅ +6.8%
mixed:testUpdate	147.9k	134.7k	❌ -8.9%	146.4k	❌ -1.0%

Write / bulk latency (ms, lower = better)

Benchmark	Baseline	NIO	vs base	JUring	vs base
bulkInsert	1734ms	2040ms	❌ +17.6%	1731ms	✅ -0.2%
bulkInsertWithIndex	1821ms	1950ms	❌ +7.1%	1959ms	❌ +7.6%
benchmarkDBCreation	7ms	9ms	❌ +38.5%	11ms	❌ +54.2%

Scan latency (ms, lower = better)

Benchmark	Baseline	NIO	vs base	JUring	vs base
coldCacheScan	699ms	664ms	✅ -4.9%	680ms	✅ -2.6%
indexCreationAndCompaction	12485ms	12101ms	✅ -3.1%	12285ms	✅ -1.6%

Compaction latency (ms, lower = better)

Benchmark	Baseline	NIO	vs base	JUring	vs base
compaction	399ms	408ms	❌ +2.4%	394ms	✅ -1.2%
highChurnCompaction	1735ms	2264ms	❌ +30.5%	1796ms	❌ +3.5%

Join latency (ms, lower = better)

Benchmark	Baseline	NIO	vs base	JUring	vs base
joinAggregateGroupBy	1367ms	1234ms	✅ -9.8%	1339ms	✅ -2.1%
joinChurnCompaction	531ms	609ms	❌ +14.8%	627ms	❌ +18.2%
joinIndexedNestedLoop	508ms	508ms	✅ -0.1%	524ms	❌ +3.1%
joinThreeWay	993ms	991ms	✅ -0.2%	957ms	✅ -3.6%

JUring wins with 10 green vs 7 red, with the wins concentrated exactly where they matter most: mixed read workloads (+6.1%/+6.8%) and scans. NIO is positive on reads and scans but still dragged down by write paths.

Biggest problem right now is the volatility of the WRITE tests.

human.txt
human_baseline.txt

Add background page prefetching to Cursor, triggered during B-tree descent for bounded range scans. Four iterations of benchmarking shaped the final design; the key constraints discovered empirically are documented below.

## What was added

### Cursor.maybePrefetchSiblings() (Cursor.java)

Called once per internal-node level during hasNext() descent, collects the sibling child positions that the cursor will visit after the current subtree and submits them for background I/O.

Four guards control when prefetch is suppressed:

    if (to == null) return;

The most important guard. Unbounded full-table scans saturate I/O on their own; adding background reads doubles traffic and causes the foreground thread to race its own prefetch tasks. Discovered after coldCacheScan regressed +92% without it.

    if (sibEnd - sibStart < PREFETCH_MIN_SIBLINGS) return; // = 8

Suppresses prefetch at the tail of nearly-exhausted subtrees where scheduling overhead exceeds latency hidden. Threshold of 8 chosen empirically; below this the ForkJoin submission cost dominates.

    if (fs.isPageCached(firstPos)) return;

Warm-path short-circuit: if the first candidate sibling is already in the page cache the working set is hot; skip entirely to avoid unnecessary ForkJoin overhead on repeated warm iterations.

    descentDepth < MAX_PREFETCH_DEPTH // = 1

Limits prefetch to the top two levels of each descent. Without this, prefetch fires at every internal node on every descent, re-queuing the same pages redundantly and causing foreground/background racing.

Upper-bound analysis uses the B-tree's own separator keys to skip siblings whose entire key range lies outside [from, to], so short range scans pay only for what they need.

### FileStore changes (FileStore.java)

isPageCached(): package-private cache probe without triggering a load, used by the warm-path short-circuit above.

prefetchPages(): redesigned submission path:
- NIO: one ForkJoin task reads all positions sequentially. The previous per-page task submission (N round-trips) dominated cost for short prefetch lists and caused measurable write-path regression.
- io_uring: one ForkJoin task submits in windows of MAX_PREFETCH_BATCH (= 32) SQEs. Windowing prevents ring saturation under concurrent read workloads; demand reads can interleave between windows.

chunkReadCacheMaxBytes default: 4 MB → 0. Chunks are large collections of randomly-scattered pages; the hit rate is low and the memory is better spent on the page cache.

## Benchmark summary (baseline NIO vs final, single-threaded)

randomCacheMissReads        +13.1% NIO   +9.7% JUring  ✓
compositeIndexRangeScan      +5.1% NIO  -16.2% JUring  (JUring open)
secondaryIndexLookup         +2.9% NIO   -6.9% JUring
mixed (read-heavy)           +0.9% NIO   +6.8% JUring  ✓
coldCacheScan                -4.9% NIO   -2.6% JUring  ✓ (was +92%)
indexCreationAndCompaction   -3.1% NIO   -1.6% JUring  ✓ (was +25%)
joinAggregateGroupBy         -9.8% NIO   -2.1% JUring  ✓
bulkInsert                  +17.6% NIO   -0.2% JUring  (NIO write open)
testUpdate                   -8.9% NIO   -1.0% JUring  (write contention)

JUring benefits most from prefetch due to true kernel-level batching. NIO write-path regressions (bulkInsert, testUpdate) reflect background task I/O contention during write-heavy operations; further work needed.

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
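Taken together, the guards amount to a single predicate. The sketch below is one possible way to combine them, not the PR's code: `PrefetchGuardSketch` and its parameter list are hypothetical, and the ordering (cheap checks first, the cache probe result last) is an assumption.

```java
// Hypothetical combination of the prefetch guards into one predicate.
final class PrefetchGuardSketch {
    static final int PREFETCH_MIN_SIBLINGS = 8;
    static final int MAX_PREFETCH_DEPTH = 1;

    static boolean shouldPrefetch(Object to, int sibStart, int sibEnd,
                                  boolean firstSiblingCached, int descentDepth) {
        if (to == null) {
            return false; // unbounded scan: foreground already saturates I/O
        }
        if (descentDepth >= MAX_PREFETCH_DEPTH) {
            return false; // too deep: would re-queue the same pages
        }
        if (sibEnd - sibStart < PREFETCH_MIN_SIBLINGS) {
            return false; // tail of the subtree: scheduling cost dominates
        }
        if (firstSiblingCached) {
            return false; // warm path: working set is already in the cache
        }
        return true;
    }
}
```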

manticore-projects · 2026-03-02T09:47:27Z

@andreitokar Do you have any idea how to "stabilize" the bulkInsert benchmark, please? It shows random abysmal deviations, and I have tried all my magic without achieving much:

H2JuringBenchmark.bulkInsert                       nio     ss   14     2040.289 ±   957.705  ms/op
H2JuringBenchmark.bulkInsert                    juring     ss   14     1730.756 ±    20.929  ms/op
H2JuringBenchmark.bulkInsertWithIndex              nio     ss   14     1949.949 ±    92.450  ms/op
H2JuringBenchmark.bulkInsertWithIndex           juring     ss   14     1959.359 ±    77.436  ms/op

Iteration   1: /run/media/are/test/juring-test: 258.1 MiB (270675968 bytes) trimmed
4697.761 ms/op
Iteration   2: /run/media/are/test/juring-test: 260.7 MiB (273403904 bytes) trimmed
3098.378 ms/op
Iteration   3: /run/media/are/test/juring-test: 260.7 MiB (273403904 bytes) trimmed
1730.942 ms/op
Iteration   4: /run/media/are/test/juring-test: 260.5 MiB (273203200 bytes) trimmed
1695.465 ms/op
Iteration   5: /run/media/are/test/juring-test: 260.5 MiB (273203200 bytes) trimmed
1702.595 ms/op
Iteration   6: /run/media/are/test/juring-test: 260.7 MiB (273403904 bytes) trimmed
1702.411 ms/op
Iteration   7: /run/media/are/test/juring-test: 260.7 MiB (273403904 bytes) trimmed
1707.768 ms/op

The first and second iterations render this benchmark useless.
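To put a number on that, the seven iterations logged above can be averaged with and without the first two. This is only arithmetic on the values already shown; `WarmupEffectSketch` is a hypothetical helper and says nothing about why the first iterations are slow.

```java
import java.util.Arrays;

// Hypothetical helper: mean of a run, optionally skipping leading iterations.
final class WarmupEffectSketch {
    // The seven ms/op values from the iteration log above.
    static final double[] ITERATIONS = {
        4697.761, 3098.378, 1730.942, 1695.465, 1702.595, 1702.411, 1707.768
    };

    static double meanSkipping(double[] msPerOp, int skip) {
        return Arrays.stream(msPerOp).skip(skip).average().orElse(Double.NaN);
    }
}
```

Over all seven iterations the mean is roughly 2334 ms/op, while iterations 3 to 7 alone average roughly 1708 ms/op, so the two slow starts shift the reported score by more than a third.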

…iority gate

Prior to this change, prefetchPages() submitted background read tasks to ForkJoinPool.commonPool(). The write-path parallel serialization (serializeToBuffer Phase 1) also runs on commonPool via IntStream.parallel(). Under write-heavy workloads, prefetch I/O tasks occupied common-pool threads and starved write serialization, causing significant regressions:

- bulkInsert +18% slower (ms/op)
- highChurnCompaction +30% slower (ms/op)
- benchmarkDBCreation +38% slower (ms/op)
- mixed:testUpdate -9% throughput loss

Changes:

1. Dedicated prefetchExecutor (FileStore)
   Replace all ForkJoinPool.commonPool().execute() calls in prefetchPage() and prefetchPages() with a single-threaded ThreadPoolExecutor:
   - Thread.MIN_PRIORITY daemon thread ("H2-prefetch")
   - Bounded ArrayBlockingQueue(4) + DiscardOldestPolicy: stale prefetch requests are silently dropped under write pressure rather than queuing
   - Shut down in close() before layout map is closed

2. Write-priority gate (FileStore.isWritePipelineBusy)
   New isWritePipelineBusy() checks serializationExecutor and bufferSaveExecutor queue depths. If either has pending work, prefetchPages() returns immediately, ceding full I/O bandwidth to the write path.

3. Mid-task abort
   NIO sequential prefetch loop and io_uring batch windows both re-check isWritePipelineBusy() between pages/windows and abort early if writes arrive after the task has already started.

Results vs baseline (JMH 1.37, JDK 25, single-threaded, G1GC):

- indexCreationAndCompaction +25% (NIO), +15% (juring) 🟢
- coldCacheScan +7% (NIO) ✅
- compositeIndexRangeScan +6% (NIO) ✅
- randomCacheMissReads +4% (NIO), +5% (juring) ✅
- bulkInsert +1% (NIO) [regression resolved]
- highChurnCompaction -1% (NIO) [regression resolved]
- benchmarkDBCreation +1% (NIO) [regression resolved]

Known open issues (pre-existing, not introduced here):
- joinChurnCompaction -15% on both backends (under investigation)

Signed-off-by: manticore-projects <andreas@manticore-projects.com>
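The executor configuration and the mid-task abort described in this commit map onto standard `java.util.concurrent` classes. The sketch below shows one way to set that up; `PrefetchExecutorSketch`, `newPrefetchExecutor()`, and the functional parameters of `prefetchPages()` are hypothetical stand-ins for the FileStore members named above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Hypothetical sketch of a low-priority, bounded, lossy prefetch executor.
final class PrefetchExecutorSketch {
    static ThreadPoolExecutor newPrefetchExecutor() {
        return new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(4),   // at most 4 pending requests
                r -> {
                    Thread t = new Thread(r, "H2-prefetch");
                    t.setDaemon(true);         // never blocks JVM shutdown
                    t.setPriority(Thread.MIN_PRIORITY);
                    return t;
                },
                // When the queue is full, drop the oldest request, not the newest.
                new ThreadPoolExecutor.DiscardOldestPolicy());
    }

    /**
     * Sequential prefetch loop with mid-task abort: re-checks the gate before
     * every page and stops as soon as the write pipeline reports work.
     * Returns the number of pages actually read.
     */
    static int prefetchPages(long[] positions, BooleanSupplier writePipelineBusy,
                             LongConsumer readPage) {
        int read = 0;
        for (long pos : positions) {
            if (writePipelineBusy.getAsBoolean()) {
                break;
            }
            readPage.accept(pos);
            read++;
        }
        return read;
    }
}
```

Dropping the oldest request suits prefetch because the cursor has most likely moved past the pages that were queued first; losing a request costs nothing more than a later demand read.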

manticore-projects · 2026-03-02T19:44:22Z

We are getting there: READ now has all the advantages and WRITE is on par. In both tables below, a positive Δ means better than baseline, for ms/op as well as for ops/s.

Write Benchmarks

Benchmark	Units	Baseline	NIO Latest	Δ NIO	juring Latest	Δ juring
bulkInsert	ms/op	1,734.2	1,723.8	+0.6% ➖	1,762.6	-1.6% ➖
bulkInsertWithIndex	ms/op	1,821.3	1,914.5	-4.9% 🟡	1,927.1	-5.5% 🟡
mixed:testUpdate	ops/s	147,850	140,945	-4.7% 🟡	139,790	-5.5% 🟡
highChurnCompaction	ms/op	1,735.1	1,745.6	-0.6% ➖	1,804.9	-3.9% 🟡
benchmarkDBCreation	ms/op	6.85	6.81	+0.5% ➖	7.41	-7.6% 🟡
joinChurnCompaction	ms/op	530.6	626.9	-15.4% 🔴	617.5	-14.1% 🔴
indexCreationAndCompaction	ms/op	12,485.4	9,961.9	+25.3% 🟢	10,861.8	+14.9% 🟢

Read / Scan Benchmarks

Benchmark	Units	Baseline	NIO Latest	Δ NIO	juring Latest	Δ juring
compositeIndexRangeScan	ops/s	192.3	202.9	+5.5% ✅	189.5	-1.4% ➖
randomCacheMissReads	ops/s	146,016	152,057	+4.1% ✅	153,702	+5.3% ✅
secondaryIndexLookup	ops/s	116.7	111.9	-4.1% 🟡	116.3	-0.3% ➖
coldCacheScan	ms/op	698.7	650.6	+7.4% ✅	712.4	-1.9% ➖
mixed (total)	ops/s	1,823,488	1,887,557	+3.5% ✅	1,885,723	+3.4% ✅
mixed:testSelect	ops/s	1,675,638	1,746,612	+4.2% ✅	1,745,933	+4.2% ✅
joinAggregateGroupBy	ms/op	1,367.4	1,312.8	+4.2% ✅	1,279.7	+6.9% ✅
joinIndexedNestedLoop	ms/op	508.2	490.7	+3.6% ✅	531.5	-4.4% 🟡
joinThreeWay	ms/op	992.9	996.9	-0.4% ➖	1,018.2	-2.5% 🟡
compaction	ms/op	398.6	408.0	-2.3% 🟡	393.4	+1.3% ➖
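The Δ columns in these two tables appear to be speedups relative to baseline rather than raw changes, which is why a higher ms/op figure carries a negative sign. The formulas below are inferred from the published numbers, not taken from the benchmark code, and `SpeedupDelta` is a hypothetical name.

```java
// Hypothetical helper reproducing the Δ convention inferred from the tables.
final class SpeedupDelta {
    /** Latency (ms/op, lower is better): baseline / current - 1, in percent. */
    static double latencyDeltaPercent(double baselineMs, double currentMs) {
        return (baselineMs / currentMs - 1.0) * 100.0;
    }

    /** Throughput (ops/s, higher is better): current / baseline - 1, in percent. */
    static double throughputDeltaPercent(double baselineOps, double currentOps) {
        return (currentOps / baselineOps - 1.0) * 100.0;
    }
}
```

For example, bulkInsertWithIndex on NIO went from 1,821.3 to 1,914.5 ms/op; the raw change is about +5.1%, while the speedup form gives about -4.9%, matching the table.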

manticore-projects · 2026-04-01T18:32:02Z

How can we move this forward please?

manticore-projects added 14 commits February 16, 2026 00:35

feat: implement JUringFileChannel

f84a0a0

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

feat: increase PREFETCH_WINDOW to 16

f501cbf

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

chore: sync w/o changes

4042eff

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

doc: update documentation

62e1fcc

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

build: Java16 requirement

fad0580

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

manticore-projects requested review from andreitokar, grandinj and katzyn February 17, 2026 12:57

fix: avoid a TOCTOU race

ed7469d

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

manticore-projects added 2 commits February 27, 2026 23:36

feat: integrate io_uring parallel read/write into master

f414e82

chore: backport Java 16 record to Java 11 classes, lower the JDK to 11

20d7858

Signed-off-by: manticore-projects <andreas@manticore-projects.com>

manticore-projects wants to merge 20 commits into h2database:master from manticore-projects:juring

2 participants
