fix: Cap memory consumption during merge #4893
Conversation
Derive a per-merge doc budget from `maintenance_work_mem` (using a hardcoded `MERGE_BYTES_PER_DOC = 1024` estimate) and enforce it in two places:

- `LayeredMergePolicy` splits a layer's candidates whenever the running doc count would exceed the budget, and skips segments that exceed the budget on their own rather than emitting degenerate single-segment candidates.
- `WorkerBuildState::try_merge` trims its chunk size to fit the budget before draining, guaranteeing at least 2 segments per merge so we keep making progress.

Adds three new unit tests covering the split, oversized-segment, and loose-budget cases.
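The two enforcement points above can be sketched roughly as follows. All names and the splitting details are assumptions reconstructed from the description, not the PR's actual code:

```rust
// Hypothetical sketch of the doc-budget derivation and candidate splitting
// described in the PR; names and details are assumptions, not pg_search code.

/// Rough per-doc memory estimate used to derive the budget.
const MERGE_BYTES_PER_DOC: usize = 1024;

/// Docs a single merge is allowed to touch, derived from maintenance_work_mem.
fn merge_doc_budget(maintenance_work_mem_bytes: usize) -> usize {
    maintenance_work_mem_bytes / MERGE_BYTES_PER_DOC
}

/// Split a layer's segments (given as doc counts) into merge candidates that
/// each fit the budget. Segments over the budget on their own are skipped
/// rather than emitted as degenerate single-segment candidates.
fn split_candidates(segment_doc_counts: &[usize], budget: usize) -> Vec<Vec<usize>> {
    let mut candidates: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut running = 0usize;
    for &docs in segment_doc_counts {
        if docs > budget {
            continue; // oversized on its own: not merged at this layer
        }
        if running + docs > budget && !current.is_empty() {
            candidates.push(std::mem::take(&mut current));
            running = 0;
        }
        current.push(docs);
        running += docs;
    }
    if !current.is_empty() {
        candidates.push(current);
    }
    // a merge needs at least two inputs to be worth doing
    candidates.retain(|c| c.len() >= 2);
    candidates
}
```

With a 1000-doc budget, segments of `[400, 400, 400, 2000, 300]` split into `[[400, 400], [400, 300]]`, and the 2000-doc segment is skipped.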
…merge

Move the per-merge doc-budget enforcement out of `try_merge` and into `adjusted_target_segment_count`: floor the segment count by `ceil(reltuples / merge_doc_budget())` so the per-worker chunking math naturally produces merges that fit in `maintenance_work_mem`. This is cleaner than the runtime cap: there's one place that knows about the budget instead of two, no "what if bounded chunk_size < 2" edge case, and the final segment count matches what the planner says upfront.

The `LayeredMergePolicy` doc-budget enforcement stays as-is; it's the safety net for INSERT/VACUUM/background merges that don't go through `build_parallel`.
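A minimal sketch of that floor, assuming the function and variable names from the description:

```rust
// Hypothetical sketch of the segment-count floor described above; the
// signature and names are assumptions based on the commit message.

/// Bump the user's target segment count so that reltuples / n_segments
/// stays under the per-merge doc budget.
fn adjusted_target_segment_count(
    target_segment_count: usize,
    reltuples: f64,
    doc_budget: usize,
) -> usize {
    let min_segments_for_budget = (reltuples / doc_budget as f64).ceil() as usize;
    target_segment_count.max(min_segments_for_budget)
}
```

For example, with `reltuples = 1_000_000` and a 65,536-doc budget, a user target of 4 is bumped to 16, while a small table leaves the target untouched.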
Codecov Report

❌ Your patch check has failed because the patch coverage (38.46%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main    #4893      +/-   ##
==========================================
- Coverage   77.26%   77.20%   -0.06%
==========================================
  Files         194      194
  Lines       52121    52188      +67
==========================================
+ Hits        40269    40294      +25
- Misses      11852    11894      +42
mdashti left a comment:
LGTM if the comments are resolved as suggested.
```rust
/// value is intentionally conservative: tantivy's merge keeps per-field readers
/// and posting buffers in memory, and 1 KB/doc is a rough floor that empirically
/// keeps peak RSS near `maintenance_work_mem` on text-heavy workloads.
const MERGE_BYTES_PER_DOC: usize = 1024;
```
Should this be a GUC? 1 KB/doc is a reasonable floor for text-heavy schemas, but fat-JSON or mixed-fast-field schemas consume way more, and numeric-only ones way less. If the estimate is wrong in prod, we'd want a knob to tune without redeploying; `paradedb.merge_bytes_per_doc` would fit right next to the other GUCs in this file.
```rust
// memory cap. The next layer (or a future merge cycle) can deal with it.
if segment_doc_count > self.doc_budget {
    continue;
}
```
Quick sanity check on "the next layer (or a future merge cycle) can deal with it": `doc_budget` is constant across layers, so if a segment has more docs than the budget here, it'll have more docs than the budget at every layer. We're not deferring it, we're stranding it. That's fine as an OOM cap, but it does mean an oversized segment that accumulates dead docs can't be compacted via merge anymore; vacuum becomes the only path to reclaim those docs. Worth either softening the comment or carving out an exception for single-segment compactions of oversized segments with deletes.
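The carve-out suggested here could look something like the following. This is purely illustrative; the type, field semantics, and decision rule are all hypothetical:

```rust
// Hypothetical sketch of the reviewer's suggested exception: an over-budget
// segment with deletes may still be compacted alone if its live docs fit.

#[derive(Debug, PartialEq)]
enum MergeAction {
    Candidate,    // small enough to join a multi-segment merge
    CompactAlone, // over budget, but dropping deletes brings it under
    Skip,         // genuinely too big; leave it for vacuum
}

struct SegmentInfo {
    doc_count: usize,    // total docs, including deleted ones
    deleted_docs: usize, // docs marked deleted but not yet reclaimed
}

fn classify(seg: &SegmentInfo, doc_budget: usize) -> MergeAction {
    if seg.doc_count <= doc_budget {
        return MergeAction::Candidate;
    }
    let live = seg.doc_count - seg.deleted_docs;
    if seg.deleted_docs > 0 && live <= doc_budget {
        // compacting alone drops the deletes and the result fits the budget
        return MergeAction::CompactAlone;
    }
    MergeAction::Skip
}
```

Under this rule a 2000-doc segment with 1500 deletes would still be compactable against a 1000-doc budget, instead of being stranded.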
```rust
let min_segments_for_budget = (reltuples / doc_budget as f64).ceil() as usize;
if min_segments_for_budget > target_segment_count {
    pgrx::debug1!(
        "bumping target_segment_count from {target_segment_count} to {min_segments_for_budget} to keep per-merge docs under merge_doc_budget ({doc_budget}); reltuples: {reltuples}"
```
The user explicitly asked for a `target_segment_count` in the WITH clause and we silently override it. A `debug1!` is invisible to almost everyone running `CREATE INDEX`. I'd promote this to `notice!` so people understand why their index has more segments than they configured; otherwise the next bug report is going to be "why does my target=4 index have 17 segments."
```rust
// one shot. If the user asked for fewer segments than the doc budget allows, the
// per-worker chunking math would eventually try to merge more docs than fit in memory;
// bumping the target up means each chunk naturally stays under [`gucs::merge_doc_budget`].
let doc_budget = gucs::merge_doc_budget();
```
With N parallel workers, each one can run a merge pulling in `merge_doc_budget` docs concurrently, so peak memory across workers is roughly N × mwm. `adjust_maintenance_work_mem` already splits the writer budget by `nlaunched`, but this floor uses the full mwm. Is that intentional? My naive read is that the build-time floor should use mwm / `nlaunched` so parallel-worker peak memory stays near mwm, but I might be missing something about how tantivy schedules merge memory across workers.
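The arithmetic behind this concern, as a sketch (the function name and the division-by-workers policy are the reviewer's suggestion, not existing code):

```rust
// Hypothetical per-worker budget: divide maintenance_work_mem across the
// launched workers before converting bytes to docs, so N concurrent merges
// together stay near one maintenance_work_mem rather than N of them.
fn per_worker_doc_budget(mwm_bytes: usize, nlaunched: usize, bytes_per_doc: usize) -> usize {
    (mwm_bytes / nlaunched.max(1)) / bytes_per_doc
}
```

With `mwm = 64 MiB`, 4 workers, and 1 KB/doc, each worker gets a 16,384-doc budget instead of the full 65,536, bounding aggregate peak memory at roughly one mwm.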
```rust
// per-worker chunking math would eventually try to merge more docs than fit in memory;
// bumping the target up means each chunk naturally stays under [`gucs::merge_doc_budget`].
let doc_budget = gucs::merge_doc_budget();
let min_segments_for_budget = (reltuples / doc_budget as f64).ceil() as usize;
```
`reltuples` is whatever `pg_class` has, and it's frequently stale or zero right after a bulk load without `ANALYZE`. The fallback path estimates from page count, which is also rough. If `reltuples` comes in much higher than reality, `min_segments_for_budget` blows up and we end up with way more segments than needed. Worth a sanity cap (maybe 10× the user's `target_segment_count`) so a bad statistics estimate doesn't shred the segment topology.
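The suggested cap is a one-liner; the 10× multiplier here is the reviewer's proposal rather than a measured value:

```rust
// Hypothetical sanity cap: never let a stale reltuples estimate push the
// budget-driven segment floor past 10x the user's requested target.
fn capped_min_segments(min_segments_for_budget: usize, target_segment_count: usize) -> usize {
    min_segments_for_budget.min(target_segment_count.saturating_mul(10))
}
```

So a target of 4 with a blown-up estimate of 1000 segments would be clamped to 40, while a plausible estimate of 17 passes through unchanged.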
```rust
        assert_eq!(candidates[0].0.len(), 2);
        assert_eq!(largest_layer_size, 1000);
    }
}
```
A regression test that builds an index over a largish table with a small `maintenance_work_mem` and asserts on the bumped segment count would both catch regressions and document the user-visible behavior change.
Ticket(s) Closed
What
Twice in prod now we've seen situations where very few segments get created because the target segment count defaults to the CPU count. Once this led to a very skewed index (where one segment contained most of the matches for a particular query), and another time it led to an OOM because merging down to a few huge segments consumes more memory.
The solution is to try to cap merges to `maintenance_work_mem`. It's not a perfect heuristic, but during `CREATE INDEX` we can set the target segment count such that merges don't exceed the memory budget, and during normal merges we can skip merging segments that exceed the budget.
Why
How
Tests