
fix: Cap memory consumption during merge #4893

Open
rebasedming wants to merge 3 commits into main from cap-merge-memory-by-mwm


Conversation

Collaborator

@rebasedming rebasedming commented Apr 25, 2026

Ticket(s) Closed

  • Closes #

What

Twice now in prod we've seen very few segments get created because the target segment count defaults to the CPU count. Once this produced a badly skewed index (one segment contained most of the matches for a particular query), and another time it caused an OOM, because merging down to a few huge segments consumes more memory.

The solution is to cap merges at `maintenance_work_mem`. It's not a perfect heuristic, but during CREATE INDEX we can set the target segment count so that merges don't exceed the memory budget, and during normal merges we can skip merging segments that exceed the budget.
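Roughly, with the hardcoded estimate quoted in the review below (function name illustrative, not the PR's actual API):

```rust
/// Rough per-doc merge cost; 1 KB/doc, per the constant reviewed below.
const MERGE_BYTES_PER_DOC: usize = 1024;

/// How many docs a single merge may touch before it risks exceeding
/// `maintenance_work_mem`.
fn merge_doc_budget(maintenance_work_mem_bytes: usize) -> usize {
    maintenance_work_mem_bytes / MERGE_BYTES_PER_DOC
}
```

For example, the default 64 MB of `maintenance_work_mem` yields a budget of 65,536 docs per merge, so a 10M-row table floors the target segment count at `ceil(10_000_000 / 65_536) = 153`.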

Why

How

Tests

Derive a per-merge doc budget from `maintenance_work_mem` (using a
hardcoded `MERGE_BYTES_PER_DOC = 1024` estimate) and enforce it in two
places:

- `LayeredMergePolicy` splits a layer's candidates whenever the running
  doc count would exceed the budget, and skips segments that exceed the
  budget on their own rather than emitting degenerate single-segment
  candidates (see the sketch after this list).
- `WorkerBuildState::try_merge` trims its chunk size to fit the budget
  before draining, guaranteeing at least 2 segments per merge so we keep
  making progress.

Adds three new unit tests covering the split, oversized-segment, and
loose-budget cases.
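
A minimal sketch of that splitting rule (types and names assumed; the real `LayeredMergePolicy` tracks more state):

```rust
type SegmentId = usize; // placeholder; the real policy uses tantivy's SegmentId

/// Split a layer's segments into merge candidates that each fit the doc
/// budget; skip oversized segments instead of emitting singletons.
fn split_by_budget(segments: &[(SegmentId, usize)], doc_budget: usize) -> Vec<Vec<SegmentId>> {
    let mut candidates = Vec::new();
    let mut current: Vec<SegmentId> = Vec::new();
    let mut running = 0;
    for &(id, num_docs) in segments {
        if num_docs > doc_budget {
            continue; // can't fit in any merge under the budget
        }
        if running + num_docs > doc_budget {
            // Flush the running candidate; a 1-segment merge is a no-op,
            // so degenerate singletons are dropped rather than emitted.
            if current.len() >= 2 {
                candidates.push(std::mem::take(&mut current));
            } else {
                current.clear();
            }
            running = 0;
        }
        current.push(id);
        running += num_docs;
    }
    if current.len() >= 2 {
        candidates.push(current);
    }
    candidates
}
```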
…merge

Move the per-merge doc-budget enforcement out of `try_merge` and into
`adjusted_target_segment_count`: floor the segment count at
`ceil(reltuples / merge_doc_budget())` so the per-worker chunking math
naturally produces merges that fit in `maintenance_work_mem`.

This is cleaner than the runtime cap — there's one place that knows
about the budget instead of two, no "what if bounded chunk_size < 2"
edge case, and the final segment count matches what the planner says
upfront.

The `LayeredMergePolicy` doc-budget enforcement stays as-is; it's the
safety net for INSERT/VACUUM/background merges that don't go through
`build_parallel`.
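
Sketched against the names in the hunks below, the floor amounts to:

```rust
/// Each of the final `target_segment_count` segments comes out of one merge,
/// so never target fewer segments than the doc budget allows.
fn floored_target_segment_count(requested: usize, reltuples: f64, doc_budget: usize) -> usize {
    let min_segments_for_budget = (reltuples / doc_budget as f64).ceil() as usize;
    requested.max(min_segments_for_budget)
}
```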
@rebasedming rebasedming added the cherry-pick/0.23.x Request that this PR to `main` should get an automatic cherry-pick PR to `0.23.x` after it lands. label Apr 25, 2026

codecov Bot commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 38.46154% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.20%. Comparing base (1706dac) to head (418d8e3).
⚠️ Report is 104 commits behind head on main.

Files with missing lines                     Patch %   Lines
pg_search/src/index/merge_policy.rs           26.22%   45 Missing ⚠️
pg_search/src/postgres/build_parallel.rs      57.14%    3 Missing ⚠️

❌ Your patch check has failed because the patch coverage (38.46%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4893      +/-   ##
==========================================
- Coverage   77.26%   77.20%   -0.06%     
==========================================
  Files         194      194              
  Lines       52121    52188      +67     
==========================================
+ Hits        40269    40294      +25     
- Misses      11852    11894      +42     
Files with missing lines                     Coverage             Δ
pg_search/src/gucs.rs                         71.95% <100.00%>    (+0.38%) ⬆️
pg_search/src/postgres/merge.rs               70.53% <100.00%>    (-0.50%) ⬇️
pg_search/src/postgres/build_parallel.rs      82.74% <57.14%>     (-0.34%) ⬇️
pg_search/src/index/merge_policy.rs           53.26% <26.22%>     (-4.24%) ⬇️

... and 3 files with indirect coverage changes


@rebasedming rebasedming marked this pull request as ready for review April 25, 2026 01:05
@rebasedming rebasedming requested a review from a team as a code owner April 25, 2026 01:05
@rebasedming rebasedming requested a review from mdashti April 25, 2026 01:05
Contributor

@mdashti mdashti left a comment


LGTM if the comments are resolved as suggested.

Comment thread pg_search/src/gucs.rs
/// value is intentionally conservative: tantivy's merge keeps per-field readers
/// and posting buffers in memory, and 1 KB/doc is a rough floor that empirically
/// keeps peak RSS near `maintenance_work_mem` on text-heavy workloads.
const MERGE_BYTES_PER_DOC: usize = 1024;

Should this be a GUC? 1 KB/doc is a reasonable floor for text-heavy schemas, but fat-JSON or mixed-fast-field schemas consume way more, and numeric-only ones way less. If the estimate is wrong in prod, we'd want a knob to tune without redeploying — paradedb.merge_bytes_per_doc would fit right next to the other GUCs in this file.
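
Something like this, assuming a pgrx version where GUC registration takes C-string literals (the name and bounds here are my proposal, not code in this PR):

```rust
use pgrx::{GucContext, GucFlags, GucRegistry, GucSetting};

// Proposed knob: overrides the hardcoded per-doc estimate at runtime.
pub static MERGE_BYTES_PER_DOC: GucSetting<i32> = GucSetting::<i32>::new(1024);

pub fn register_merge_gucs() {
    GucRegistry::define_int_guc(
        c"paradedb.merge_bytes_per_doc",
        c"Estimated bytes of merge memory consumed per document.",
        c"Combined with maintenance_work_mem to derive the per-merge doc budget.",
        &MERGE_BYTES_PER_DOC,
        1,           // min: pathological but harmless
        1024 * 1024, // max: 1 MB/doc should cover any schema
        GucContext::Userset,
        GucFlags::default(),
    );
}
```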

// memory cap. The next layer (or a future merge cycle) can deal with it.
if segment_doc_count > self.doc_budget {
    continue;
}

Quick sanity check on "the next layer (or a future merge cycle) can deal with it" — doc_budget is constant across layers, so if a segment has more docs than the budget here, it'll have more docs than the budget at every layer. We're not deferring it, we're stranding it. That's fine as an OOM cap, but it does mean an oversized segment that accumulates dead docs can't be compacted via merge anymore. Vacuum becomes the only path to reclaim those docs. Worth either softening the comment or carving an exception for single-segment compactions of oversized segments with deletes.
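
The carve-out could look like this (sketch; whether the policy sees delete counts at this point is an assumption on my part):

```rust
if segment_doc_count > self.doc_budget {
    // Oversized segments are normally stranded, but a singleton "merge"
    // can still rewrite one to reclaim its deleted docs.
    if segment.num_deleted_docs() > 0 {
        candidates.push(MergeCandidate(vec![segment.id()]));
    }
    continue;
}
```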

let min_segments_for_budget = (reltuples / doc_budget as f64).ceil() as usize;
if min_segments_for_budget > target_segment_count {
    pgrx::debug1!(
        "bumping target_segment_count from {target_segment_count} to {min_segments_for_budget} to keep per-merge docs under merge_doc_budget ({doc_budget}); reltuples: {reltuples}"

The user explicitly asked for a target_segment_count in the WITH clause and we silently override it. A debug1! is invisible to almost everyone running CREATE INDEX. I'd promote this to notice! so people understand why their index has more segments than they configured — otherwise the next bug report is going to be "why does my target=4 index have 17 segments."
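
i.e., roughly:

```rust
// Surface the override where people running CREATE INDEX will see it.
pgrx::notice!(
    "bumping target_segment_count from {target_segment_count} to {min_segments_for_budget} \
     to keep per-merge docs under merge_doc_budget ({doc_budget})"
);
```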

// one shot. If the user asked for fewer segments than the doc budget allows, the
// per-worker chunking math would eventually try to merge more docs than fit in memory;
// bumping the target up means each chunk naturally stays under [`gucs::merge_doc_budget`].
let doc_budget = gucs::merge_doc_budget();

With N parallel workers, each one can run a merge pulling in merge_doc_budget docs concurrently, so peak memory across workers is roughly N × mwm. adjust_maintenance_work_mem already splits the writer budget by nlaunched, but this floor uses the full mwm. Is that intentional? My naive read is that the build-time floor should use mwm / nlaunched so parallel-worker peak memory stays near mwm — but I might be missing something about how tantivy schedules merge memory across workers.
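
If the N× reading is right, the fix could be as small as (sketch, assuming `nlaunched` is in scope at this call site):

```rust
// Split the doc budget across workers so N concurrent merges together
// stay near maintenance_work_mem instead of N times it.
let doc_budget = gucs::merge_doc_budget() / nlaunched.max(1);
```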

// per-worker chunking math would eventually try to merge more docs than fit in memory;
// bumping the target up means each chunk naturally stays under [`gucs::merge_doc_budget`].
let doc_budget = gucs::merge_doc_budget();
let min_segments_for_budget = (reltuples / doc_budget as f64).ceil() as usize;

reltuples is whatever pg_class has, and it's frequently stale or zero right after a bulk load without ANALYZE. The fallback path estimates from page count which is also rough. If reltuples comes in much higher than reality, min_segments_for_budget blows up and we end up with way more segments than needed. Worth a sanity cap — maybe 10× the user's target_segment_count — so a bad statistics estimate doesn't shred the segment topology.
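
i.e., something like:

```rust
// Keep a bad reltuples estimate from exploding the segment count past
// 10x what the user asked for.
let min_segments_for_budget = ((reltuples / doc_budget as f64).ceil() as usize)
    .min(target_segment_count.saturating_mul(10));
```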

        assert_eq!(candidates[0].0.len(), 2);
        assert_eq!(largest_layer_size, 1000);
    }
}

A regression test that builds an index over a largish table with a small maintenance_work_mem and asserts on the bumped segment count would both catch regressions and document the user-visible behavior change.
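
A hedged sketch of what that could look like as a `#[pg_test]` (the `paradedb.index_info` helper and the bm25 option names are assumptions from pg_search's conventions; swap in the real helpers):

```rust
use pgrx::prelude::*;

#[pg_test]
fn doc_budget_overrides_target_segment_count() {
    // 100k docs at the 1 KB/doc estimate against a 1 MB budget should
    // force far more than the 4 requested segments (~98).
    Spi::run("SET maintenance_work_mem = '1MB'").unwrap();
    Spi::run(
        "CREATE TABLE big AS SELECT i AS id, md5(i::text) AS body \
         FROM generate_series(1, 100000) i",
    )
    .unwrap();
    Spi::run(
        "CREATE INDEX big_idx ON big USING bm25 (id, body) \
         WITH (key_field = 'id', target_segment_count = 4)",
    )
    .unwrap();

    let segments = Spi::get_one::<i64>("SELECT count(*) FROM paradedb.index_info('big_idx')")
        .unwrap()
        .expect("segment count");
    assert!(segments > 4, "doc-budget floor should bump the segment count");
}
```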


Labels

cherry-pick/0.23.x Request that this PR to `main` should get an automatic cherry-pick PR to `0.23.x` after it lands.
