
Range filter for batches #5937

Open
gz wants to merge 2 commits into main from min-max-batch

Conversation

@gz
Contributor

@gz gz commented Mar 27, 2026

Describe Manual Test Plan

Tested manually with a few pipelines.

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Describe Incompatible Changes

No incompatible changes.

This is a very cheap filter that can speed up
ingest significantly.

The idea is to track the min/max for every batch.
For seek_key_exact, if the value we're seeking
is not inside the range, we skip the batch.
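A minimal sketch of the idea. The names `KeyRange`, `from_refs`, and `extend_to` mirror identifiers visible in the diff hunks of this PR, but the bodies here are illustrative assumptions, not the actual implementation:

```rust
// Illustrative sketch only; bodies are assumptions, not the PR's code.
#[derive(Debug, Clone, PartialEq)]
pub struct KeyRange<K: Ord + Clone> {
    min: K,
    max: K,
}

impl<K: Ord + Clone> KeyRange<K> {
    pub fn from_refs(min: &K, max: &K) -> Self {
        Self { min: min.clone(), max: max.clone() }
    }

    /// Extends the upper bound; assumes keys arrive in sorted order.
    pub fn extend_to(&mut self, max: &K) {
        debug_assert!(*max >= self.max, "keys must be pushed in order");
        if *max > self.max {
            self.max = max.clone();
        }
    }

    /// The cheap pre-check for seek_key_exact: a key outside
    /// [min, max] cannot be in the batch, so the batch is skipped.
    pub fn maybe_contains(&self, key: &K) -> bool {
        self.min <= *key && *key <= self.max
    }
}
```

The point is that `maybe_contains` is just two comparisons, which is why it can run before (or instead of) the bloom filter probe.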

Some ingest heavy benchmarks:

just bloom filter:

╭────────────────────────┬───────────┬──────────┬───────╮
│ Metric                 │ Value     │ Lower    │ Upper │
├────────────────────────┼───────────┼──────────┼───────┤
│ Throughput (records/s) │ 2695496   │ -        │ -     │
│ Memory                 │ 14.28 GiB │ 1.78 GiB │ -     │
│ Storage                │ 79.45 GiB │ 110 B    │ -     │
│ Uptime [ms]            │ 302742    │ -        │ -     │
│ State Amplification    │ 0.43      │ -        │ -     │
╰────────────────────────┴───────────┴──────────┴───────╯

range+bloom filter:

╭────────────────────────┬────────────┬──────────┬───────╮
│ Metric                 │ Value      │ Lower    │ Upper │
├────────────────────────┼────────────┼──────────┼───────┤
│ Throughput (records/s) │ 4035088    │ -        │ -     │
│ Memory                 │ 23.4 GiB   │ 2.42 GiB │ -     │
│ Storage                │ 112.51 GiB │ 110 B    │ -     │
│ Uptime [ms]            │ 303292     │ -        │ -     │
│ State Amplification    │ 0.41       │ -        │ -     │
╰────────────────────────┴────────────┴──────────┴───────╯

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@gz gz requested review from blp and ryzhyk March 27, 2026 03:29
@gz gz changed the title Min max batch Range filter for batches Mar 27, 2026
let mut max = key_factory.default_box();

match root.read::<K, A>(&self.file)? {
TreeBlock::Data(data_block) => unsafe {
Contributor

I think this should be a trait

Contributor Author

more importantly needs some eyes from @blp to tell me if this is doing the right thing

Member

@mihaibudiu What do you think should be a trait?

Contributor

Getting the range should be a trait.

Member

It could be a method on TreeBlock but I don't see the value of defining a trait.

)
.unwrap_storage(),
weight: factories.weight_factory().default_box(),
key_range: None,
Contributor

What does None mean? Empty or unknown?

Contributor Author

Both; I'm not sure if it can ever happen, but the APIs are written in a way that makes it possible.

}

/// Extends the upper bound when keys arrive in sorted order.
pub(crate) fn extend_to(&mut self, max: &K) {
Contributor

I hope this is called only for the last value in a batch

}

fn push_key(&mut self, key: &K) {
if let Some(range) = &mut self.key_range {
Contributor

So it is called for every element added?
How do you know that this is larger than the min?
Could you have a debug_assert for that?

Contributor Author

There is an assert in extend_to; maybe it can be done in a less naive way. I'll have a look.

Member

Writer1::write0() asserts that the keys that it's called with are in order.

fn seek_key_exact(&mut self, key: &K, hash: Option<u64>) -> bool {
let hash = hash.unwrap_or_else(|| key.default_hash());
if !self.wset.maybe_contains_key(hash) {
if !self.wset.filters.maybe_contains_key(key, hash) {
Contributor

Is the default_hash() still computed someplace?

Contributor Author

Yes, a few files down.

@mythical-fred mythical-fred left a comment

Two blockers: (1) two unsafe blocks in reader.rs missing // SAFETY: comments (see inline), and (2) docs.feldera.com/docs/operations/metrics.md should be updated with entries for the four new range filter metrics (range_filter_size_bytes, range_filter_hits_count, range_filter_misses_count, range_filter_hit_rate_percent). Looking at past PRs (e.g., the transaction metrics PR), the pattern is a manual edit to metrics.md alongside the code change.

let mut max = key_factory.default_box();

match root.read::<K, A>(&self.file)? {
TreeBlock::Data(data_block) => unsafe {

Missing // SAFETY: comment. key_range() is a safe function, so each unsafe {} block inside it needs to document the invariant. For this DataBlock arm: why is calling data_block.key(0, ...) and data_block.key(n_values()-1, ...) sound? (Bounds validity, factory type match.)


Fair point from @blp — if the convention in this file is to propagate unsafe up to the public function signature (rather than wrapping it in a safe function with // SAFETY:), making key_range() unsafe fn with a /// # Safety doc comment works equally well. Either approach satisfies the requirement; I just want the safety contract stated somewhere.
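To make the two conventions concrete, here is a toy illustration. This is not the PR's reader: `TreeBlock` and the rkyv deserialization are elided, with `get_unchecked` standing in for the unsafe call, so only the shape of each convention is shown:

```rust
struct Block {
    keys: Vec<u64>,
}

impl Block {
    // Convention 1: safe wrapper, with the invariant documented
    // at the unsafe block itself.
    fn key_range_safe(&self) -> Option<(u64, u64)> {
        if self.keys.is_empty() {
            return None;
        }
        // SAFETY: `keys` is non-empty (checked above), so index 0
        // and `len() - 1` are both in bounds.
        unsafe {
            Some((
                *self.keys.get_unchecked(0),
                *self.keys.get_unchecked(self.keys.len() - 1),
            ))
        }
    }

    /// Convention 2: the unsafety is propagated to the signature.
    ///
    /// # Safety
    ///
    /// `self.keys` must be non-empty.
    unsafe fn key_range_unchecked(&self) -> (u64, u64) {
        (
            *self.keys.get_unchecked(0),
            *self.keys.get_unchecked(self.keys.len() - 1),
        )
    }
}
```

Either way, the safety contract ends up written down; the only difference is whether the caller or the function body is responsible for upholding it.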

data_block.key(&factories, 0, min.as_mut());
data_block.key(&factories, data_block.n_values() - 1, max.as_mut());
},
TreeBlock::Index(index_block) => unsafe {

Missing // SAFETY: comment. For the IndexBlock arm: why is index_block.get_bound(0, ...) and get_bound(n_children()*2-1, ...) sound? (Bounds validity, factory type match.)

@gz gz force-pushed the min-max-batch branch from 5a4e8d5 to 27cc389 Compare March 27, 2026 05:22
We add stats for the range filter. This led to some refactoring:

Since we now have two filters (with another one on the way)
we consolidate the stats into a single struct that can be
re-used across filters.

It also revealed a performance issue with the current filter
stats: because this function is extremely hot, adding the
hit and miss atomics in CachePadded led to a 25% increase
for the ingest benchmark.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@gz gz force-pushed the min-max-batch branch from 27cc389 to e987d2e Compare March 27, 2026 05:53
@gz
Contributor Author

gz commented Mar 27, 2026

Some benchmarking revealed that this optimization alone does not help much for e.g. delta lake connectors, because keys get ingested mostly at random (maybe some z-ordering or liquid clustering would help, but who knows).

It helps for connectors that ingest in semi-linear order (e.g., datagen -- maybe postgres/kafka as well).
If the bloom filter can be skipped almost 100% of the time, it's ~30% faster -- this is promising because having the min-max allows us to use bitmaps, which can also save a lot of time compared to bloom filters during lookups.


///
/// The bounds are loaded from the root node when first requested and can
/// then be cached by higher-level batch types.
pub fn key_range(&self) -> Result<Option<(Box<K>, Box<K>)>, Error> {
Member

I think we've been making the public functions in the reader unsafe if they are unsafe internally, because these functions don't mask the unsafety; they are as unsafe as the functions they call. (The unsafety is because rkyv deserialization is unsafe.)

}

#[test]
fn one_column_key_range() {
Member

It's good to have a test.

I think it would be better to add to the existing tests, too. The new addition would be analogous to test_bloom(), which is also a check that only applies to the first column in a file. I'd expect that it would use expected0 to get the expected first and last key and then call key_range and verify that the results are the same.

Comment on lines +901 to +905
if let Some(range) = &mut self.key_range {
range.extend_to(key);
} else {
self.key_range = Some(KeyRange::from_refs(key, key));
}
Member

This clones every key we write, which will be expensive for large keys. That's easy but not necessary; we have at least two ways to avoid it:

  • The Writer could recover the key range from what it wrote, which is still in memory in Writer1::close and Writer2::close, since it writes the top-level index or data block as the last thing it does there, and then return it from Writer1::close and Writer2::close along with the bloom filter, or from Writer1::into_reader or Writer2::into_reader.
  • Read it back in Reader::new() since it's probably still in the cache (we just wrote it).

}

fn push_key(&mut self, key: &K) {
if let Some(range) = &mut self.key_range {
Member

Writer1::write0() asserts that the keys that it's called with are in order.

Comment on lines +21 to +36
impl Add for FilterStats {
type Output = Self;

fn add(mut self, rhs: Self) -> Self::Output {
self.add_assign(rhs);
self
}
}

impl AddAssign for FilterStats {
fn add_assign(&mut self, rhs: Self) {
self.size_byte += rhs.size_byte;
self.hits += rhs.hits;
self.misses += rhs.misses;
}
}
Member

I noticed the other day that we use derive_more. I think we could just write #[derive(Add, Sum)] to get these. It's a matter of taste whether you like that though.

(I just noticed this, I think we could use this elsewhere and don't.)

Comment on lines +48 to +49
hits: CachePadded<AtomicUsize>,
misses: CachePadded<AtomicUsize>,
Member

The padding makes this structure big, probably 384 bytes?

Do you expect hits and misses to be accessed by different CPUs? If not, they could go in the same CachePadded.
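A sketch of the suggested layout: both counters share a single padded cell instead of each getting its own. This uses `#[repr(align(128))]` as a dependency-free stand-in for one `CachePadded` wrapping both atomics; the alignment value is an assumption about the target's cache-line granularity:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Both counters live in one cache-line-aligned cell. If hits and
// misses are updated by the same CPU, this avoids paying for two
// padded cache lines per filter.
#[repr(align(128))]
#[derive(Default)]
struct HitMissCounters {
    hits: AtomicUsize,
    misses: AtomicUsize,
}

impl HitMissCounters {
    fn record(&self, hit: bool) {
        if hit {
            self.hits.fetch_add(1, Ordering::Relaxed);
        } else {
            self.misses.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```

With this layout the whole struct occupies a single aligned block (128 bytes here), versus one padded block per counter when each atomic is wrapped separately.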
