
storage: re-use FBufs across allocation/deallocation #5843

Merged
gz merged 2 commits into main from slabs
Mar 21, 2026
Conversation

@gz
Contributor

@gz gz commented Mar 17, 2026

We used to get FBufs from (je)malloc and give them back with free after we were done using them.

This isn't ideal for two reasons:

a) Every time we request a buffer, we zero it
b) We see that jemalloc spends a significant amount of time doing madvise calls

While a) could be solved with some MaybeUninit tricks, b) is harder because it's jemalloc being opportunistic in informing the OS about memory it no longer needs.
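For (a), the MaybeUninit trick could look roughly like this (an illustrative sketch only, not code from this PR; `zeroed_buf` and `uninit_buf` are made-up names):

```rust
use std::mem::MaybeUninit;

// Hypothetical illustration of point (a): a zeroed buffer pays for a
// memset on every request, while an uninitialized one does not.
fn zeroed_buf(n: usize) -> Vec<u8> {
    vec![0u8; n] // writes n zero bytes before anyone uses the buffer
}

fn uninit_buf(n: usize) -> Box<[MaybeUninit<u8>]> {
    // Allocates n bytes of capacity without touching them
    // (Box::new_uninit_slice is stable on recent Rust).
    Box::new_uninit_slice(n)
}
```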

We used to tune this with MALLOC_CONF, enabling background threads for jemalloc, but it still seems like a waste to spend so many cycles on buffers we will re-use anyway.
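That kind of jemalloc tuning is typically done via the MALLOC_CONF environment variable; the values and binary name below are illustrative, not what we actually ship:

```shell
# Hypothetical example: offload purging (madvise) to jemalloc's background
# threads and delay the return of dirty pages to the OS.
MALLOC_CONF="background_thread:true,dirty_decay_ms:10000" ./pipeline
```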

This change adds a size-class based slab pool for FBufs, which hopefully significantly reduces malloc pressure. It also has a nice side effect: it saves the CPU cycles previously wasted zeroing buffers again and again when a buffer can be re-used.

Most of the added code is for stats. The slab logic itself is quite small.
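A size-class slab pool of this shape can be sketched as follows (a rough sketch only, assuming FBuf behaves like a `Vec<u8>`; `SlabPool` and the constants are made up for illustration, not this PR's actual types):

```rust
use std::sync::Mutex;

// Illustrative size classes and per-class cache bound (not the PR's values).
const SIZE_CLASSES: [usize; 4] = [512, 1024, 2048, 4096];
const MAX_CACHED_PER_CLASS: usize = 64;

pub struct SlabPool {
    // One free list per size class, each guarded by its own mutex.
    free_lists: Vec<Mutex<Vec<Vec<u8>>>>,
}

impl SlabPool {
    pub fn new() -> Self {
        SlabPool {
            free_lists: SIZE_CLASSES.iter().map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    // Map a requested capacity to the smallest size class that fits.
    fn class_of(capacity: usize) -> Option<usize> {
        SIZE_CLASSES.iter().position(|&c| capacity <= c)
    }

    // Hand out a cached buffer if one is available; otherwise fall back
    // to a fresh allocation from the global allocator.
    pub fn alloc(&self, capacity: usize) -> Vec<u8> {
        if let Some(idx) = Self::class_of(capacity) {
            if let Some(buf) = self.free_lists[idx].lock().unwrap().pop() {
                return buf; // re-used: no malloc, no re-zeroing
            }
            return Vec::with_capacity(SIZE_CLASSES[idx]);
        }
        Vec::with_capacity(capacity) // oversized request bypasses the pool
    }

    // Return a buffer to its size class, or let the allocator free it
    // when the class cache is already full or the size is odd.
    pub fn dealloc(&self, mut buf: Vec<u8>) {
        if let Some(idx) = SIZE_CLASSES.iter().position(|&c| buf.capacity() == c) {
            let mut list = self.free_lists[idx].lock().unwrap();
            if list.len() < MAX_CACHED_PER_CLASS {
                buf.clear(); // keep the allocation, drop the contents
                list.push(buf);
                return;
            }
        }
        // Falling through here drops buf, i.e. a real free.
    }
}
```

In terms of the metrics discussed below, a cache hit in `alloc` would roughly correspond to a saved malloc, a successful `dealloc` to a saved free, and the fall-through in `dealloc` to a free fallback.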

Describe Manual Test Plan

Ran a few pipelines and looked at the new metrics; will run some more.

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

No breaking changes.

Describe Incompatible Changes

No incompatible changes.

@gz
Contributor Author

gz commented Mar 17, 2026

Some metrics below; I will try to tune them. I think we can just maintain a percentage of $things-served-by-slab as a single metric per class.

[("pipeline_name", "pipeline"), ("size_class_bytes", "1024")]
storage_fbuf_slab_size_class_alloc_requests_total:         9715.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:        11637.00
storage_fbuf_slab_size_class_mallocs_saved_total:         9689.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "16384")]
storage_fbuf_slab_size_class_alloc_requests_total:       801012.00
storage_fbuf_slab_size_class_free_fallbacks_total:        22801.00
storage_fbuf_slab_size_class_frees_saved_total:       704663.00
storage_fbuf_slab_size_class_mallocs_saved_total:       703974.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "2048")]
storage_fbuf_slab_size_class_alloc_requests_total:        10986.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:        11982.00
storage_fbuf_slab_size_class_mallocs_saved_total:        10849.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "32768")]
storage_fbuf_slab_size_class_alloc_requests_total:           49.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:            2.00
storage_fbuf_slab_size_class_mallocs_saved_total:            1.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "4096")]
storage_fbuf_slab_size_class_alloc_requests_total:     37303172.00
storage_fbuf_slab_size_class_free_fallbacks_total:      3445829.00
storage_fbuf_slab_size_class_frees_saved_total:     37272596.00
storage_fbuf_slab_size_class_mallocs_saved_total:     37268507.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "512")]
storage_fbuf_slab_size_class_alloc_requests_total:         9398.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:        11077.00
storage_fbuf_slab_size_class_mallocs_saved_total:         9379.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "65536")]
storage_fbuf_slab_size_class_alloc_requests_total:           48.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:            0.00
storage_fbuf_slab_size_class_mallocs_saved_total:            0.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "8192")]
storage_fbuf_slab_size_class_alloc_requests_total:     70890244.00
storage_fbuf_slab_size_class_free_fallbacks_total:      3773457.00
storage_fbuf_slab_size_class_frees_saved_total:     63522642.00
storage_fbuf_slab_size_class_mallocs_saved_total:     63522136.00

@lalithsuresh
Contributor

@gz is there a measurable impact in terms of fewer allocations, CPU usage, or pipeline throughput?

@gz gz requested review from blp and ryzhyk March 17, 2026 15:47
@blp
Member

blp commented Mar 17, 2026

This is seemingly a lot of mechanism. I hope there is commensurate benefit.

My first thought on how to do this is to make the cache thread-local with std::thread_local!. Then every thread would have its own cache and there would be no need to have anything special in the runtime. (Stats maybe would be harder.) That could be simpler than what is here.
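The thread-local idea could look roughly like this (a sketch, not the PR's code; the function names and the single 4 KiB class are illustrative):

```rust
use std::cell::RefCell;

thread_local! {
    // Each thread keeps its own small free list of 4 KiB buffers;
    // no shared lock, and nothing special in the runtime.
    static LOCAL_CACHE: RefCell<Vec<Vec<u8>>> = RefCell::new(Vec::new());
}

fn alloc_4k() -> Vec<u8> {
    LOCAL_CACHE
        .with(|c| c.borrow_mut().pop())
        .unwrap_or_else(|| Vec::with_capacity(4096))
}

fn free_4k(mut buf: Vec<u8>) {
    LOCAL_CACHE.with(|c| {
        let mut cache = c.borrow_mut();
        if cache.len() < 64 {
            buf.clear();
            cache.push(buf); // cached for this thread's next alloc_4k
        } // else: buf is dropped and goes back to the real allocator
    });
}
```

One caveat of the per-thread layout is that buffers freed on a different thread than they were allocated on stay in the freeing thread's cache, which can skew re-use for producer/consumer patterns.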

@gz
Contributor Author

gz commented Mar 17, 2026

This is seemingly a lot of mechanism. I hope there is commensurate benefit.

We see jemalloc using a significant amount of CPU cycles on madvise.
We can configure jemalloc to use background threads, at which point it offloads this work and things get faster (in the workload where this problem is prominent we're seeing a reduction of ~1h in backfill time).
However, it then uses 4 threads that do nothing else but call madvise.

All of this seems stupid and really jemalloc's problem, so the goal here is to alleviate malloc pressure. I don't expect a performance gain from this compared to jemalloc with background threads, but I expect us not to waste 4 cores doing nothing but madvise.

My first thought on how to do this is to make the cache thread-local with std::thread_local

It might be a better strategy to be per-thread; I'll experiment and see whether it's better or worse in terms of avoiding mallocs/frees. I'm not sure how to add metrics and use dev-tweaks to override defaults without having the Runtime be aware of the slabs, so I will probably need to keep that part.

@blp
Member

blp commented Mar 17, 2026

This is seemingly a lot of mechanism. I hope there is commensurate benefit.

we see jemalloc using a significant amount of CPU cycles using madvise. we can configure jemalloc to use background threads at which points it offloads this work and things get faster (in the workload where this problem is prominent we're seeing reduction of ~1h backfill time) however, it then uses 4 threads that do nothing else but call madvise

There's a (disputed) claim on hackernews that jemalloc wastes a lot of cpu time with madvise because the kernel wastes cpu time zeroing those pages before they circulate right back to the process: https://news.ycombinator.com/item?id=47404107. I don't know whether that helps at all.

Member

@blp blp left a comment


This is sane. @gz, I'll just assume that you'll only merge it if it ultimately makes sense as a balance between performance and understandable code.

@gz
Contributor Author

gz commented Mar 17, 2026

There's a (disputed) claim on hackernews that jemalloc wastes a lot of cpu time with madvise because the kernel wastes cpu time zeroing those pages before they circulate right back to the process

I saw that too; from what we see, the fb person might be right :)

We did experiment with mimalloc, and this problem doesn't really show up there; but the downside of switching allocators is that we don't get heap profiles, which can be very valuable when something goes wrong.

@gz gz force-pushed the slabs branch 3 times, most recently from dd9af9d to ce82929 Compare March 19, 2026 04:23
Contributor

@ryzhyk ryzhyk left a comment


Looks nice, I didn't find anything to complain about.

@gz gz added this pull request to the merge queue Mar 19, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 20, 2026
@gz gz added this pull request to the merge queue Mar 20, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 20, 2026
@gz gz enabled auto-merge March 20, 2026 23:50
@gz gz added this pull request to the merge queue Mar 20, 2026
Merged via the queue into main with commit 39aa0bf Mar 21, 2026
1 check passed
@gz gz deleted the slabs branch March 21, 2026 03:03