
storage: re-use FBufs across allocation/deallocation #5843

Merged
gz merged 2 commits into main from slabs
Mar 21, 2026
Conversation

@gz
Contributor

@gz gz commented Mar 17, 2026

We used to get FBufs from (je)malloc and give them back with free after we were done using them.

This isn't ideal for two reasons:

a) Every time we request a buffer, we zero it
b) We see that jemalloc spends a significant amount of time doing madvise calls

While a) could be solved with some MaybeUninit tricks, b) is harder because it's jemalloc being opportunistic in informing the OS about memory it no longer needs.
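For (a), the MaybeUninit trick could look roughly like this (an illustrative sketch only, not code from this PR; `zeroed_buf` and `uninit_buf` are made-up names):

```rust
use std::mem::MaybeUninit;

// Hypothetical illustration of point (a): a zeroed buffer pays for a
// memset on every request, while an uninitialized one does not.
fn zeroed_buf(n: usize) -> Vec<u8> {
    vec![0u8; n] // writes n zero bytes before anyone uses the buffer
}

fn uninit_buf(n: usize) -> Box<[MaybeUninit<u8>]> {
    // Allocates n bytes of capacity without touching them
    // (Box::new_uninit_slice is stable on recent Rust).
    Box::new_uninit_slice(n)
}
```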

We used to tune this with MALLOC_CONF, enabling background threads for jemalloc, but it still seems like a waste to spend so many cycles on buffers we will re-use anyway.
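That kind of jemalloc tuning is typically done via the MALLOC_CONF environment variable; the values and binary name below are illustrative, not what we actually ship:

```shell
# Hypothetical example: offload purging (madvise) to jemalloc's background
# threads and delay the return of dirty pages to the OS.
MALLOC_CONF="background_thread:true,dirty_decay_ms:10000" ./pipeline
```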

This change adds a size-class based slab pool for FBufs, which hopefully significantly reduces malloc pressure. It also has a nice side effect: it saves the CPU cycles previously wasted zeroing buffers again and again when a buffer can be re-used.

Most of the added code is for stats. The slab logic itself is quite small.
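A size-class slab pool of this shape can be sketched as follows (a rough sketch only, assuming FBuf behaves like a `Vec<u8>`; `SlabPool` and the constants are made up for illustration, not this PR's actual types):

```rust
use std::sync::Mutex;

// Illustrative size classes and per-class cache bound (not the PR's values).
const SIZE_CLASSES: [usize; 4] = [512, 1024, 2048, 4096];
const MAX_CACHED_PER_CLASS: usize = 64;

pub struct SlabPool {
    // One free list per size class, each guarded by its own mutex.
    free_lists: Vec<Mutex<Vec<Vec<u8>>>>,
}

impl SlabPool {
    pub fn new() -> Self {
        SlabPool {
            free_lists: SIZE_CLASSES.iter().map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    // Map a requested capacity to the smallest size class that fits.
    fn class_of(capacity: usize) -> Option<usize> {
        SIZE_CLASSES.iter().position(|&c| capacity <= c)
    }

    // Hand out a cached buffer if one is available; otherwise fall back
    // to a fresh allocation from the global allocator.
    pub fn alloc(&self, capacity: usize) -> Vec<u8> {
        if let Some(idx) = Self::class_of(capacity) {
            if let Some(buf) = self.free_lists[idx].lock().unwrap().pop() {
                return buf; // re-used: no malloc, no re-zeroing
            }
            return Vec::with_capacity(SIZE_CLASSES[idx]);
        }
        Vec::with_capacity(capacity) // oversized request bypasses the pool
    }

    // Return a buffer to its size class, or let the allocator free it
    // when the class cache is already full or the size is odd.
    pub fn dealloc(&self, mut buf: Vec<u8>) {
        if let Some(idx) = SIZE_CLASSES.iter().position(|&c| buf.capacity() == c) {
            let mut list = self.free_lists[idx].lock().unwrap();
            if list.len() < MAX_CACHED_PER_CLASS {
                buf.clear(); // keep the allocation, drop the contents
                list.push(buf);
                return;
            }
        }
        // Falling through here drops buf, i.e. a real free.
    }
}
```

In terms of the metrics discussed below, a cache hit in `alloc` would roughly correspond to a saved malloc, a successful `dealloc` to a saved free, and the fall-through in `dealloc` to a free fallback.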

Describe Manual Test Plan

Ran a few pipelines and looked at the new metrics; will run some more.

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

No breaking changes.

Describe Incompatible Changes

No incompatible changes.

@gz
Contributor Author

gz commented Mar 17, 2026

Some metrics below; I will try to tune them. I think we can just maintain a percentage of $things-served-by-slab as a single metric per class.

[("pipeline_name", "pipeline"), ("size_class_bytes", "1024")]
storage_fbuf_slab_size_class_alloc_requests_total:         9715.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:        11637.00
storage_fbuf_slab_size_class_mallocs_saved_total:         9689.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "16384")]
storage_fbuf_slab_size_class_alloc_requests_total:       801012.00
storage_fbuf_slab_size_class_free_fallbacks_total:        22801.00
storage_fbuf_slab_size_class_frees_saved_total:       704663.00
storage_fbuf_slab_size_class_mallocs_saved_total:       703974.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "2048")]
storage_fbuf_slab_size_class_alloc_requests_total:        10986.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:        11982.00
storage_fbuf_slab_size_class_mallocs_saved_total:        10849.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "32768")]
storage_fbuf_slab_size_class_alloc_requests_total:           49.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:            2.00
storage_fbuf_slab_size_class_mallocs_saved_total:            1.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "4096")]
storage_fbuf_slab_size_class_alloc_requests_total:     37303172.00
storage_fbuf_slab_size_class_free_fallbacks_total:      3445829.00
storage_fbuf_slab_size_class_frees_saved_total:     37272596.00
storage_fbuf_slab_size_class_mallocs_saved_total:     37268507.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "512")]
storage_fbuf_slab_size_class_alloc_requests_total:         9398.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:        11077.00
storage_fbuf_slab_size_class_mallocs_saved_total:         9379.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "65536")]
storage_fbuf_slab_size_class_alloc_requests_total:           48.00
storage_fbuf_slab_size_class_free_fallbacks_total:            0.00
storage_fbuf_slab_size_class_frees_saved_total:            0.00
storage_fbuf_slab_size_class_mallocs_saved_total:            0.00

[("pipeline_name", "pipeline"), ("size_class_bytes", "8192")]
storage_fbuf_slab_size_class_alloc_requests_total:     70890244.00
storage_fbuf_slab_size_class_free_fallbacks_total:      3773457.00
storage_fbuf_slab_size_class_frees_saved_total:     63522642.00
storage_fbuf_slab_size_class_mallocs_saved_total:     63522136.00

@lalithsuresh
Contributor

@gz is there a measurable impact in terms of fewer allocations, CPU usage, or pipeline throughput?

@gz gz requested review from blp and ryzhyk March 17, 2026 15:47
@blp
Member

blp commented Mar 17, 2026

This is seemingly a lot of mechanism. I hope there is commensurate benefit.

My first thought on how to do this is to make the cache thread-local with std::thread_local!. Then every thread would have its own cache and there would be no need to have anything special in the runtime. (Stats maybe would be harder.) That could be simpler than what is here.
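The thread-local idea could look roughly like this (a sketch, not the PR's code; the function names and the single 4 KiB class are illustrative):

```rust
use std::cell::RefCell;

thread_local! {
    // Each thread keeps its own small free list of 4 KiB buffers;
    // no shared lock, and nothing special in the runtime.
    static LOCAL_CACHE: RefCell<Vec<Vec<u8>>> = RefCell::new(Vec::new());
}

fn alloc_4k() -> Vec<u8> {
    LOCAL_CACHE
        .with(|c| c.borrow_mut().pop())
        .unwrap_or_else(|| Vec::with_capacity(4096))
}

fn free_4k(mut buf: Vec<u8>) {
    LOCAL_CACHE.with(|c| {
        let mut cache = c.borrow_mut();
        if cache.len() < 64 {
            buf.clear();
            cache.push(buf); // cached for this thread's next alloc_4k
        } // else: buf is dropped and goes back to the real allocator
    });
}
```

One caveat of the per-thread layout is that buffers freed on a different thread than they were allocated on stay in the freeing thread's cache, which can skew re-use for producer/consumer patterns.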

@gz
Contributor Author

gz commented Mar 17, 2026

This is seemingly a lot of mechanism. I hope there is commensurate benefit.

We see jemalloc using a significant amount of CPU cycles on madvise.
We can configure jemalloc to use background threads, at which point it offloads this work and things get faster (in the workload where this problem is prominent we're seeing a reduction of ~1h in backfill time).
However, it then uses 4 threads that do nothing else but call madvise.

All of this seems stupid and really jemalloc's problem, so the goal here is to alleviate malloc pressure. I don't expect a performance gain from this compared to jemalloc with background threads, but I expect us not to waste 4 cores doing nothing but madvise.

My first thought on how to do this is to make the cache thread-local with std::thread_local

It might be a better strategy to be per-thread; I'll experiment and see whether it's better or worse in terms of avoiding mallocs/frees. I'm not sure how to add metrics and use dev-tweaks to override defaults without having the Runtime be aware of the slabs, so I will probably need to keep that part.

@blp
Member

blp commented Mar 17, 2026

This is seemingly a lot of mechanism. I hope there is commensurate benefit.

we see jemalloc using a significant amount of CPU cycles using madvise. we can configure jemalloc to use background threads at which points it offloads this work and things get faster (in the workload where this problem is prominent we're seeing reduction of ~1h backfill time) however, it then uses 4 threads that do nothing else but call madvise

There's a (disputed) claim on hackernews that jemalloc wastes a lot of cpu time with madvise because the kernel wastes cpu time zeroing those pages before they circulate right back to the process: https://news.ycombinator.com/item?id=47404107. I don't know whether that helps at all.

Member

@blp blp left a comment


This is sane. @gz, I'll just assume that you'll only merge it if it ultimately makes sense as a balance between performance and understandable code.

@gz
Contributor Author

gz commented Mar 17, 2026

There's a (disputed) claim on hackernews that jemalloc wastes a lot of cpu time with madvise because the kernel wastes cpu time zeroing those pages before they circulate right back to the process

I saw that too; from what we see, the fb person might be right :)

We did experiment with mimalloc, and this problem doesn't really show up there; but the downside of switching allocators is that we don't get heap profiles, which can be very valuable when something goes wrong.

@gz gz force-pushed the slabs branch 3 times, most recently from dd9af9d to ce82929 Compare March 19, 2026 04:23
Contributor

@ryzhyk ryzhyk left a comment


Looks nice, I didn't find anything to complain about.

@gz gz added this pull request to the merge queue Mar 19, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 20, 2026
@gz gz added this pull request to the merge queue Mar 20, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 20, 2026
@gz gz enabled auto-merge March 20, 2026 23:50
@gz gz added this pull request to the merge queue Mar 20, 2026
Merged via the queue into main with commit 39aa0bf Mar 21, 2026
1 check passed
@gz gz deleted the slabs branch March 21, 2026 03:03