Conversation
some metrics, will try to tune them. I think we can just maintain a percentage of $things-served-by-slab as a single per-class metric.
@gz is there a measurable impact in terms of fewer allocations, CPU usage, or pipeline throughput?
This is seemingly a lot of mechanism; I hope there is commensurate benefit. My first thought on how to do this is to make the cache thread-local with
We see jemalloc using a significant number of CPU cycles in madvise. All of this seems stupid and is really jemalloc's problem, so the goal here is to alleviate malloc pressure. I don't expect a performance gain from this compared to jemalloc with background threads, but I do expect us to stop wasting 4 cores doing nothing but madvise.
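For context, the background-thread tuning mentioned here and in the description is done through jemalloc's `MALLOC_CONF` environment variable. The option names below are real jemalloc options; the specific values are illustrative examples, not the project's actual settings:

```shell
# Run madvise purging on jemalloc's background threads instead of the
# application threads, and slow down decay so dirty pages are returned
# to the OS less eagerly. Values here are examples only.
export MALLOC_CONF="background_thread:true,dirty_decay_ms:10000,muzzy_decay_ms:10000"
```

Even with this, the purge work still burns cycles somewhere, which is the motivation for pooling the buffers instead.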
It might be a better strategy to be per-thread; I'll experiment and see whether it's better or worse in terms of avoiding mallocs/frees. I'm not sure how to add metrics and use dev-tweaks to override defaults without having the Runtime be aware of the slabs, so I'll probably need to keep that part.
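The per-thread variant discussed here could look roughly like the sketch below. This is a minimal illustration, not the actual FBuf code: `alloc`, `free`, and `LOCAL_CACHE` are made-up names, `Vec<u8>` stands in for the real buffer type, and a real version would bucket buffers by size class rather than keep a single free list.

```rust
use std::cell::RefCell;

thread_local! {
    // Each thread gets its own cache, so the fast path needs no locking.
    static LOCAL_CACHE: RefCell<Vec<Vec<u8>>> = RefCell::new(Vec::new());
}

fn alloc(len: usize) -> Vec<u8> {
    LOCAL_CACHE.with(|cache| match cache.borrow_mut().pop() {
        // Reuse a cached buffer if it is big enough for this request.
        Some(mut buf) if buf.capacity() >= len => {
            buf.resize(len, 0);
            buf
        }
        // Cache empty (or popped buffer too small): fall back to the allocator.
        _ => vec![0u8; len],
    })
}

fn free(buf: Vec<u8>) {
    // Return the buffer to this thread's cache instead of freeing it.
    LOCAL_CACHE.with(|cache| cache.borrow_mut().push(buf));
}

fn main() {
    let a = alloc(64);
    free(a); // cached in this thread's free list, no call into the allocator
    let b = alloc(32); // served from the cache
    assert!(b.capacity() >= 32);
}
```

The upside is lock-free reuse; the downside, as noted, is that a shared Runtime can no longer easily observe or cap the per-thread caches for metrics and dev-tweaks.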
There's a (disputed) claim on Hacker News that jemalloc wastes a lot of CPU time with madvise because the kernel then wastes CPU time zeroing those pages before they circulate right back to the process: https://news.ycombinator.com/item?id=47404107. I don't know whether that helps at all.
I saw that too; from what we see, the fb person might be right :) We did experiment with mimalloc and this problem doesn't really show up there, but the downside of switching allocators is that we lose heap profiles, which can be very valuable when something goes wrong.
We used to get FBufs from (je)malloc and give them back with free after we were done using them. This isn't ideal for two reasons:

a) Every time we request a buffer, we zero it.
b) We see that jemalloc spends a significant amount of time doing madvise calls.

While a) could be solved with some MaybeUninit tricks, b) is harder because it's jemalloc being opportunistic in informing the OS about memory it no longer needs. We used to tune this with MALLOC_CONF and enabling background threads for jemalloc, but it still seems like a waste to spend so many cycles on buffers we will reuse anyway.

This change adds a size-class-based slab pool for FBufs, which hopefully significantly reduces malloc pressure. It also has a nice side effect: it saves the CPU cycles previously wasted zeroing buffers again and again when a buffer can be reused.

Most of the added code is for stats; the slab logic itself is quite small.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
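A size-class-based slab pool of the kind described can be sketched as follows. This is an illustrative toy, not the PR's implementation: `SlabPool`, its methods, and its constants are hypothetical names, `Vec<u8>` stands in for FBuf, and the hit/miss counters mirror the "percentage served by slab" metric discussed in the conversation.

```rust
struct SlabPool {
    classes: Vec<Vec<Vec<u8>>>, // one free list per power-of-two size class
    hits: usize,
    misses: usize,
}

impl SlabPool {
    const MIN_SHIFT: u32 = 6;      // smallest class is 64 bytes
    const NUM_CLASSES: usize = 16; // largest class is 64 B << 15 = 2 MiB
    const MAX_PER_CLASS: usize = 32;

    fn new() -> Self {
        SlabPool {
            classes: vec![Vec::new(); Self::NUM_CLASSES],
            hits: 0,
            misses: 0,
        }
    }

    // Map a requested length to a size-class index, or None if it is too large.
    fn class_for(len: usize) -> Option<usize> {
        let shift = len.max(1).next_power_of_two().trailing_zeros();
        let idx = shift.saturating_sub(Self::MIN_SHIFT) as usize;
        if idx < Self::NUM_CLASSES { Some(idx) } else { None }
    }

    /// Hand out a buffer of `len` bytes, reusing a cached one when possible.
    fn alloc(&mut self, len: usize) -> Vec<u8> {
        if let Some(idx) = Self::class_for(len) {
            if let Some(mut buf) = self.classes[idx].pop() {
                self.hits += 1;
                buf.resize(len, 0); // does not re-zero bytes the buffer already holds
                return buf;
            }
        }
        self.misses += 1;
        vec![0u8; len] // fall back to the global allocator (and pay for zeroing)
    }

    /// Return a buffer to its size class instead of freeing it.
    fn free(&mut self, buf: Vec<u8>) {
        if let Some(idx) = Self::class_for(buf.capacity()) {
            if self.classes[idx].len() < Self::MAX_PER_CLASS {
                self.classes[idx].push(buf); // cached; no malloc/free, no madvise churn
                return;
            }
        }
        // Oversized buffer or full class: drop it so the allocator reclaims it.
    }

    /// The metric from the conversation, collapsed to one number here:
    /// the fraction of requests served from the pool.
    fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 { 0.0 } else { self.hits as f64 / total as f64 }
    }
}

fn main() {
    let mut pool = SlabPool::new();
    let a = pool.alloc(100); // miss: pool starts empty
    pool.free(a);
    let b = pool.alloc(100); // hit: reuses the buffer from the 128-byte class
    pool.free(b);
    println!("slab hit rate: {:.2}", pool.hit_rate());
}
```

Capping each class (`MAX_PER_CLASS`) bounds the memory the pool can pin; buffers beyond the cap or above the largest class fall through to the regular allocator.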
ryzhyk left a comment:
Looks nice, I didn't find anything to complain about.
Describe Manual Test Plan
Ran a few pipelines and looked at the new metrics; will run some more.
Checklist
Breaking Changes?
No breaking changes.
Describe Incompatible Changes
No incompatible changes.