Conversation
|
Up to 1.18x faster (cycles) on sightglass benchmarks and big wins in terms of cache misses as well as cache accesses in general; about 1.05x faster in terms of instructions retired.
|
|
See #87 for more data on regalloc2's memory allocations. You posted this PR ~15 seconds after I opened the issue, great responsiveness! |
cfallin
left a comment
There was a problem hiding this comment.
LGTM, thanks a ton for doing this work! Just some nits below.
| repository = "https://github.com/bytecodealliance/regalloc2" | ||
|
|
||
| [dependencies] | ||
| arena-btree = { path = "../arena-btree" } |
There was a problem hiding this comment.
should refer to published crate here
| } | ||
|
|
||
| impl<'a, F: Function> Drop for Env<'a, F> { | ||
| fn drop(&mut self) { |
There was a problem hiding this comment.
I don't think we actually need this, given that we know our key and index types are Copy (u32s in fact)?
There was a problem hiding this comment.
(Perhaps in trivial cases the empty drop impl means things optimize away, but I'd be a little skeptical of LLVM's ability to do that here in concert with the use of the drain iters etc)
| } | ||
| } | ||
|
|
||
| pub fn drop(self, arena: &mut Arena<LiveRangeKey, LiveRangeIndex>) { |
There was a problem hiding this comment.
Likewise here, not needed given knowledge of key/value types?
|
Spent some time doing some more benchmarking of this PR and #92. Benchmark Methods
Overall SummaryNeither arena implementation performed better than The vec-based arena is a little slower than With the shuffling allocator -- which makes allocation slower, and therefore can The Default AllocatorSummary:
main.so vs vec-arena.soSummary:
Detailsmain.so vs bumpalo-arena.soSummary:
Detailsvec-arena.so vs bumpalo-arena.soSummary:
DetailsShuffling AllocatorSummary:
main-shuffling.so vs vec-arena-shuffling.soSummary:
Detailsmain-shuffling.so vs bumpalo-arena-shuffling.soSummary:
Detailsvec-arena-shuffling.so vs bumpalo-arena-shuffling.soSummary:
Details |
|
I dug some more into this with DHAT and I'm seeing that the lifetime of the bump chunks inside I verified this by logging free list hits (i.e. we reuse a previously allocated block) vs free list misses (i.e. we had to bump allocate a new block) in the arena free list. I got these numbers on So only about 1/20 allocations use recycled blocks. Takeaways:
|
|
For the hell of it, I experimented with changing the b6.so vs b12.soDetailsb6.so vs b3.soDetailsb6-shuffling.so vs b12-shuffling.soDetailsb6-shuffling.so vs b3-shuffling.soDetails |
|
That's too bad that the tree parameter didn't have more of an effect -- thanks a bunch for looking at this though, as at least now we know! (We've talked offline already but) I agree with your general analysis above: it seems that the use-case we have in RA2 is traversal-heavy, cache-miss-heavy on bigger workloads, but not allocator-heavy, and with little reuse. Given that, maybe it makes sense in hindsight that arena allocation would not have much impact either way here. All of that said, thank you so much for going through the exercise of actually building it and measuring it here -- this has been really valuable exploration. |
FWIW, I did this experiment and it was a wash. No real gains or slow downs. I also did an experiment where I didn't pin to one thread on one core, and allowed cranelift to use all available parallelism. My thought was that maybe there was contention on locks in the malloc implementation, and that therefore by moving to bump/arena allocation we would see speed ups in parallel benchmarks but not necessarily in serial benchmarks. This was also a wash with no real gains or slow downs (even when built with the shuffling allocator, which doesn't have thread local caches and has a single lock for the whole allocator!) So at this point, I don't think it is worth following this line of investigation any further. I think we can conclude that mallocs/frees aren't really a bottleneck for regalloc2. Based on my analysis of regalloc2's malloc usage and the lifetimes of the allocations, regalloc2 isn't thrashing the allocator, and instead just allocates some stuff, does a lot of work on that stuff, and then frees it. The allocation and freeing just doesn't register relative to the actual work. And I believe the work is cache bound, so future efforts should focus on shrinking data structures to fit more of the working set in cache at the same time, making the working set smaller (e.g. less splitting), and general data-oriented engineering of the code base. Closing this PR and the other one. |
Depends on bytecodealliance/arena-btree#1