Merged
Profiling repeated method definitions on high ancestors revealed
that the resulting invalidation is much slower when using indy and
SwitchPoint, around 4-6x slower than the simple generation-based
invalidator. Much of the overhead is in building the invalidator
list, building the SwitchPoint array, and invalidating those
SwitchPoints over and over.
This patch makes the following improvements:
* Guess at the right size for the invalidator list based on previous
invalidation events. We use last size * 1.25 to give room for a
bit of growth, because reallocation along this path was the top
allocation in the benchmark.
* Only add SwitchPointInvalidator to list if it has an actively-
used SwitchPoint. When invalidated, we do not immediately create
a new SwitchPoint, instead replacing the just-invalidated SP with
an invalid dummy SP. Only when the SP is directly requested do we
populate it with a live SP. Dummy switchpoints do not need to be
re-invalidated, so we avoid adding to the list.
* Avoid allocating zero-length SwitchPoint lists when no SP are
in use.
* Avoid allocating iterators along non-SP invalidation paths.
* Tweaks to the dummy logic to actually use the dummy whenever we
invalidate or prepare for invalidation.
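The lazy dummy-SwitchPoint idea and the list sizing heuristic above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: `LazySwitchPointInvalidator` and `HierarchyInvalidation` are hypothetical names, and only `java.lang.invoke.SwitchPoint` is a real API. The sketch assumes each class in the hierarchy owns one invalidator object.

```java
import java.lang.invoke.SwitchPoint;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-class invalidator; not JRuby's actual class.
final class LazySwitchPointInvalidator {
    // One shared SwitchPoint that is invalidated up front and used as the dummy.
    private static final SwitchPoint DUMMY = new SwitchPoint();
    static {
        SwitchPoint.invalidateAll(new SwitchPoint[] { DUMMY });
    }

    private SwitchPoint switchPoint = DUMMY;

    // A live SwitchPoint is only created when a call site actually asks for one.
    synchronized SwitchPoint getSwitchPoint() {
        if (switchPoint == DUMMY) switchPoint = new SwitchPoint();
        return switchPoint;
    }

    synchronized boolean isActive() {
        return switchPoint != DUMMY;
    }

    // Swap the dummy back in; return the live SwitchPoint, or null if none was in use.
    synchronized SwitchPoint replaceWithDummy() {
        SwitchPoint old = switchPoint;
        switchPoint = DUMMY;
        return old == DUMMY ? null : old;
    }
}

// Hypothetical driver that invalidates a whole hierarchy in one batch.
final class HierarchyInvalidation {
    private int lastSize = 0; // number of live SwitchPoints seen last time

    // Returns how many live SwitchPoints were invalidated.
    int invalidate(List<LazySwitchPointInvalidator> hierarchy) {
        List<SwitchPoint> live = null; // stays null when no SP is in use
        for (int i = 0; i < hierarchy.size(); i++) { // indexed loop, no iterator
            SwitchPoint sp = hierarchy.get(i).replaceWithDummy();
            if (sp == null) continue; // dummy: nothing to re-invalidate
            if (live == null) {
                // Size from the previous event, with 25% headroom for growth.
                live = new ArrayList<>((int) (lastSize * 1.25));
            }
            live.add(sp);
        }
        if (live == null) return 0; // no list and no array allocated
        lastSize = live.size();
        SwitchPoint.invalidateAll(live.toArray(new SwitchPoint[0]));
        return live.size();
    }
}
```

In this sketch, an invalidator that no call site has asked for costs only a field read and a comparison during invalidation, which is the case the benchmark below exercises.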
Given the following benchmark:
```ruby
require 'benchmark'

loop {
  puts Benchmark.measure {
    10000.times {
      module Kernel
        def foo1 = nil
        def foo2 = nil
        def foo3 = nil
        def foo4 = nil
        def foo5 = nil
      end
    }
  }
}
```
Performance goes from:
```
7.450000 0.140000 7.590000 ( 6.643895)
6.590000 0.060000 6.650000 ( 6.532706)
6.570000 0.070000 6.640000 ( 6.555341)
```
to:
```
1.510000 0.040000 1.550000 ( 1.165226)
1.140000 0.010000 1.150000 ( 1.074266)
1.050000 0.000000 1.050000 ( 1.034040)
```
Compared with non-indy:
```
1.840000 0.050000 1.890000 ( 1.138749)
1.220000 0.000000 1.220000 ( 1.064950)
1.020000 0.010000 1.030000 ( 1.013060)
```
Note that this optimization is most effective when few call sites
are being populated, such as early in boot when most invalidation
takes place. Heavy root invalidations at runtime while call sites
are active throughout the subhierarchy will still suffer from SP
churn and excess allocation of supporting structures. A move to
aggregate SP invalidation at the call site (gathering all parent
SPs as the call site guard) would push the cost into the initial
binds and re-binds of those sites, which can be expected to either
stabilize or eventually give up and use simple caching.
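One possible reading of "gather all parent SP as call site guard" is to chain `SwitchPoint.guardWithTest` over every ancestor's SwitchPoint when a site binds. The sketch below is only an illustration of that idea under this assumption: `AggregateGuard`, `guardAll`, and `callString` are hypothetical names and not part of this patch or of JRuby.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.SwitchPoint;

// Hypothetical helper; not part of JRuby.
final class AggregateGuard {
    // Wrap target so it is used only while every given SwitchPoint is still valid.
    // Once any one of them is invalidated, calls fall through to fallback.
    static MethodHandle guardAll(MethodHandle target, MethodHandle fallback,
                                 SwitchPoint... ancestors) {
        MethodHandle guarded = target;
        for (SwitchPoint sp : ancestors) {
            guarded = sp.guardWithTest(guarded, fallback);
        }
        return guarded;
    }

    // Convenience for the example: invoke a ()String handle without checked exceptions.
    static String callString(MethodHandle handle) {
        try {
            return (String) handle.invoke();
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        SwitchPoint parent = new SwitchPoint();
        SwitchPoint grandparent = new SwitchPoint();
        MethodHandle cached = MethodHandles.constant(String.class, "cached");
        MethodHandle rebind = MethodHandles.constant(String.class, "rebind");

        MethodHandle site = guardAll(cached, rebind, parent, grandparent);
        System.out.println(callString(site)); // cached target while all are valid

        SwitchPoint.invalidateAll(new SwitchPoint[] { grandparent });
        System.out.println(callString(site)); // fallback after any invalidation
    }
}
```

With this shape, the work of collecting SwitchPoints happens when a site binds or re-binds rather than on every method definition, which matches the trade-off described above.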
JRuby 10 enables invokedynamic for all optimization by default. Unfortunately, some of the plumbing around invokedynamic use has never seen much in the way of profiling and optimization. This can impact early boot times for applications and early execution performance and warmup.
This PR will be a pass over the key parts of indy infrastructure with a goal of "harm reduction" when boot-time and early runtime common cases hit this plumbing heavily.
Optimizations included: