Consider encoding when deduplicating strings.#5198
Merged
enebo merged 1 commit intojruby:masterfrom Sep 19, 2018
Merged
Conversation
CRuby's fstring cache, used for frozen string deduplication, uses slightly different equalitye logic than the default equality for strings. Specifically, if two strings have the same 7bit ascii bytes, but two different ascii-compatible strings, the strings are still considered to be equal. But for the fstring cache, you can register the same 7-bit string with different ascii-compatible encodings and they both live in the cache. In JRuby, we use a standard JDK collection, ConcurrentHashMap, that always uses the standard equals() method that works like normal String equality as described above. We are forced to use a wrapper, both for storage and for lookup. This patch introduces that wrapper, and also introduces a thread-local caching mechanism to reduce the cost of looking up deduplicated strings in the cache. The additional overhead of the cache is: * The wrapper object and indirecting through it. * Constructing a wrapper object (only when the previous lookup added a new wrapper or this is the first lookup). * Accessing a previously cached wrapper via a thread-local (inverse of the above conditions) In the typical case, where the requested string has already been deduplicated, the system should eventually get to a point where there's no new entries being added, and the cached wrapper is used every time. There may be more overhead at startup to create the wrappers. There may be few calls to lookup a string that do not trigger a new entry, since most language constructs (FrozenString in IR for example) save the result, making the cache perhaps unnecessary.
Member
|
LGTM |
Member
just a random thought - how many single byte ascii-compatible encodings are there to manage? (a few) |
kares
approved these changes
May 26, 2018
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #5190.
CRuby's fstring cache, used for frozen string deduplication, uses
slightly different equalitye logic than the default equality for
strings. Specifically, if two strings have the same 7bit ascii
bytes, but two different ascii-compatible strings, the strings are
still considered to be equal. But for the fstring cache, you can
register the same 7-bit string with different ascii-compatible
encodings and they both live in the cache.
In JRuby, we use a standard JDK collection, ConcurrentHashMap,
that always uses the standard equals() method that works like
normal String equality as described above. We are forced to use
a wrapper, both for storage and for lookup. This patch introduces
that wrapper, and also introduces a thread-local caching mechanism
to reduce the cost of looking up deduplicated strings in the
cache.
The additional overhead of the cache is:
added a new wrapper or this is the first lookup).
(inverse of the above conditions)
In the typical case, where the requested string has already been
deduplicated, the system should eventually get to a point where
there's no new entries being added, and the cached wrapper is used
every time. There may be more overhead at startup to create the
wrappers. There may be few calls to lookup a string that do not
trigger a new entry, since most language constructs (FrozenString
in IR for example) save the result, making the cache perhaps
unnecessary.