Consider encoding when deduplicating strings.#5198

Merged
enebo merged 1 commit into jruby:master from headius:fstring_considers_encoding
Sep 19, 2018

@headius
Member

@headius headius commented May 26, 2018

Fixes #5190.

CRuby's fstring cache, used for frozen string deduplication, uses
slightly different equality logic than the default equality for
strings. Under default equality, two strings that have the same
7-bit ASCII bytes but different ASCII-compatible encodings are
still considered equal. In the fstring cache, however, you can
register the same 7-bit string with different ASCII-compatible
encodings, and both entries live in the cache.

In JRuby, we use a standard JDK collection, ConcurrentHashMap,
which always compares keys with the standard equals() method, and
that method works like normal String equality as described above.
We are therefore forced to use a wrapper, both for storage and for
lookup. This patch introduces that wrapper, and also introduces a
thread-local caching mechanism to reduce the cost of looking up
deduplicated strings in the cache.
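
As a rough illustration of the wrapper idea, here is a minimal sketch in plain Java. It is not the code from this patch: `EncodedString`, `EncodingAwareKey`, and `DedupCache` are hypothetical names, and a `Charset` stands in for a Ruby encoding. The only point is that the map key's equals() and hashCode() take the encoding into account, so the same bytes under two encodings become two entries.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a Ruby string: raw bytes plus an encoding.
final class EncodedString {
    final byte[] bytes;
    final Charset encoding;

    EncodedString(String text, Charset encoding) {
        this.bytes = text.getBytes(encoding);
        this.encoding = encoding;
    }
}

// Hypothetical map key: equal only when both the bytes and the encoding match.
final class EncodingAwareKey {
    final EncodedString string;

    EncodingAwareKey(EncodedString string) {
        this.string = string;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof EncodingAwareKey)) return false;
        EncodedString that = ((EncodingAwareKey) other).string;
        return string.encoding.equals(that.encoding)
                && Arrays.equals(string.bytes, that.bytes);
    }

    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(string.bytes) + string.encoding.hashCode();
    }
}

// Hypothetical dedup cache: the wrapper is used for both storage and lookup.
final class DedupCache {
    private final ConcurrentHashMap<EncodingAwareKey, EncodedString> cache =
            new ConcurrentHashMap<>();

    EncodedString dedup(EncodedString s) {
        EncodedString existing = cache.putIfAbsent(new EncodingAwareKey(s), s);
        return existing != null ? existing : s;
    }

    int size() {
        return cache.size();
    }
}
```

In this sketch every lookup allocates a new wrapper, which is the cost the thread-local mechanism is meant to reduce.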

The additional overhead of the cache is:

  • The wrapper object and indirecting through it.
  • Constructing a wrapper object (only when the previous lookup
    added a new wrapper or this is the first lookup).
  • Accessing a previously cached wrapper via a thread-local
    (when neither of the above conditions holds).
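
The wrapper reuse described in the list above could look roughly like the following sketch. This is an assumption about the general shape, not the patch itself: `ReusableKey` and `ThreadLocalDedup` are hypothetical names, plain Java `String` values stand in for Ruby strings, and an encoding name stands in for the encoding. A spare wrapper is kept per thread and refilled for each lookup; once a wrapper is actually stored in the map it must not be mutated again, so the thread drops it and allocates a fresh one on the next call.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical mutable key: content plus encoding name.
final class ReusableKey {
    String value;
    String encoding;

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ReusableKey)) return false;
        ReusableKey that = (ReusableKey) other;
        return Objects.equals(value, that.value)
                && Objects.equals(encoding, that.encoding);
    }

    @Override
    public int hashCode() {
        return Objects.hash(value, encoding);
    }
}

final class ThreadLocalDedup {
    private final ConcurrentHashMap<ReusableKey, String> cache = new ConcurrentHashMap<>();
    private final ThreadLocal<ReusableKey> spare = new ThreadLocal<>();
    private final AtomicInteger created = new AtomicInteger(); // for illustration only

    String dedup(String value, String encoding) {
        ReusableKey key = spare.get();
        if (key == null) {
            // First lookup on this thread, or the previous wrapper went into the map.
            key = new ReusableKey();
            created.incrementAndGet();
            spare.set(key);
        }
        key.value = value;
        key.encoding = encoding;

        String existing = cache.putIfAbsent(key, value);
        if (existing == null) {
            // The map now owns this wrapper; never mutate it again.
            spare.remove();
            return value;
        }
        // Cache hit: the wrapper was only used for lookup and can be reused.
        return existing;
    }

    int wrappersCreated() {
        return created.get();
    }

    int size() {
        return cache.size();
    }
}
```

With this shape, repeated hits on one thread allocate nothing, while each miss costs one wrapper that stays in the map.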

In the typical case, where the requested string has already been
deduplicated, the system should eventually reach a point where
no new entries are being added and the cached wrapper is used
every time. There may be more overhead at startup to create the
wrappers. There may be few calls to look up a string that do not
trigger a new entry, since most language constructs (FrozenString
in IR, for example) save the result, which may make the
thread-local wrapper cache unnecessary.

@headius headius added this to the JRuby 9.2.1.0 milestone May 26, 2018
@headius headius requested review from enebo and kares May 26, 2018 01:21
@headius headius changed the title from "Consider encoding when deduplicating strings. Fixes #5190." to "Consider encoding when deduplicating strings." May 26, 2018
@kares
Member

kares commented May 26, 2018

LGTM

@kares
Member

kares commented May 26, 2018

> But for the fstring cache, you can register the same 7-bit string with different ascii-compatible encodings and they both live in the cache.

Just a random thought: how many single-byte ASCII-compatible encodings are there to manage? (A few.)
So what about separate stores per encoding? I guess it wouldn't be as performant due to the double lookup.
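
For comparison, the separate-stores idea might be sketched as below. This is purely illustrative and not something the PR implements: `PerEncodingDedup` is a hypothetical name, and plain Java strings plus an encoding name stand in for Ruby strings and encodings. The two map accesses per call are the double lookup mentioned above.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical alternative: one inner store per encoding, no wrapper key needed.
final class PerEncodingDedup {
    private final ConcurrentHashMap<String, ConcurrentHashMap<String, String>> stores =
            new ConcurrentHashMap<>();

    String dedup(String value, String encodingName) {
        // First lookup: find (or create) the store for this encoding.
        ConcurrentHashMap<String, String> store =
                stores.computeIfAbsent(encodingName, name -> new ConcurrentHashMap<>());
        // Second lookup: find (or register) the string within that store.
        String existing = store.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    int storeCount() {
        return stores.size();
    }
}
```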

@enebo enebo merged commit e7ab40f into jruby:master Sep 19, 2018