Consistent hash code values between JVM instances #590
rdblue wants to merge 3 commits into jruby:master
Several of the core JRuby classes calculate hash codes based on Java or Ruby object ids. This doesn't produce consistent hashing across JVM instances, which is needed for distributed frameworks. For example, Hadoop uses hashCode values to distribute keys from the map phase to the same reducer task (partitioning).

This commit adds hashCode (and Ruby's hash method) implementations for RubyBoolean, RubyNil, and RubySymbol. RubyBoolean and RubyNil simply return static, randomly-generated hashCode values that are hard-coded. This replaces the default Java Object#hashCode.

For RubySymbol, the previous implementation of hashCode returned the symbol's id, which could be different depending on the order in which symbols are created. This updates it to calculate a hashCode based on the raw symbolBytes like the RubyString implementation, but with a RubySymbol-specific seed and without the encoding addition for 1.9. This value is calculated when symbols are instantiated so the performance impact should be minimal.

This commit also adds a RubyInstanceConfig setting and CLI option for consistent hashing, jruby.consistent.hashing.enabled, which controls whether the Ruby runtime's hash seeds (k0 and k1) are generated randomly. When set to true, they are set to static values. These hash seeds are used to hash RubyString objects, so this will make string hash codes consistent across JVMs.
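As a rough illustration of the symbol and seed changes described above, the sketch below hashes raw bytes with a seed that is either fixed or random. The class name `SeededByteHash`, the `FIXED_SEED` constant, and the FNV-1a-style mixing are all assumptions made for this example; JRuby's real string and symbol hashing uses its own algorithm and seed values, which are not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class SeededByteHash {
    // Hypothetical fixed seed, standing in for the static k0/k1 values.
    static final long FIXED_SEED = 0x5EEDL;

    /** Returns a static seed when consistent hashing is on, a random one otherwise. */
    public static long chooseSeed(boolean consistentHashing) {
        return consistentHashing ? FIXED_SEED : new SecureRandom().nextLong();
    }

    /** FNV-1a-style hash over raw bytes mixed with a seed (a stand-in, not JRuby's algorithm). */
    public static int hashBytes(byte[] bytes, long seed) {
        long h = 0xcbf29ce484222325L ^ seed;
        for (byte b : bytes) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        // Fold the 64-bit state down to a 32-bit hashCode.
        return (int) (h ^ (h >>> 32));
    }

    public static void main(String[] args) {
        byte[] name = "status".getBytes(StandardCharsets.UTF_8);
        long seed = chooseSeed(true);
        // With a fixed seed the value depends only on the bytes, so every JVM agrees.
        System.out.println(hashBytes(name, seed));
    }
}
```

Because the hash depends only on the bytes and the seed, it no longer matters in which order symbols were created in a given JVM.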
|
Ahh, interesting. I assume when the default hash setting is active, you're not concerned about the Hash-based DOS everyone got upset about last year... Will look into this; having default hash values for booleans, nil, and symbols seems mostly reasonable. |
|
No, I'm not concerned with a DOS since I'm using it for Hadoop. The reference to SecureRandom is what made me add the option and default it to "off". Seems like a situation where you either need one or the other. |
|
Ok, so I have a couple concerns. nil, true, and false hashcodes were explicitly changed to be random in MRI 1.9ish, so us forcing them to be specific values concerns me. It is at least a visible behavior change, and at most a deviation that could break something (that doesn't seem likely, but I don't like behavioral differences). I think the best plan here would be to also link up nil, true, and false to the option you added, so we can basically just turn on "predictable hashing" for all these types at once. Out of curiosity, why do you need nil, true, and false to have consistent hashcodes? Using them as keys seems like a bad idea. |
|
Nil sneaks in whenever a value doesn't exist--for example, when a CSV row doesn't have a column--and we try to put the value into a compound key. True and false happen less often, but sometimes you don't want to serialize a large value from mapper to reducer and all you need to do is check it for some property, so you do the check map-side and encode it as a boolean. I've seen nils in practice quite a bit, booleans less (or maybe not at all, I don't remember). It just seems like a good idea to ensure everything works consistently. I'll add hashCode fields on nil, true, and false and default them randomly based on the consistent hashing option. Why did MRI move to random values explicitly? I can understand not wanting predictable strings, but you don't use true for a hash key very often and it always collides with itself anyway. |
Per discussion on the last commit's pull request [1], updating the implementations of hashCode for RubyNil and RubyBoolean. Now the hashCode behavior for nil and booleans will only change when consistent hashing is enabled. Adds a hashCode instance variable to RubyBoolean and RubyNil that is set in the constructor to the Object#hashCode value (using System.identityHashCode) or a static value. [1]: jruby#590
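The constructor-time choice described in this commit message could look roughly like the following. This is only a sketch under assumptions: `FixedHashNil` and `STATIC_HASH` are hypothetical names and values, and the real RubyNil and RubyBoolean classes take a runtime rather than a bare boolean flag.

```java
public class FixedHashNil {
    // Hypothetical hard-coded value; the actual constants in the PR are not shown here.
    static final int STATIC_HASH = 0x2F3A91C7;

    // Decided once at construction, so hashCode() stays a plain field read.
    private final int hashCode;

    public FixedHashNil(boolean consistentHashing) {
        // Keep the default identity-based behavior unless the option is enabled.
        this.hashCode = consistentHashing
                ? STATIC_HASH
                : System.identityHashCode(this);
    }

    @Override
    public int hashCode() {
        return hashCode;
    }
}
```

With the flag off, the object behaves exactly as it would with the default Object#hashCode; with it on, every JVM reports the same value.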
|
Everything should be enabled by the consistent hashing option now. Are there any other problems with this pull request that I can fix? |
Annotated methods on RubyBoolean were not being added to the ruby class, just the static methods in RubyBoolean.True and RubyBoolean.False. Now hash is actually defined on the ruby TrueClass and FalseClass.
|
I ended up finding a bug: RubyBoolean's annotated methods were not being added because there weren't any previously. The above commit fixes the problem. |
|
Can you squash this into a single commit? We can then merge it for 1.7.4. |
|
I squashed the commits in a new branch (shouldn't have been using master) and added a new pull request: I'll close this one. Thanks! |
|
FWIW, you can just force-push your squashed branch and the PR would pick it up. But this is fine for now :-) |
|
Ah, good to know. I was on my way to work and wanted to just get something to you. Thanks! |
Hash codes for symbols, booleans, nil, and (sometimes) strings are not consistent between JVM instances. This means that JRuby objects can't be used as keys in frameworks like Hadoop, which rely on a key's hash code being the same in every JVM. Hadoop, specifically, uses object hash codes to determine which reduce task is responsible for a particular key. If the hash codes differ, then map tasks send the same key to different reduce tasks and there are duplicates in the result.
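To make the failure mode concrete, here is a small standalone sketch of hash-based partitioning. The arithmetic mirrors what Hadoop's default HashPartitioner does (mask off the sign bit, then take the remainder by the number of reduce tasks); the class name `PartitionSketch` and the sample hash values are made up for illustration.

```java
public class PartitionSketch {
    /** Maps a key's hash code to a reduce task index in [0, numReduceTasks). */
    public static int partitionFor(int keyHash, int numReduceTasks) {
        // Clearing the sign bit keeps the remainder non-negative.
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Suppose two mapper JVMs compute different hash codes for the same logical key.
        int hashInJvmA = 17;
        int hashInJvmB = 18;
        System.out.println(partitionFor(hashInJvmA, reducers)); // 1
        System.out.println(partitionFor(hashInJvmB, reducers)); // 2
    }
}
```

Here the same key would be sent to reducer 1 by one mapper and reducer 2 by another, so each reducer would emit its own partial result for that key.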
This adds or updates hashCode (and hash) implementations for booleans, nil, and symbols. It also adds an option to use fixed default hash seed values rather than random ones, so that string hashes are consistent. Some tests are included where it was obvious where to put them.
Please let me know if I did something wrong so I can fix it and send another pull request.