Issue when splitting an encoded string with specific characters #5714

@n00tmeg

Environment

$ bin/jruby -v
jruby 9.2.8.0-SNAPSHOT (2.5.3) 2019-04-23 1679826 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [darwin-x86_64]

Expected Behavior

Splitting a UTF-16LE encoded string on a null character delimiter (in the same encoding) should return the expected array of strings.
Example script (test.rb):

str1 = "AA\0BB\0CC".encode('utf-16le')
str2 = "\0".encode('utf-16le')
array = str1.split(str2)
puts array.inspect

Expected result (CRuby):

$ ruby test.rb
["AA", "BB", "CC"]

Actual Behavior

JRuby does not properly split the string:

$ jruby test.rb
["", "", "CC"]

The issue is in the indexOf() method in RubyString.java (https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/RubyString.java#L4258). This method searches a byte array for the index of a given substring (or character) without taking into account the byte width of the encoded characters, so it can report a match that straddles two characters.

In this example, the byte array related to str1 is:
byte_array => [65, 0, 65, 0, 0, 0, 66, 0, 66, 0, 0, 0, 67, 0, 67, 0]
and the delimiter character (str2) is:
delim => [0, 0]
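
The byte arrays above can be reproduced with String#bytes. This is a small sketch for checking the layout only; it does not call split, so it should print the same values on CRuby and JRuby.

```ruby
str1 = "AA\0BB\0CC".encode('utf-16le')
str2 = "\0".encode('utf-16le')

# Each character occupies two bytes in UTF-16LE, low byte first.
p str1.bytes  # => [65, 0, 65, 0, 0, 0, 66, 0, 66, 0, 0, 0, 67, 0, 67, 0]
p str2.bytes  # => [0, 0]
```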

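The first time it is called, indexOf() matches byte_array[3] and byte_array[4] (the high byte of the second "A" followed by the low byte of the null character) instead of matching byte_array[4] and byte_array[5], the two bytes of the null character itself, and returning 4.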
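
The difference between the two searches can be sketched in plain Ruby. The helpers naive_byte_index and aligned_byte_index are hypothetical names used for illustration only and are not JRuby's implementation. The aligned version assumes a fixed two-byte code unit, which holds for this example but ignores surrogate pairs.

```ruby
# Hypothetical helper: byte-wise search that ignores character boundaries.
def naive_byte_index(haystack, needle)
  (0..haystack.size - needle.size).find do |i|
    haystack[i, needle.size] == needle
  end
end

# Hypothetical helper: only considers offsets that are multiples of the
# code unit width (assumed fixed, 2 bytes for UTF-16LE in this example).
def aligned_byte_index(haystack, needle, unit_width)
  (0..haystack.size - needle.size).step(unit_width).find do |i|
    haystack[i, needle.size] == needle
  end
end

bytes = "AA\0BB\0CC".encode('utf-16le').bytes
delim = "\0".encode('utf-16le').bytes

puts naive_byte_index(bytes, delim)       # prints 3 (straddles "A" and the null character)
puts aligned_byte_index(bytes, delim, 2)  # prints 4 (the null character itself)
```

Splitting at offset 3 leaves an odd number of bytes before the delimiter, which is consistent with the broken result shown above.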
