String#byteslice can raise an inappropriate ArrayIndexOutOfBoundsException #886

@jrochkind

...or return the wrong byte. What reproduces this for me is a split on a UTF-8 encoded string that includes control characters (which are legal UTF-8). This appears to confuse jruby's internal string representation about the byte lengths and offsets of the string's internal buffer.

Again, the result can be a wrong return value from String#byteslice, and/or a nonsensical ArrayIndexOutOfBoundsException being raised. The reproducible test case below triggers the ArrayIndexOutOfBoundsException.

Encoding and byte count issues are really hard to talk about, so rather than try to explain in words, I'll explain with an annotated failing Test::Unit test case.

This works in MRI (ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-darwin12.4.0]), but in jruby (jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot(TM) 64-Bit Server VM 1.6.0_51-b11-457-11M4509 [darwin-x86_64]) it raises the really weird exception shown in the test's comments.

require 'test/unit'


# jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot(TM) 64-Bit Server VM 1.6.0_51-b11-457-11M4509 [darwin-x86_64]
class TestField < Test::Unit::TestCase

  def test_confused_bytecount
    string_with_ctrl = "hello\x1fhello".force_encoding("UTF-8")
    # control chars like \x1F ARE legal UTF-8, this is correct:
    assert string_with_ctrl.valid_encoding?

    # It's even considered ascii_only? -- this is correct, both MRI and jruby
    assert string_with_ctrl.ascii_only?


    # For reasons I can't explain, I can only reproduce the
    # problem right now by doing a split on the control char
    # (this does represent my actual use case).
    # Whether the split operand is tagged ASCII or UTF-8 does not matter;
    # the result is identical either way.
    elements = string_with_ctrl.split("\x1F".force_encoding("UTF-8"))

    # For some reason the weirdness only happens on the second element
    # of the split in this case.
    second = elements[1]


    # For a string composed of all one-byte wide ascii, as this one is...
    assert_equal "hello", second
    assert second.ascii_only?

    # ...string[0] and string.byteslice(0) should be identical. They
    # differ only when the string contains multi-byte chars.
    # Using #[], we're okay:
    assert_equal "h", second[0]

    # But on jruby, this following actually raises an exception!
    assert_equal "h", second.byteslice(0)
    # That one up there actually just raised!!!
    # Java::JavaLang::ArrayIndexOutOfBoundsException: 12
    #  org.jruby.util.ByteList.equal(ByteList.java:960)

    # In other cases I saw in my real app, it didn't raise, but
    # did return the WRONG bytes, i.e. not the 'h' expected above,
    # so this assertion failed too:
    assert_equal second[0], second.byteslice(0)
    # But in jruby we never even get here; we raise first.

    # In MRI, we pass ALL these tests with no exceptions. 
    # (ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-darwin12.4.0])
  end

end
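
As background for the test's comment that `string[0]` and `string.byteslice(0)` should agree on an all-ASCII string, here is a minimal sketch of how the two normally relate. The variable names and the "héllo" sample are illustrative only and are not part of the failing case; this shows expected Ruby 1.9.3+ semantics, not the jruby bug itself.

```ruby
# Illustrative strings (not from the failing test case).
ascii = "hello".force_encoding("UTF-8")
multi = "h\u00e9llo"  # "héllo": the e-acute occupies two bytes in UTF-8

# ASCII-only string: character index and byte offset coincide,
# so #[] and #byteslice must return the same one-character string.
ascii_char = ascii[1]            # indexes by character
ascii_byte = ascii.byteslice(1)  # indexes by byte

# Multi-byte string: #[] returns the whole character, while
# #byteslice(1) returns only the first byte of that character.
multi_char = multi[1]
multi_byte = multi.byteslice(1)

# Asking for both bytes of the character recovers it intact.
multi_both = multi.byteslice(1, 2)
```

Since the split element in the test is pure ASCII, it falls into the first case, which is why any difference between `second[0]` and `second.byteslice(0)` there looks like a bug rather than an encoding subtlety.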

(interested parties include @billdueber and @AdamJ)
