JRuby creates symbols with US-ASCII encoding but non-ASCII bytes #4828

@jmiettinen

When JRuby creates a symbol for an undefined local variable, the symbol's ByteList is tagged with US-ASCII encoding, but the bytes in it may fall outside the US-ASCII range.
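For comparison, here is a minimal sketch of how a symbol's encoding can be inspected from Ruby. The variable names are illustrative, and the values in the comments are what MRI reports. The snippet does not go through the NameError path described in this issue, so it only shows the encoding a non-ASCII symbol would be expected to carry.

```ruby
# Illustrative names. "\u00F6\u00D6a" spells öÖa without relying on the
# source file's encoding.
ascii_sym = :abc
non_ascii_sym = "\u00F6\u00D6a".to_sym

puts ascii_sym.encoding                  # US-ASCII on MRI (ASCII-only symbol)
puts non_ascii_sym.encoding              # UTF-8 on MRI
puts non_ascii_sym.to_s.valid_encoding?  # true on MRI
```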

Environment

Reproduces at least on JRuby 1.7.27 and JRuby 9.1.12.0. Master currently appears to behave the same way.

This is internal to JRuby, so the platform is mostly irrelevant. However:

  • uname -a says Darwin jmiettinen.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
  • locale says LC_ALL="fi_FI.UTF-8"

Expected Behavior

Given this small script (named utf8_fail.rb in my example outputs):

# encoding: utf-8
begin
  öÖa
rescue => e
  puts e.message.encoding
  puts e.message
  puts /foo/ === e.message
end

I would expect to get the following output (this is from 1.9.3-p448 and 2.3.1):

UTF-8
undefined local variable or method `öÖa' for main:Object
false

Actual Behavior

However, when the same file is run with JRuby 1.7.27 / JRuby 9.1.12.0, we get problems with bytes in the created symbol öÖa:

US-ASCII
undefined local variable or method `??a' for main:Object
ArgumentError: invalid byte sequence in US-ASCII
     === at org/jruby/RubyRegexp.java:1078
  <main> at utf8_fail.rb:8

Here the error message differs: the message string is tagged as US-ASCII but contains non-ASCII bytes, so RubyRegexp raises ArgumentError when the regexp is matched against it.
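The same condition can be approximated by hand, without involving JRuby's symbol creation. The sketch below builds a string from the bytes f6 d6 61 and tags it as US-ASCII. The names `mislabeled` and `regexp_match_outcome` are made up for illustration, and the string is a stand-in for the exception message, not something taken from JRuby internals.

```ruby
# Hand-built stand-in for the message: raw bytes f6 d6 61 tagged as US-ASCII.
mislabeled = [0xF6, 0xD6, 0x61].pack("C*").force_encoding(Encoding::US_ASCII)

# Hypothetical helper: returns the match result, or a description of the
# ArgumentError if the regexp engine rejects the string.
def regexp_match_outcome(str)
  /foo/ === str
rescue ArgumentError => e
  "ArgumentError: #{e.message}"
end

puts mislabeled.encoding         # US-ASCII
puts mislabeled.valid_encoding?  # false
puts regexp_match_outcome(mislabeled)
```

`valid_encoding?` returning false is the property that makes the regexp match fail: the bytes and the encoding tag disagree.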

If we run this through hexdump (ruby utf8_fail.rb 2>&1| hexdump -C), we get

00000000  55 53 2d 41 53 43 49 49  0a 75 6e 64 65 66 69 6e  |US-ASCII.undefin|
00000010  65 64 20 6c 6f 63 61 6c  20 76 61 72 69 61 62 6c  |ed local variabl|
00000020  65 20 6f 72 20 6d 65 74  68 6f 64 20 60 f6 d6 61  |e or method `..a|
00000030  27 20 66 6f 72 20 6d 61  69 6e 3a 4f 62 6a 65 63  |' for main:Objec|
00000040  74 0a 41 72 67 75 6d 65  6e 74 45 72 72 6f 72 3a  |t.ArgumentError:|
00000050  20 69 6e 76 61 6c 69 64  20 62 79 74 65 20 73 65  | invalid byte se|
00000060  71 75 65 6e 63 65 20 69  6e 20 55 53 2d 41 53 43  |quence in US-ASC|
00000070  49 49 0a 20 20 20 20 20  3d 3d 3d 20 61 74 20 6f  |II.     === at o|
00000080  72 67 2f 6a 72 75 62 79  2f 52 75 62 79 52 65 67  |rg/jruby/RubyReg|
00000090  65 78 70 2e 6a 61 76 61  3a 31 30 37 38 0a 20 20  |exp.java:1078.  |
000000a0  3c 6d 61 69 6e 3e 20 61  74 20 75 74 66 38 5f 66  |<main> at utf8_f|
000000b0  61 69 6c 2e 72 62 3a 38  0a                       |ail.rb:8.|
000000b9

The dump shows that the code points of ö and Ö (U+00F6 and U+00D6) were copied directly into the ByteList used for the message as the single bytes f6 and d6, rather than as their UTF-8 encodings (c3 b6 and c3 96).
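To make the byte comparison concrete, this small sketch prints the UTF-8 bytes, the code points, and the ISO-8859-1 bytes of the same identifier. The `hex` lambda and the variable `name` are illustrative helpers only.

```ruby
name = "\u00F6\u00D6a"  # öÖa as a UTF-8 string

# Illustrative helper: format a list of integers as space-separated hex.
hex = ->(ints) { ints.map { |i| format("%02x", i) }.join(" ") }

puts hex.call(name.bytes)                               # c3 b6 c3 96 61
puts hex.call(name.codepoints)                          # f6 d6 61
puts hex.call(name.encode(Encoding::ISO_8859_1).bytes)  # f6 d6 61
```

The bytes in the hexdump match the last two lines, not the first. That is consistent with code points having been written out one byte each.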
