Description
When JRuby creates a symbol for an undefined local variable, the symbol's ByteList is tagged with US-ASCII encoding, but the bytes in it may not actually be within the US-ASCII range.
Environment
Reproduces on at least JRuby 1.7.27 and JRuby 9.1.12.0. Master currently seems to behave the same.
This is JRuby internal, so the platform is mostly irrelevant. However:
uname -a says Darwin jmiettinen.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
locale says LC_ALL="fi_FI.UTF-8"
Expected Behavior
Given this small script (named utf8_fail.rb in my example outputs):
# encoding: utf-8
begin
öÖa
rescue => e
puts e.message.encoding
puts e.message
puts /foo/ === e.message
end
I would expect to get the following output (this is from 1.9.3-p448 and 2.3.1):
UTF-8
undefined local variable or method `öÖa' for main:Object
false
Actual Behavior
However, when the same file is run with JRuby 1.7.27 or JRuby 9.1.12.0, the bytes in the created symbol öÖa cause problems:
US-ASCII
undefined local variable or method `??a' for main:Object
ArgumentError: invalid byte sequence in US-ASCII
=== at org/jruby/RubyRegexp.java:1078
<main> at utf8_fail.rb:8
Here the error message differs: the identifier is printed as ??a instead of öÖa. In addition, RubyRegexp notices that the string is tagged US-ASCII but contains non-ASCII bytes, and raises ArgumentError.
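The state that RubyRegexp rejects can be approximated in plain Ruby. The sketch below is not JRuby's internal code: it only builds a string from the bytes f6 d6 61 seen in this report and tags it US-ASCII. The constant BROKEN_NAME and the helper non_ascii_bytes are made-up names for illustration.

```ruby
# Hypothetical reconstruction of the broken identifier: the bytes f6 d6 61
# tagged as US-ASCII, an encoding that only permits bytes 0x00..0x7f.
BROKEN_NAME = [0xF6, 0xD6, 0x61].pack("C*").force_encoding("US-ASCII")

# Illustrative helper: returns the bytes that fall outside 7-bit ASCII.
def non_ascii_bytes(str)
  str.bytes.select { |b| b > 0x7F }
end

puts BROKEN_NAME.encoding         # US-ASCII
puts BROKEN_NAME.valid_encoding?  # false
p non_ascii_bytes(BROKEN_NAME)    # [246, 214]
```

A string in this state claims one encoding while carrying bytes that the encoding cannot represent, which is the kind of mismatch the ArgumentError above complains about.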
If we run this through hexdump (ruby utf8_fail.rb 2>&1 | hexdump -C), we get:
00000000  55 53 2d 41 53 43 49 49  0a 75 6e 64 65 66 69 6e  |US-ASCII.undefin|
00000010  65 64 20 6c 6f 63 61 6c  20 76 61 72 69 61 62 6c  |ed local variabl|
00000020  65 20 6f 72 20 6d 65 74  68 6f 64 20 60 f6 d6 61  |e or method `..a|
00000030  27 20 66 6f 72 20 6d 61  69 6e 3a 4f 62 6a 65 63  |' for main:Objec|
00000040  74 0a 41 72 67 75 6d 65  6e 74 45 72 72 6f 72 3a  |t.ArgumentError:|
00000050  20 69 6e 76 61 6c 69 64  20 62 79 74 65 20 73 65  | invalid byte se|
00000060  71 75 65 6e 63 65 20 69  6e 20 55 53 2d 41 53 43  |quence in US-ASC|
00000070  49 49 0a 20 20 20 20 20  3d 3d 3d 20 61 74 20 6f  |II.     === at o|
00000080  72 67 2f 6a 72 75 62 79  2f 52 75 62 79 52 65 67  |rg/jruby/RubyReg|
00000090  65 78 70 2e 6a 61 76 61  3a 31 30 37 38 0a 20 20  |exp.java:1078.  |
000000a0  3c 6d 61 69 6e 3e 20 61  74 20 75 74 66 38 5f 66  |<main> at utf8_f|
000000b0  61 69 6c 2e 72 62 3a 38  0a                       |ail.rb:8.|
000000b9
Here it can be seen that the code points of ö and Ö (U+00F6 and U+00D6) are copied directly as the single bytes f6 and d6 into the ByteList used in that message, instead of being encoded as UTF-8 (c3 b6 and c3 96).
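To make the difference concrete, the following sketch compares the code points of the identifier with its UTF-8 bytes. It assumes the bad bytes come from narrowing each code point to a single byte, which is what the hexdump suggests but is not confirmed against the JRuby source. The names NAME, truncated_bytes and hex are hypothetical.

```ruby
# The identifier from the report, written with escapes so the example does
# not depend on the encoding of this source file.
NAME = "\u00F6\u00D6a"

# Hypothetical helper: keep only the low 8 bits of each code point, the
# assumed behaviour that would yield the bytes seen in the hexdump.
def truncated_bytes(str)
  str.codepoints.map { |cp| cp & 0xFF }
end

# Formats a list of byte values the way hexdump prints them.
def hex(bytes)
  bytes.map { |b| format("%02x", b) }.join(" ")
end

puts hex(truncated_bytes(NAME))       # f6 d6 61
puts hex(NAME.encode("UTF-8").bytes)  # c3 b6 c3 96 61
```

The first line matches the bytes in the JRuby output, while the second is what a UTF-8 encoded message would contain.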