JRuby creates symbols with US-ASCII encoding but non-ASCII bytes

When JRuby creates a symbol for an undefined local variable, the symbol's `ByteList` is tagged with `US-ASCII` encoding, but the bytes in it may fall outside the `US-ASCII` range.

### Environment

Reproduces on at least JRuby 1.7.27 and JRuby 9.1.12.0. Master seems to [currently be the same](https://github.com/jruby/jruby/blob/6fe0e5a1a93b006ee41764e2b7c1d7fc72514a90/core/src/main/java/org/jruby/RubySymbol.java#L661).

This is internal to JRuby, so the platform is mostly irrelevant. However:
* `uname -a` says `Darwin jmiettinen.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64`
* `locale` says `LC_ALL="fi_FI.UTF-8"`

### Expected Behavior

Given this small script (named `utf8_fail.rb` in my example outputs):
```ruby
# encoding: utf-8
begin
  öÖa
rescue => e
  puts e.message.encoding
  puts e.message
  puts /foo/ === e.message
end
```

I would expect to get the following output (this is from 1.9.3-p448 and 2.3.1):
```
UTF-8
undefined local variable or method `öÖa' for main:Object
false
```

### Actual Behavior

However, when the same file is run with JRuby 1.7.27 or JRuby 9.1.12.0, the bytes of the created symbol `öÖa` are mishandled:
```
US-ASCII
undefined local variable or method `??a' for main:Object
ArgumentError: invalid byte sequence in US-ASCII
     === at org/jruby/RubyRegexp.java:1078
  <main> at utf8_fail.rb:8
```

Here the error message differs from the expected one, and `RubyRegexp` notices that the string is tagged as `US-ASCII` yet contains non-ASCII bytes, so it throws `ArgumentError`.
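
The inconsistent state can be imitated in plain Ruby, which may help when writing a regression test. This is only a sketch: the variable names are hypothetical, and building the string with `pack` and `force_encoding` is an assumption used to mimic the result, not how JRuby constructs the symbol internally.

```ruby
# Hypothetical stand-in for the symbol name as JRuby ends up storing it:
# the raw bytes f6 d6 61, tagged as US-ASCII.
broken = [0xF6, 0xD6, 0x61].pack("C*").force_encoding(Encoding::US_ASCII)

puts broken.encoding         # US-ASCII
puts broken.valid_encoding?  # false, because f6 and d6 are above 0x7f

# Matching a regexp against such a string is what fails in the report.
# The outcome is captured rather than assumed, since it may vary by runtime.
regexp_result =
  begin
    /foo/ === broken
  rescue ArgumentError => e
    "ArgumentError: #{e.message}"
  end
puts regexp_result
```

Checking `valid_encoding?` on `e.message` is a cheap way to detect the problem without triggering the regexp failure.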

If we run this through hexdump (`ruby utf8_fail.rb 2>&1 | hexdump -C`), we get:
```
00000000  55 53 2d 41 53 43 49 49  0a 75 6e 64 65 66 69 6e  |US-ASCII.undefin|
00000010  65 64 20 6c 6f 63 61 6c  20 76 61 72 69 61 62 6c  |ed local variabl|
00000020  65 20 6f 72 20 6d 65 74  68 6f 64 20 60 f6 d6 61  |e or method `..a|
00000030  27 20 66 6f 72 20 6d 61  69 6e 3a 4f 62 6a 65 63  |' for main:Objec|
00000040  74 0a 41 72 67 75 6d 65  6e 74 45 72 72 6f 72 3a  |t.ArgumentError:|
00000050  20 69 6e 76 61 6c 69 64  20 62 79 74 65 20 73 65  | invalid byte se|
00000060  71 75 65 6e 63 65 20 69  6e 20 55 53 2d 41 53 43  |quence in US-ASC|
00000070  49 49 0a 20 20 20 20 20  3d 3d 3d 20 61 74 20 6f  |II.     === at o|
00000080  72 67 2f 6a 72 75 62 79  2f 52 75 62 79 52 65 67  |rg/jruby/RubyReg|
00000090  65 78 70 2e 6a 61 76 61  3a 31 30 37 38 0a 20 20  |exp.java:1078.  |
000000a0  3c 6d 61 69 6e 3e 20 61  74 20 75 74 66 38 5f 66  |<main> at utf8_f|
000000b0  61 69 6c 2e 72 62 3a 38  0a                       |ail.rb:8.|
000000b9
```

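Here it can be seen that the code points of ö and Ö (U+00F6 and U+00D6) are copied directly as the single bytes `f6` and `d6` into the `ByteList` used in that message, instead of being stored as their UTF-8 encodings (`c3 b6` and `c3 96`).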
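
To make the byte comparison concrete, here is a small sketch that prints the UTF-8 bytes, the code points, and the ISO-8859-1 bytes of the same name. The variable names are hypothetical, and the name is written with `\u` escapes so the snippet does not depend on the source file's encoding.

```ruby
# The variable name from the reproduction script, built from escapes.
name = "\u00F6\u00D6a"

# Format a list of integers as space-separated two-digit hex.
to_hex = ->(values) { values.map { |v| format("%02x", v) }.join(" ") }

utf8_hex      = to_hex.call(name.bytes)                              # bytes a UTF-8 message should contain
codepoint_hex = to_hex.call(name.codepoints)                         # Unicode code points
latin1_hex    = to_hex.call(name.encode(Encoding::ISO_8859_1).bytes) # single-byte encoding

puts "UTF-8 bytes:      #{utf8_hex}"       # c3 b6 c3 96 61
puts "code points:      #{codepoint_hex}"  # f6 d6 61
puts "ISO-8859-1 bytes: #{latin1_hex}"     # f6 d6 61
```

The bytes in the hexdump match the code point row rather than the UTF-8 row, which is consistent with each character being narrowed to one byte.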
