Unicode characters are lost when embedding JRuby even if the calling code performs no conversion to byte[]

The following tests fail with -Dfile.encoding=windows-1252 but pass with -Dfile.encoding=UTF-8:

```
import java.io.StringWriter;
import java.io.Writer;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jruby.embed.ScriptingContainer;
import org.junit.Test;

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;

public class TestUnicodeCharacters {
    String orig = "\u6625\u304C\u6765\u305F\u3002";
    String scriptlet = "#encoding: UTF-8\n" +
                       "str = \"" + orig + "\"\n" +
                       "puts str\n" +
                       "str\n";
    Writer writer = new StringWriter();

    @Test
    public void testCharacterEncodingViaScriptEngine() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = engine.getContext();
        context.setWriter(writer);
        String result = (String) engine.eval(scriptlet, context);
        checkValues(result);
    }

    @Test
    public void testCharacterEncodingViaScriptContainer() throws Exception {
        ScriptingContainer container = new ScriptingContainer();
        container.setWriter(writer);
        String result = (String) container.runScriptlet(scriptlet);
        checkValues(result);
    }

    private void checkValues(String returnedResult) {
        assertThat(returnedResult, is(equalTo(orig)));
        assertThat(writer.toString().trim(), is(equalTo(orig)));
    }
}
```

Most likely, the failure output you get will be confusing as well:

```
java.lang.AssertionError: 
Expected: is "?????"
     but: was "?????"
```

The "Expected" line is "?????" because Java is encoding the test output as windows-1252, which cannot represent these characters.

The "but" line is "?????" because JRuby has encoded the strings to windows-1252 internally and then written and returned the question marks. I find this particularly odd, both because the script is passed as a string directly from Java in the first place and because the script itself clearly says the strings are UTF-8.
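
The character loss described above can be reproduced in plain Java, without JRuby. The sketch below only illustrates what a round trip through windows-1252 bytes does to the test string. It is not JRuby's actual code path, and `EncodingRoundTripDemo` and `roundTrip` are hypothetical names made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of JRuby or the test above.
final class EncodingRoundTripDemo {
    // Encode the text to bytes in the given charset, then decode it again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for every character the charset cannot map.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String orig = "\u6625\u304C\u6765\u305F\u3002";
        Charset cp1252 = Charset.forName("windows-1252");

        // UTF-8 can represent every character, so the round trip is lossless.
        System.out.println(orig.equals(roundTrip(orig, StandardCharsets.UTF_8)));

        // windows-1252 has no mapping for these characters, so each one
        // becomes a question mark.
        System.out.println("?????".equals(roundTrip(orig, cp1252)));
    }
}
```

Both lines print `true`. Once the characters have passed through a windows-1252 byte encoding, the original text cannot be recovered, which would explain why the returned value and the writer contents are literal question marks rather than merely being displayed as such.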

This was JRUBY-4890 on the old tracker. The script had to be updated a bit to have a `#encoding: UTF-8` directive because JRuby now complains if you omit it.
