Unicode characters are lost when embedding JRuby even if the calling code performs no conversion to byte[] #2403

@hakanai

The following tests fail with -Dfile.encoding=windows-1252 but pass with -Dfile.encoding=UTF-8:

import java.io.StringWriter;
import java.io.Writer;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jruby.embed.ScriptingContainer;
import org.junit.Test;

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;

public class TestUnicodeCharacters {
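    // Japanese text; none of these characters can be represented in windows-1252.
    String orig = "\u6625\u304C\u6765\u305F\u3002";
    // The script declares UTF-8, prints the string, and returns it as its value.
    String scriptlet = "#encoding: UTF-8\n" +
                       "str = \"" + orig + "\"\n" +
                       "puts str\n" +
                       "str\n";
    // Captures whatever the script writes via puts.
    Writer writer = new StringWriter();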

    @Test
    public void testCharacterEncodingViaScriptEngine() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = engine.getContext();
        context.setWriter(writer);
        String result = (String) engine.eval(scriptlet, context);
        checkValues(result);
    }

    @Test
    public void testCharacterEncodingViaScriptContainer() throws Exception {
        ScriptingContainer container = new ScriptingContainer();
        container.setWriter(writer);
        String result = (String) container.runScriptlet(scriptlet);
        checkValues(result);
    }

    private void checkValues(String returnedResult) {
        assertThat(returnedResult, is(equalTo(orig)));
        assertThat(writer.toString().trim(), is(equalTo(orig)));
    }
}

Most likely, the failure output you get will be confusing as well:

java.lang.AssertionError: 
Expected: is "?????"
     but: was "?????"

The "Expected" line is "?????" because Java is encoding the output as windows-1252.

The "but" line is "?????" because JRuby has encoded the strings to windows-1252 internally and then written and returned the question marks. I find it particularly odd that it would do this, both because the script is passed as a string directly from Java in the first place and because the script itself clearly says the strings are UTF-8.
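To illustrate where the question marks come from, here is a minimal sketch using plain Java with no JRuby involved. The class name LossyRoundTrip and the roundTrip helper are made up for this illustration. It only shows that pushing the test string through windows-1252 bytes destroys it while UTF-8 does not; it is not a claim about which code path inside JRuby performs the conversion.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class LossyRoundTrip {
    // Encodes text to bytes in the given charset, then decodes those bytes back.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for windows-1252.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String orig = "\u6625\u304C\u6765\u305F\u3002";
        Charset cp1252 = Charset.forName("windows-1252");

        // Every character is unmappable, so the result is five question marks.
        System.out.println(roundTrip(orig, cp1252).equals("?????"));              // true
        // UTF-8 can represent all of them, so the string survives intact.
        System.out.println(roundTrip(orig, StandardCharsets.UTF_8).equals(orig)); // true
    }
}
```

Once both the expected and the actual value have been through such a conversion, they print identically as "?????" even though the underlying Java strings differ, which is why the assertion message is so unhelpful.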

This was JRUBY-4890 on the old tracker. The script had to be updated a bit to have a #encoding: UTF-8 directive because JRuby now complains if you omit it.
