Encoding experiments with file path negotiation #6249
The first fixes here relate to #6246, specifically the failing "CJK" test that depends on treating a local file path and a remote URI path as binary and still locating the file in question. My changes treat binary or ASCII-8BIT paths provided to …
In order to be able to expand paths that are marked as binary, we need to be able to treat those paths as raw bytes. This change attempts to decode those paths as ISO-8859-1 bytes, allowing the path expansion to ignore any multibyte characters rather than improperly decoding them. The final encoding step puts the ISO-8859-1 bytes back into their 8-bit form and re-marks them with the eventually negotiated encoding. This fixes part of the WEBrick "cjk" test failure by allowing the binary-encoded multibyte path to be expanded without mangling the multibyte characters. This change is limited to the case where the incoming strings are explicitly marked as "binary" using the ASCII-8BIT encoding, since that case clearly provides no encoding hint we could use to recover characters. There may be other cases, like CR_UNKNOWN or CR_BROKEN, that also deserve this treatment, but for now the change is limited to explicitly binary strings. See jruby#6171 for the general WEBrick bug and jruby#6246 for the PR that attempts to fix the failures.
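As a rough illustration of that round trip (a minimal sketch, not JRuby's actual decodeRaw/encodeRaw implementation), the idea is to decode the binary bytes one-to-one as ISO-8859-1 characters, run expansion logic that only touches ASCII separators, and then encode the result back to the identical bytes so they can be re-tagged with the negotiated encoding:

```java
import java.nio.charset.StandardCharsets;

public class RawPathExpansionSketch {
    // Decode raw path bytes one char per byte; no byte sequence is invalid.
    static String decodeRaw(byte[] pathBytes) {
        return new String(pathBytes, StandardCharsets.ISO_8859_1);
    }

    // Re-encode the expanded result back to the exact same bytes so they can be
    // re-tagged with whatever encoding is eventually negotiated for the result.
    static byte[] encodeRaw(String expanded) {
        return expanded.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A UTF-8 path marked as binary: the "expansion" below only touches
        // ASCII separators, so the multibyte segment passes through untouched.
        byte[] binaryPath = "/tmp//テスト/file.txt".getBytes(StandardCharsets.UTF_8);
        String raw = decodeRaw(binaryPath);
        String expanded = raw.replace("//", "/"); // stand-in for real expand_path logic
        System.out.println(new String(encodeRaw(expanded), StandardCharsets.UTF_8));
        // => /tmp/テスト/file.txt
    }
}
```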
"ISO" is not descriptive for anyone unfamiliar with the ISO-8859-1 encoding, while "raw" more clearly conveys that the content will be decoded as raw single-byte character content.
Force-pushed from 5a7a4d9 to 9eef789
Our current logic for dealing with file existence is intertwined with a lot of JDK file APIs, so we don't have the option here of just using a raw-encoded string. This change attempts to use the system default encoding when the incoming path is specified as "binary" so that hopefully its characters will be decoded properly. This fixes cases where a properly-encoded multibyte path is marked as binary, as is the case in the WEBrick file-serving pipeline described in jruby#6246.
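A hedged sketch of that approach, using a hypothetical helper rather than JRuby's actual existence logic: the binary path's bytes are decoded with the platform default charset before being handed to a JDK file API.

```java
import java.io.File;
import java.nio.charset.Charset;

public class BinaryPathExistence {
    // Decode a binary-marked path's bytes with the platform default charset so
    // the resulting characters can be handed to character-based JDK file APIs.
    static boolean pathExists(byte[] binaryPathBytes) {
        String decoded = new String(binaryPathBytes, Charset.defaultCharset());
        return new File(decoded).exists();
    }

    public static void main(String[] args) {
        System.out.println(pathExists("/tmp".getBytes(Charset.defaultCharset())));
    }
}
```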
Force-pushed from 678afc3 to e6e03ae
This attempts to centralize the acquisition of a decoded path string into a single location that does all of the following:
* Extracts already-decoded paths from IO objects
* Uses get_path logic to acquire a path string from other objects
* Checks for embedded nulls
* Normalizes the path on Windows based on the current directory
* Decodes binary paths as though encoded like filesystem paths
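A hypothetical sketch of such a centralized helper (the names below are illustrative stand-ins, not JRuby's actual implementation):

```java
import java.nio.charset.Charset;

public class PathAcquisitionSketch {
    // Stand-in for an IO object that already knows its decoded open path.
    interface PathSource {
        String openPath();
    }

    static String acquirePath(Object arg, boolean onWindows) {
        String path;
        if (arg instanceof PathSource) {
            // Reuse the path that was decoded when the IO was opened.
            path = ((PathSource) arg).openPath();
        } else {
            // Stand-in for to_path/to_str coercion of arbitrary path-like objects.
            path = String.valueOf(arg);
        }
        if (path.indexOf('\0') >= 0) {
            throw new IllegalArgumentException("path name contains null byte");
        }
        if (onWindows) {
            // Crude stand-in for current-directory-aware Windows normalization.
            path = path.replace('\\', '/');
        }
        return path;
    }

    // Binary-marked paths are decoded the same way filesystem paths are,
    // i.e. with the platform default charset.
    static String decodeBinaryPath(byte[] raw) {
        return new String(raw, Charset.defaultCharset());
    }
}
```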
After force-pushing I have committed to the narrowed binary-only change and also renamed the confusing "decodeISO" logic to "decodeRaw" to more clearly indicate that it's decoding raw byte content.
The additional fixes deal with places where we use Java strings for paths and can't easily avoid doing so, because those strings feed into JDK APIs that expect properly-decoded characters. When we have such a string in hand, marked as binary, this logic will attempt to decode it using the default system encoding. This may not be correct if the string is marked as binary but actually uses a different encoding than the system default, but when dealing with file paths it seems a reasonable assumption. It will at least be correct more often than treating the path as ISO-8859-1 bytes when it may actually contain multibyte characters. This fixes a large number of APIs that previously just decoded binary strings as ISO-8859-1.
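A small illustration of the "correct more often" point, assuming a platform whose default charset is UTF-8 (an illustrative example, not code from this PR):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WhyDefaultCharset {
    public static void main(String[] args) {
        // Filename bytes as a UTF-8 filesystem would store them.
        byte[] onDisk = "テスト.txt".getBytes(StandardCharsets.UTF_8);

        String viaIso = new String(onDisk, StandardCharsets.ISO_8859_1);
        String viaDefault = new String(onDisk, Charset.defaultCharset());

        // Bytes a character-based JDK API would hand back to the OS on a UTF-8 platform:
        System.out.println(Arrays.equals(onDisk, viaIso.getBytes(StandardCharsets.UTF_8)));     // false: name mangled
        System.out.println(Arrays.equals(onDisk, viaDefault.getBytes(StandardCharsets.UTF_8))); // true when the default is UTF-8
    }
}
```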
The logic has largely been centralized in RubyFile.getAdjustedPath, which does all of the following:
* Extracts already-decoded paths from IO objects
* Uses get_path logic to acquire a path string from other objects
* Checks for embedded nulls
* Normalizes the path on Windows based on the current directory
* Decodes binary paths as though encoded like filesystem paths
This PR holds some experiments in making file path negotiation (expand_path, fnmatch) work properly with strings that are encoded as "binary" or "ASCII-8BIT".
In these cases it appears we should largely ignore the encoding and treat the strings as raw bytes, since it is not possible for us to guess the encoding. However, some of the logic in these methods uses Java strings, and therefore the bytes need to be decoded into UTF-16 characters. We will need to balance the binary cases against the properly-encoded cases as best we can, or else rewrite this logic altogether to use only the raw byte form.
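To illustrate the balance being described, here is a small contrived example (not code from this PR) contrasting the two routes: the raw ISO-8859-1 treatment round-trips any byte sequence exactly, while decoding into a Java string with a guessed charset can silently lose bytes that are invalid in that charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryVersusDecoded {
    public static void main(String[] args) {
        // A path containing bytes that are not valid UTF-8.
        byte[] oddPath = {'/', 't', 'm', 'p', '/', (byte) 0xFF, (byte) 0xFE, '.', 'l', 'o', 'g'};

        // Raw route: ISO-8859-1 maps every byte to one char, so the original
        // bytes survive the String round trip exactly.
        String raw = new String(oddPath, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(oddPath, raw.getBytes(StandardCharsets.ISO_8859_1))); // true

        // Decoded route: the invalid bytes become U+FFFD replacement characters,
        // so the original bytes cannot be recovered from the Java string.
        String decoded = new String(oddPath, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(oddPath, decoded.getBytes(StandardCharsets.UTF_8))); // false
    }
}
```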