Encoding experiments with file path negotiation #6249
The first fixes here relate to #6246, specifically the failing "CJK" test that depends on treating a local file path and a remote URI path as binary and still locating the file in question. My changes treat binary or ASCII-8BIT paths provided to …
In order to be able to expand paths that are marked as binary, we need to be able to treat those paths as raw bytes. This change attempts to decode those paths as ISO-8859-1 bytes, allowing the path expansion to ignore any multibyte characters rather than improperly decoding them. The final encoding step puts the ISO-8859-1 bytes back into their 8-bit form and re-marks them with the eventually negotiated encoding. This fixes part of the WEBrick "cjk" test failure by allowing the binary-encoded multibyte path to be expanded without mangling the multibyte characters. This change is limited to the case where the incoming strings are explicitly marked as "binary" using the ASCII-8BIT encoding, since that case clearly provides no encoding hint we could use to recover characters. There may be other cases, like CR_UNKNOWN or CR_BROKEN, that also deserve this treatment, but for now the change is limited to explicitly binary strings. See jruby#6171 for the general WEBrick bug and jruby#6246 for the PR that attempts to fix the failures.
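As a rough illustration of that round trip (a minimal sketch, not JRuby's actual decodeRaw/encodeRaw implementation), the idea is to decode the binary bytes one-to-one as ISO-8859-1 characters, run expansion logic that only touches ASCII separators, and then encode the result back to the identical bytes so they can be re-tagged with the negotiated encoding:

```java
import java.nio.charset.StandardCharsets;

public class RawPathExpansionSketch {
    // Decode raw path bytes one char per byte; no byte sequence is invalid.
    static String decodeRaw(byte[] pathBytes) {
        return new String(pathBytes, StandardCharsets.ISO_8859_1);
    }

    // Re-encode the expanded result back to the exact same bytes so they can be
    // re-tagged with whatever encoding is eventually negotiated for the result.
    static byte[] encodeRaw(String expanded) {
        return expanded.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A UTF-8 path marked as binary: the "expansion" below only touches
        // ASCII separators, so the multibyte segment passes through untouched.
        byte[] binaryPath = "/tmp//テスト/file.txt".getBytes(StandardCharsets.UTF_8);
        String raw = decodeRaw(binaryPath);
        String expanded = raw.replace("//", "/"); // stand-in for real expand_path logic
        System.out.println(new String(encodeRaw(expanded), StandardCharsets.UTF_8));
        // => /tmp/テスト/file.txt
    }
}
```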
"ISO" is not descriptive for anyone unfamiliar with the ISO-8859-1 encoding, while "raw" more clearly conveys that the content will be decoded as raw single-byte character content.
Force-pushed from 5a7a4d9 to 9eef789
Our current logic for dealing with file existence is intertwined with a lot of JDK file APIs, so we don't have the option here of just using a raw-encoded string. This change attempts to use the system default encoding when the incoming path is specified as "binary" so that hopefully its characters will be decoded properly. This fixes cases where a properly-encoded multibyte path is marked as binary, as is the case in the WEBrick file-serving pipeline described in jruby#6246.
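A hedged sketch of that approach, using a hypothetical helper rather than JRuby's actual existence logic: the binary path's bytes are decoded with the platform default charset before being handed to a JDK file API.

```java
import java.io.File;
import java.nio.charset.Charset;

public class BinaryPathExistence {
    // Decode a binary-marked path's bytes with the platform default charset so
    // the resulting characters can be handed to character-based JDK file APIs.
    static boolean pathExists(byte[] binaryPathBytes) {
        String decoded = new String(binaryPathBytes, Charset.defaultCharset());
        return new File(decoded).exists();
    }

    public static void main(String[] args) {
        System.out.println(pathExists("/tmp".getBytes(Charset.defaultCharset())));
    }
}
```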
Force-pushed from 678afc3 to e6e03ae
This attempts to centralize the acquisition of a decoded path string into a single location that does all of the following:
* Extracts already-decoded paths from IO objects
* Uses get_path logic to acquire a path string from other objects
* Checks for embedded nulls
* Normalizes the path on Windows based on the current directory
* Decodes binary paths as though encoded like filesystem paths
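A hypothetical sketch of such a centralized helper (the names below are illustrative stand-ins, not JRuby's actual implementation):

```java
import java.nio.charset.Charset;

public class PathAcquisitionSketch {
    // Stand-in for an IO object that already knows its decoded open path.
    interface PathSource {
        String openPath();
    }

    static String acquirePath(Object arg, boolean onWindows) {
        String path;
        if (arg instanceof PathSource) {
            // Reuse the path that was decoded when the IO was opened.
            path = ((PathSource) arg).openPath();
        } else {
            // Stand-in for to_path/to_str coercion of arbitrary path-like objects.
            path = String.valueOf(arg);
        }
        if (path.indexOf('\0') >= 0) {
            throw new IllegalArgumentException("path name contains null byte");
        }
        if (onWindows) {
            // Crude stand-in for current-directory-aware Windows normalization.
            path = path.replace('\\', '/');
        }
        return path;
    }

    // Binary-marked paths are decoded the same way filesystem paths are,
    // i.e. with the platform default charset.
    static String decodeBinaryPath(byte[] raw) {
        return new String(raw, Charset.defaultCharset());
    }
}
```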
After force-pushing I have committed to the narrowed binary-only change and also renamed the confusing "decodeISO" logic to "decodeRaw" to more clearly indicate that it's decoding raw byte content.
The additional fixes deal with places where we use Java strings for paths and can't easily avoid doing so, because those strings feed into JDK APIs that expect properly-decoded characters. When we have such a string in hand, marked as binary, this logic will attempt to decode it using the default system encoding. This may not be correct if the string is marked as binary but actually uses a different encoding than the system default, but when dealing with file paths it seems a reasonable assumption. It will at least be correct more often than treating the path as ISO-8859-1 bytes when it may actually contain multibyte characters. This fixes a large number of APIs that previously just decoded binary strings as ISO-8859-1.
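A small illustration of the "correct more often" point, assuming a platform whose default charset is UTF-8 (an illustrative example, not code from this PR):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WhyDefaultCharset {
    public static void main(String[] args) {
        // Filename bytes as a UTF-8 filesystem would store them.
        byte[] onDisk = "テスト.txt".getBytes(StandardCharsets.UTF_8);

        String viaIso = new String(onDisk, StandardCharsets.ISO_8859_1);
        String viaDefault = new String(onDisk, Charset.defaultCharset());

        // Bytes a character-based JDK API would hand back to the OS on a UTF-8 platform:
        System.out.println(Arrays.equals(onDisk, viaIso.getBytes(StandardCharsets.UTF_8)));     // false: name mangled
        System.out.println(Arrays.equals(onDisk, viaDefault.getBytes(StandardCharsets.UTF_8))); // true when the default is UTF-8
    }
}
```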
The logic has largely been centralized in RubyFile.getAdjustedPath, which does all of the following:
* Extracts already-decoded paths from IO objects
* Uses get_path logic to acquire a path string from other objects
* Checks for embedded nulls
* Normalizes the path on Windows based on the current directory
* Decodes binary paths as though encoded like filesystem paths
This PR holds some experiments in making file path negotiation (expand_path, fnmatch) work properly with strings that are encoded as "binary" or "ASCII-8BIT".
In these cases it appears we should largely ignore the encoding and treat the strings as raw bytes, since it is not possible for us to guess the encoding. However, some of the logic in these methods uses Java strings, and therefore the bytes need to be decoded into UTF-16 characters. We will need to balance the binary cases against the properly-encoded cases as best we can, or else rewrite this logic altogether to use only the raw byte form.
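To illustrate the balance being described, here is a small contrived example (not code from this PR) contrasting the two routes: the raw ISO-8859-1 treatment round-trips any byte sequence exactly, while decoding into a Java string with a guessed charset can silently lose bytes that are invalid in that charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryVersusDecoded {
    public static void main(String[] args) {
        // A path containing bytes that are not valid UTF-8.
        byte[] oddPath = {'/', 't', 'm', 'p', '/', (byte) 0xFF, (byte) 0xFE, '.', 'l', 'o', 'g'};

        // Raw route: ISO-8859-1 maps every byte to one char, so the original
        // bytes survive the String round trip exactly.
        String raw = new String(oddPath, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(oddPath, raw.getBytes(StandardCharsets.ISO_8859_1))); // true

        // Decoded route: the invalid bytes become U+FFFD replacement characters,
        // so the original bytes cannot be recovered from the Java string.
        String decoded = new String(oddPath, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(oddPath, decoded.getBytes(StandardCharsets.UTF_8))); // false
    }
}
```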