Skip to content

Encoding experiments with file path negotiation#6249

Merged
headius merged 4 commits intojruby:masterfrom
headius:make_expand_path_dumber
May 29, 2020
Merged

Encoding experiments with file path negotiation#6249
headius merged 4 commits intojruby:masterfrom
headius:make_expand_path_dumber

Conversation

@headius
Copy link
Member

@headius headius commented May 29, 2020

This PR holds some experiments in making file path negotiation (expand_path, fnmatch) work properly with strings that are encoded as "binary" or "ASCII-8BIT".

In these cases it appears we should largely ignore the encoding of the characters and treat it as raw bytes, since it's not possible for us to guess the encoding. However some of the logic in these methods uses Java strings, and therefore needs to be decoded into UTF-16 characters. We will need to balance the binary cases with the properly-encoded cases as best we can, or otherwise rewrite this logic altogether to use only the raw byte form.

@headius
Copy link
Member Author

headius commented May 29, 2020

The first fixes here relate to #6246 in the failing "CJK" test that depends on treating a local file path and a remote URI path as binary and still locate the file in question.

My changes treat binary or ASCII-8BIT paths provided to File.expand_path as ISO-8859-1 bytes, eliminating any chance for their multibyte characters to be improperly decoded. The first commit does this treatment to all paths passed to expand_path, and the second narrows it to just the cases where one or both incoming paths are already marked as ASCII-8BIT. This fixes part of the failing CJK test by allowing the expand_path negotiation to proceed without mangling.

headius added 2 commits May 29, 2020 15:58
In order to be able to expand paths that are marked as binary, we
need to be able to treat those paths as raw bytes. This change
attempts to decode those paths as ISO-8859-1 bytes, allowing the
path expansion to ignore any multibyte characters rather than
improperly decoding them. The final encoding is used to put the
ISO-8859-1 bytes back into their 8-bit form and re-mark them as
the eventually negotiated encoding.

This fixes part of the WEBrick "cjk" test failure by allowing the
binary-encoded multibyte path to be expanded without mangling the
MBC character.

This change is limited to when the incoming strings are explicitly
marked as "binary" using the ASCII-8BIT encoding, since that case
clearly has no encoding hint for us to use to get characters.
There may be other cases, like CR_UNKNOWN or CR_BROKEN, that also
deserve this treatment, but for now the change is limited to
explicitly binary strings.

See jruby#6171 for the general WEBRick bug and jruby#6246 for the PR that
attempts to fix failures.
ISO is not descriptive for anyone unfamiliar with ISO-8859-1
encoding, and "raw" more clearly conveys that the content will
be decoded as raw single-byte character content.
@headius headius force-pushed the make_expand_path_dumber branch from 5a7a4d9 to 9eef789 Compare May 29, 2020 21:18
Our current logic for dealing with file existence is intertwined
with a lot of JDK file APIs, so we don't have the option here
of just using a raw-encoded string. This change attempts to use
the system default encoding when the incoming path is specified as
"binary" so that hopefully its characters will be decoded
properly.

This fixes cases where a properly-encoded multibyte path is
marked as binary, as is the case in the WEBrick file-serving
pipeline described in jruby#6246.
@headius headius force-pushed the make_expand_path_dumber branch from 678afc3 to e6e03ae Compare May 29, 2020 22:03
This attempts to centralize the acquisition of a decoded path
string into a central location that does all of the following:

* Extracts already-decoded paths from IO objects
* Uses get_path logic to acquire a path string from other objects
* Checks for embedded nulls
* Normalizes the path on Windows based on the current directory
* Decodes binary paths as though encoded like filesystem paths
@headius
Copy link
Member Author

headius commented May 29, 2020

After force-pushing I have committed to the narrowed binary-only change and also renamed the confusing "decodeISO" logic to "decodeRaw" to more clearly indicate that it's decoding raw byte content.

@headius headius added this to the JRuby 9.3.0.0 milestone May 29, 2020
@headius
Copy link
Member Author

headius commented May 29, 2020

The additional fixes deal with places where we use Java strings for paths and can't easily avoid doing so, because those strings feed into JDK APIs that expect properly-decoded characters. When we have such a string in hand, marked as binary, this logic will attempt to decode it as the default system encoding. This may not be correct, if the string is both marked as binary and has a different encoding from the system, but when dealing with file paths it seems a reasonable assumption. It will at least be correct more often than treating the path as ISO0-8859-1 bytes when it may actually have multibyte characters.

This fixes a large number of APIs that previous just decoded binary strings as ISO-8859-1:

  • Dir entries and globbing
  • File test methods like directory?
  • File-opening APIs like File.new

The logic has largely been centralized in RubyFile.getAdjustedPath, which does all of the following:

  • Extracts an already-decoded path if the given argument is an IO object
  • Coerces a path using the get_path logic for other object types
  • Checks for null bytes
  • Handles adjusting the CWD and windows path elements
  • Decodes binary strings as though they are system-encoded (the fix described above)

@headius headius merged commit 1165085 into jruby:master May 29, 2020
@headius headius deleted the make_expand_path_dumber branch May 29, 2020 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant