Skip to content

Parse surrogates in string literals properly#5629

Merged
coolreader18 merged 3 commits into
RustPython:mainfrom
coolreader18:surrogate-literals
Mar 27, 2025
Merged

Parse surrogates in string literals properly#5629
coolreader18 merged 3 commits into
RustPython:mainfrom
coolreader18:surrogate-literals

Conversation

@coolreader18

Copy link
Copy Markdown
Member

Because ruff represents string literals as str, we have to reparse every string that might have a surrogate \uDXXX escape as wtf8. Then there were a bunch of crashes to fix, because surrogate strings in test/ actually get parsed properly now instead of being subbed with REPLACEMENT_CHARACTER.

@youknowone youknowone requested a review from Copilot March 27, 2025 04:29
@youknowone

Copy link
Copy Markdown
Member

interesting, copilot reviewer is added. I tried a test run on this pr

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR aims to improve the handling of surrogate escapes in string literals by reparsing them into WTF‑8 format so that previously crashing surrogate strings now produce valid output. Key changes include:

  • Adding surrogate-specific parsing logic in the string parser and updating usage of string constants to use WTF‑8.
  • Adjusting code in the WTF‑8 module to introduce new types for lead and trail surrogates and improve decoding.
  • Updating various tests and Cargo.toml dependencies to align with these changes.

Reviewed Changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated no comments.

File Description
compiler/codegen/src/string_parser.rs Implements surrogate-aware parsing for string literals.
common/src/wtf8/mod.rs Introduces new surrogate types and adds decoding enhancements.
Lib/test/*.py and other test files Updates expected failure markers and related test expectations.
compiler/core/Cargo.toml & compiler/codegen/Cargo.toml Adds and adjusts dependencies required for the changes.
Comments suppressed due to low confidence (3)

compiler/codegen/src/string_parser.rs:280

  • Using unwrap_or_else with an empty match in parse_string_literal may mask unexpected errors during surrogate parsing. Consider handling the error explicitly with a descriptive panic message or returning a proper Result.
        .unwrap_or_else(|x| match x {})

common/src/wtf8/mod.rs:767

  • In the from_bytes function, always advancing by 3 bytes after decoding a surrogate sequence may be fragile if the assumptions about surrogate lengths change. Consider adding a comment to document this behavior or validating the boundary before slicing.
            rest = &rest[3..];

compiler/codegen/src/compile.rs:2682

  • [nitpick] The code branch that reparses string literals containing the replacement character introduces additional processing. Verify that this extra reparsing is required for proper surrogate handling and that the performance impact is acceptable.
if value.contains(char::REPLACEMENT_CHARACTER) {

@coolreader18

Copy link
Copy Markdown
Member Author

I guess I'm reviewing the review, lol

This PR aims to improve the handling of surrogate escapes in string literals by reparsing them into WTF‑8 format so that previously crashing surrogate strings now produce valid output

Close, but reparsing the strings to WTF-8 actually began causing crashes. The second commit then fixes those crashes.

The following are "low confidence" but I guess that's for good reason

Using unwrap_or_else with an empty match in parse_string_literal may mask unexpected errors during surrogate parsing. Consider handling the error explicitly with a descriptive panic message or returning a proper Result.

        .unwrap_or_else(|x| match x {})

Not how uninhabited types work lol

In the from_bytes function, always advancing by 3 bytes after decoding a surrogate sequence may be fragile if the assumptions about surrogate lengths change. Consider adding a comment to document this behavior or validating the boundary before slicing.

If UTF-8 changes, we have more issues than just this.

[nitpick] The code branch that reparses string literals containing the replacement character introduces additional processing. Verify that this extra reparsing is required for proper surrogate handling and that the performance impact is acceptable.

Well, yes, that's why it's behind a .contains(REPLACEMENT_CHAR) branch in the first place.

Comment thread common/src/wtf8/mod.rs Outdated
Comment thread vm/src/stdlib/codecs.rs
@@ -26,7 +26,7 @@ mod _codecs {
fn lookup(encoding: PyStrRef, vm: &VirtualMachine) -> PyResult {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably we need PyUtf8StrRef to avoid the repeating try_to_str(vm) too much

@@ -0,0 +1,287 @@
//! A stripped-down version of ruff's string literal parser, modified to

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would ruff parser also need to be patched?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, though I'm not sure whether they'd necessarily feel a need to. I think what might strike a good balance is if there's a flag that can be set like contains_surrogates, so we can know whether or not we need to reparse.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a question for @MichaReiser.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably. There's an open issue but it isn't something that has come up often. astral-sh/ruff#13666

Co-authored-by: Jeong, YunWon <69878+youknowone@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants