Fix `isalpha`/`isalnum` and regex `\w` to use Unicode General Category instead of Alphabetic derived property by Copilot · Pull Request #7520 · RustPython/RustPython

Copilot · 2026-03-27T16:36:32Z

Rust's char::is_alphabetic()/char::is_alphanumeric() use the Unicode Alphabetic derived property, which is a superset of the Unicode letter categories and includes some non-spacing marks (Mn). CPython's str.isalpha() and str.isalnum() are defined strictly by General Category membership (letters: Lu/Ll/Lt/Lm/Lo; numbers: Nd/Nl/No), causing divergence for characters like U+0345 (COMBINING GREEK YPOGEGRAMMENI, category Mn).

"\u0345".isalnum()   # RustPython: True  /  CPython: False
re.match(r"\w", "\u0345")  # RustPython: match  /  CPython: None
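
The General Category rule described above can be checked from CPython with the standard `unicodedata` module. This is a minimal sketch, not the PR's Rust code: `is_letter_by_category` and `is_alnum_by_category` are hypothetical helper names introduced only for illustration.

```python
import unicodedata

# General Category groups named in the PR description.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}
NUMBER_CATEGORIES = {"Nd", "Nl", "No"}


def is_letter_by_category(ch: str) -> bool:
    """Hypothetical helper: True if ch is in a letter category (L*)."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_alnum_by_category(ch: str) -> bool:
    """Hypothetical helper: True if ch is in a letter or number category."""
    return unicodedata.category(ch) in LETTER_CATEGORIES | NUMBER_CATEGORIES


# U+0345 is a non-spacing mark (Mn), so neither helper accepts it.
print(unicodedata.category("\u0345"))   # Mn
print(is_letter_by_category("\u0345"))  # False
print(is_alnum_by_category("\u0345"))   # False
print(is_alnum_by_category("a"), is_alnum_by_category("7"))  # True True
```

A check based on the Alphabetic derived property would accept U+0345, which is the divergence this PR targets.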

Changes

crates/vm/src/builtins/str.rs: Replace char::is_alphabetic() / char::is_alphanumeric() in isalpha() and isalnum() with explicit GeneralCategory checks (L* for isalpha; L* + Nd/Nl/No for isalnum). When GeneralCategory returns Unassigned (i.e. the character is unknown to unic-ucd-category's Unicode 10.0.0 database), fall back to char::is_alphabetic() / char::is_alphanumeric() so that letters added in Unicode 11+ (e.g. U+0560, U+0588, U+05EF, U+0870+) are correctly recognised.
crates/sre_engine/src/string.rs: Apply the same fix to is_uni_alnum() (used for \w in regex); remove the pre-existing // TODO: check with cpython note.
crates/sre_engine/Cargo.toml: Add unic-ucd-category workspace dependency.
extra_tests/snippets/builtin_str_unicode.py: Add regression assertions for U+0345 across isalpha, isalnum, and re.match(r"\w", ...).
extra_tests/snippets/builtin_str.py: Add regression assertions for letter codepoints from Unicode 11+ ([1376, 1416, 1519, 2160–2166]) that were previously misclassified as non-alphabetic due to the stale Unicode 10.0.0 table in unic-ucd-category 0.9.0.
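
The fallback described for crates/vm/src/builtins/str.rs can be sketched in Python. This is an illustration under stated assumptions rather than the actual Rust change: `is_alpha_with_fallback`, `stale_category`, and `permissive_fallback` are hypothetical names, and the stale table is simulated by pretending that U+0560 is unassigned.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_alpha_with_fallback(ch, category_of, fallback):
    """Category check first; defer to `fallback` only for unassigned chars."""
    category = category_of(ch)
    if category == "Cn":  # "Cn" = unassigned in the table being consulted
        return fallback(ch)
    return category in LETTER_CATEGORIES


def stale_category(ch):
    """Toy stand-in for an outdated table: pretends U+0560 is unassigned."""
    return "Cn" if ch == "\u0560" else unicodedata.category(ch)


def permissive_fallback(ch):
    """Toy stand-in for a broader, newer property check; always says yes."""
    return True


print(is_alpha_with_fallback("\u0560", stale_category, permissive_fallback))  # True
print(is_alpha_with_fallback("\u0345", stale_category, permissive_fallback))  # False
print(is_alpha_with_fallback("a", stale_category, permissive_fallback))       # True
```

With this design, any character the table does not know is classified by whatever definition the fallback uses, so the result is only as accurate as that fallback.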

Original prompt

Problem Details

The regex module in RustPython is treating \w as matching more characters than CPython does. The disagreement stems from the isalnum() function, leading to incorrect matching behavior by \w. For example:
import re

assert not re.match(r"\w", "\u0345"), r"\w should not match U+0345 (category Mn)"
Expected Behavior

The regex pattern \w should only match characters for which isalnum() returns True as per the Python documentation. The assertion in the example should pass without throwing an error when behavior matches CPython standards.

In CPython:

"\u0345".isalnum() returns False

Thus, \w in regex does not match \u0345

Current Behavior

In RustPython, "\u0345".isalnum() returns True. Consequently, the regex pattern \w incorrectly matches \u0345 in violation of CPython standards.

Solution

Fix the implementation of isalnum() to align RustPython with CPython's behavior for Unicode character classification, ensuring regex patterns behave accordingly.
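
The link between \w and isalnum() can be checked mechanically. The sketch below assumes that a str pattern's \w matches exactly the characters for which isalnum() is true, plus the underscore; `word_mismatches` is a hypothetical helper name.

```python
import re

WORD = re.compile(r"\w")


def word_mismatches(code_points):
    """Return code points where \\w and str.isalnum() (plus "_") disagree."""
    bad = []
    for cp in code_points:
        ch = chr(cp)
        matches_word = WORD.fullmatch(ch) is not None
        expected = ch.isalnum() or ch == "_"
        if matches_word != expected:
            bad.append(cp)
    return bad


# An empty list means the regex engine and str.isalnum() agree on this range.
print(word_mismatches(range(0x3000)))
```

Running the same check on both interpreters shows whether the regex engine and the string methods share one classification.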

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Create a pr to fix it

Assistant: A pull request has been initiated to address the issue with inspect.getsource returning incomplete source code. Let me know if further assistance is required!

User: Accepted Confirmation: Are you sure?

Assistant: The pull request to fix the issue with inspect.getsource has been initiated. The Copilot coding agent is now working on this. You can track the progress of this task in the repository. Let me know if you have further instructions!

User: Create a pr to fix this

Assistant: The pull request to resolve the Unicode character discrepancy in the regex module has been initiated, and the Copilot coding agent is actively working on it. Let me know if you need further updates!

This pull request was created from Copilot chat.



youknowone · 2026-03-27T23:38:55Z

@copilot fix the ci failure


Copilot · 2026-03-27T23:47:18Z

> @copilot fix the ci failure

Fixed in fab1c0c. The lint failure was caused by import re being placed after other statements instead of being grouped and sorted with import unicodedata. The ruff isort pre-commit hook (--select I) flagged it. Moved import re to line 14 alongside import unicodedata in alphabetical order.

sunmy2019 · 2026-03-28T09:42:34Z

unic-ucd-category is unmaintained. icu_properties is the currently recommended successor.
rustsec/advisory-db#2424 (comment)

~~Can you let copilot to update?~~

I see unic-ucd is already in the repo. No worries for now.

youknowone · 2026-03-28T13:22:01Z

@sunmy2019 Thank you for pointing that out. Could you please open an issue about that? Let's see if we can move to it.

sunmy2019 · 2026-03-28T15:53:49Z

> Could you please open an issue about that?

I opened #7529.

I ran an A/B test on [chr(i) for i in range(0x110000)] between CPython (3.11, which I think is fine here) and this PR.

isalpha: 6663 mismatches (out of 1114112).
You can test with chr(i) for i in [1376, 1416, 1519, 2160, 2161, 2162, 2163, 2164, 2165, 2166]

isalnum: 6952 mismatches (out of 1114112).
You can test with chr(i) for i in [1376, 1416, 1519, 2160, 2161, 2162, 2163, 2164, 2165, 2166]
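
An A/B comparison of this kind could be scripted roughly as follows. This is a sketch and not the script actually used for the numbers above: `true_code_points` and `mismatches` are hypothetical helper names, and how the sets are exchanged between the two interpreters (for example, one code point per line in a text file) is left open.

```python
import sys


def true_code_points(method_name):
    """Code points for which str.<method_name>() is True on this interpreter."""
    method = getattr(str, method_name)
    return {cp for cp in range(sys.maxunicode + 1) if method(chr(cp))}


def mismatches(ours, theirs):
    """Sorted code points classified differently by the two interpreters."""
    return sorted(ours ^ theirs)


# Compute this set on each interpreter, then compare the two results.
alpha = true_code_points("isalpha")
print(len(alpha))
```

The symmetric difference catches disagreements in both directions, so one count covers false positives and false negatives.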

youknowone · 2026-03-29T04:38:13Z

@copilot Check this comment: #7520 (comment)
Add those tests under extra_tests/snippets/builtin_str.py

@sunmy2019 Thanks!


Copilot · 2026-03-29T04:53:18Z

> @copilot Check this comment: #7520 (comment)
> Add those tests under extra_tests/snippets/builtin_str.py
>
> ...

Done in b9cbd51. Added regression assertions for codepoints [1376, 1416, 1519, 2160–2166] in extra_tests/snippets/builtin_str.py.

To make those tests pass, I also fixed the underlying issue: unic-ucd-category 0.9.0 uses Unicode 10.0.0 and classifies these characters (U+0560, U+0588, U+05EF, U+0870+, all letters added in Unicode 11+) as Unassigned. The fix adds a fallback to char::is_alphabetic() / char::is_alphanumeric() (Rust std, Unicode 15+) when GeneralCategory returns Unassigned. Characters like U+0345 (Mn/NonspacingMark) are known to Unicode 10 so they still correctly return False.

sunmy2019 · 2026-03-29T06:58:03Z

Interesting. The Unicode version does matter here.

Python 3.14 uses Unicode 16.0.0, while Rust 1.92.0 and Python 3.15 use Unicode 17.0.0.

I then reran the A/B test on [chr(i) for i in range(0x110000)], this time between CPython 3.15 and this PR.

isalpha: 160 (out of 1114112) mismatches:

chr(i) for i in [2199, 3076, 3315, 3457, 6847, 6848, 6860, 6861, 6862, 7410, 7411, 43263, 68900, 68901, 68902, 68903, 68969, 69291, 69292, 69370, 69371, 69372, 69747, 69748, 69826, 69957, 69958, 70094, 70095, 70209, 70584, 70585, 70586, 70587, 70588, 70589, 70590, 70591, 70592, 70594, 70597, 70599, 70600, 70601, 70602, 70604, 70605, 71724, 71725, 71726, 71727, 71728, 71729, 71730, 71731, 71732, 71733, 71734, 71735, 71736, 71984, 71985, 71986, 71987, 71988, 71989, 71991, 71992, 71995, 71996, 72000, 72002, 72145, 72146, 72147, 72148, 72149, 72150, 72151, 72154, 72155, 72156, 72157, 72158, 72159, 72164, 72544, 72545, 72546, 72547, 72548, 72549, 72550, 72551, 73098, 73099, 73100, 73101, 73102, 73104, 73105, 73107, 73108, 73109, 73110, 73459, 73460, 73461, 73462, 73472, 73473, 73475, 73524, 73525, 73526, 73527, 73528, 73529, 73530, 73534, 73535, 73536, 90398, 90399, 90400, 90401, 90402, 90403, 90404, 90405, 90406, 90407, 90408, 90409, 90410, 90411, 90412, 90413, 90414, 94031, 94079, 94080, 94081, 94082, 94083, 94084, 94085, 94086, 94087, 94192, 94193, 94196, 94197, 94198, 123023, 124643, 124646, 124654, 124655, 124661]

isalnum: 157 (out of 1114112) mismatches:

chr(i) for i in [2199, 3076, 3315, 3457, 6847, 6848, 6860, 6861, 6862, 7410, 7411, 43263, 68900, 68901, 68902, 68903, 68969, 69291, 69292, 69370, 69371, 69372, 69747, 69748, 69826, 69957, 69958, 70094, 70095, 70209, 70584, 70585, 70586, 70587, 70588, 70589, 70590, 70591, 70592, 70594, 70597, 70599, 70600, 70601, 70602, 70604, 70605, 71724, 71725, 71726, 71727, 71728, 71729, 71730, 71731, 71732, 71733, 71734, 71735, 71736, 71984, 71985, 71986, 71987, 71988, 71989, 71991, 71992, 71995, 71996, 72000, 72002, 72145, 72146, 72147, 72148, 72149, 72150, 72151, 72154, 72155, 72156, 72157, 72158, 72159, 72164, 72544, 72545, 72546, 72547, 72548, 72549, 72550, 72551, 73098, 73099, 73100, 73101, 73102, 73104, 73105, 73107, 73108, 73109, 73110, 73459, 73460, 73461, 73462, 73472, 73473, 73475, 73524, 73525, 73526, 73527, 73528, 73529, 73530, 73534, 73535, 73536, 90398, 90399, 90400, 90401, 90402, 90403, 90404, 90405, 90406, 90407, 90408, 90409, 90410, 90411, 90412, 90413, 90414, 94031, 94079, 94080, 94081, 94082, 94083, 94084, 94085, 94086, 94087, 94192, 94193, 123023, 124643, 124646, 124654, 124655, 124661]

The reason is that Rust's char::is_alphabetic/is_alphanumeric are not equivalent to Python's definition (by General Category).

For example, \u0d81 is alphabetic, but it is in the Nonspacing_Mark category, so it causes a mismatch.
https://util.unicode.org/UnicodeJsps/character.jsp?a=0d81
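
One way to inspect a list of mismatching code points is to group them by General Category. The helper below is a hypothetical sketch (`categories_of` is not part of the PR). Its output depends on the Unicode version bundled with the interpreter, which `unicodedata.unidata_version` reports: code points newer than that version show up as Cn.

```python
import unicodedata
from collections import Counter


def categories_of(code_points):
    """Count the General Category of each code point on this interpreter."""
    return Counter(unicodedata.category(chr(cp)) for cp in code_points)


# Unicode version of the running interpreter's database.
print(unicodedata.unidata_version)

# Small fixed sample: one uppercase letter, one lowercase letter, one Mn mark.
print(categories_of([0x41, 0x61, 0x0345]))
```

Passing one of the mismatch lists above to `categories_of` on a sufficiently recent interpreter shows which categories the disagreements fall into.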

I think what we need here is a newer Rust library that provides correct General Category data for Unicode 17.0.0. If unic-ucd-category were still maintained, a simple update should have fixed this.

This also affects the stdlib unicodedata module.

youknowone · 2026-03-29T12:55:30Z

@sunmy2019 Thank you for the good analysis!

Initial plan

2722bc0

Copilot AI assigned Copilot and youknowone Mar 27, 2026

Copilot started work on behalf of youknowone March 27, 2026 16:36 View session

Fix isalnum/isalpha to use Unicode general category checks; fix regex…

5dd88ee

… \\w for Mn characters Agent-Logs-Url: https://github.com/RustPython/RustPython/sessions/6f573a91-8811-486c-933d-7ba9a9067643 Co-authored-by: youknowone <69878+youknowone@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix isalnum function to align with CPython standards~~ Fix isalpha/isalnum and regex \w to use Unicode General Category instead of Alphabetic derived property Mar 27, 2026

Copilot finished work on behalf of youknowone March 27, 2026 16:59

Copilot AI requested a review from youknowone March 27, 2026 16:59

Copilot started work on behalf of youknowone March 27, 2026 23:39 View session

Fix import ordering: move import re to be sorted with import unicodedata

fab1c0c

Agent-Logs-Url: https://github.com/RustPython/RustPython/sessions/18cc7216-15cf-40d1-8726-23e21aa4c368 Co-authored-by: youknowone <69878+youknowone@users.noreply.github.com>

Copilot finished work on behalf of youknowone March 27, 2026 23:47

youknowone marked this pull request as ready for review March 28, 2026 03:48

Copilot started work on behalf of youknowone March 29, 2026 04:38 View session

Fix isalpha/isalnum fallback for chars unassigned in Unicode 10; add …

b9cbd51

…regression tests Agent-Logs-Url: https://github.com/RustPython/RustPython/sessions/43f9f7dc-f5af-48b8-b93b-6363bcda334c Co-authored-by: youknowone <69878+youknowone@users.noreply.github.com>

Copilot finished work on behalf of youknowone March 29, 2026 04:54

Copilot wants to merge 4 commits into main from copilot/fix-isalnum-function-behavior

3 participants
