Fix isalpha/isalnum and regex \w to use Unicode General Category instead of Alphabetic derived property #7520
… \\w for Mn characters Agent-Logs-Url: https://github.com/RustPython/RustPython/sessions/6f573a91-8811-486c-933d-7ba9a9067643 Co-authored-by: youknowone <69878+youknowone@users.noreply.github.com>
@copilot fix the CI failure
Agent-Logs-Url: https://github.com/RustPython/RustPython/sessions/18cc7216-15cf-40d1-8726-23e21aa4c368 Co-authored-by: youknowone <69878+youknowone@users.noreply.github.com>
Fixed in fab1c0c. The lint failure was caused by |
I see unic-ucd is already in the repo. No worries for now.
@sunmy2019 Thank you for pointing that out. Could you please open an issue about it? Let's try if we can move on
I opened #7529. I ran an A/B test on
@copilot Check this comment: #7520 (comment)
@sunmy2019 Thanks!
…regression tests Agent-Logs-Url: https://github.com/RustPython/RustPython/sessions/43f9f7dc-f5af-48b8-b93b-6363bcda334c Co-authored-by: youknowone <69878+youknowone@users.noreply.github.com>
... Done in b9cbd51. Added regression assertions for codepoints To make those tests pass, I also fixed the underlying issue: |
Interesting. The Unicode version does matter here. Python 3.14 uses Unicode 16.0.0, while Rust 1.92.0 and Python 3.15 use Unicode 17.0.0. Now I ran tests between

isalpha: 160 (out of 1114112) mismatches: chr(i) for i in
[2199, 3076, 3315, 3457, 6847, 6848, 6860, 6861, 6862, 7410, 7411, 43263, 68900, 68901, 68902, 68903, 68969, 69291, 69292, 69370, 69371, 69372, 69747, 69748, 69826, 69957, 69958, 70094, 70095, 70209, 70584, 70585, 70586, 70587, 70588, 70589, 70590, 70591, 70592, 70594, 70597, 70599, 70600, 70601, 70602, 70604, 70605, 71724, 71725, 71726, 71727, 71728, 71729, 71730, 71731, 71732, 71733, 71734, 71735, 71736, 71984, 71985, 71986, 71987, 71988, 71989, 71991, 71992, 71995, 71996, 72000, 72002, 72145, 72146, 72147, 72148, 72149, 72150, 72151, 72154, 72155, 72156, 72157, 72158, 72159, 72164, 72544, 72545, 72546, 72547, 72548, 72549, 72550, 72551, 73098, 73099, 73100, 73101, 73102, 73104, 73105, 73107, 73108, 73109, 73110, 73459, 73460, 73461, 73462, 73472, 73473, 73475, 73524, 73525, 73526, 73527, 73528, 73529, 73530, 73534, 73535, 73536, 90398, 90399, 90400, 90401, 90402, 90403, 90404, 90405, 90406, 90407, 90408, 90409, 90410, 90411, 90412, 90413, 90414, 94031, 94079, 94080, 94081, 94082, 94083, 94084, 94085, 94086, 94087, 94192, 94193, 94196, 94197, 94198, 123023, 124643, 124646, 124654, 124655, 124661]

isalnum: 157 (out of 1114112) mismatches: chr(i) for i in
[2199, 3076, 3315, 3457, 6847, 6848, 6860, 6861, 6862, 7410, 7411, 43263, 68900, 68901, 68902, 68903, 68969, 69291, 69292, 69370, 69371, 69372, 69747, 69748, 69826, 69957, 69958, 70094, 70095, 70209, 70584, 70585, 70586, 70587, 70588, 70589, 70590, 70591, 70592, 70594, 70597, 70599, 70600, 70601, 70602, 70604, 70605, 71724, 71725, 71726, 71727, 71728, 71729, 71730, 71731, 71732, 71733, 71734, 71735, 71736, 71984, 71985, 71986, 71987, 71988, 71989, 71991, 71992, 71995, 71996, 72000, 72002, 72145, 72146, 72147, 72148, 72149, 72150, 72151, 72154, 72155, 72156, 72157, 72158, 72159, 72164, 72544, 72545, 72546, 72547, 72548, 72549, 72550, 72551, 73098, 73099, 73100, 73101, 73102, 73104, 73105, 73107, 73108, 73109, 73110, 73459, 73460, 73461, 73462, 73472, 73473, 73475, 73524, 73525, 73526, 73527, 73528, 73529, 73530, 73534, 73535, 73536, 90398, 90399, 90400, 90401, 90402, 90403, 90404, 90405, 90406, 90407, 90408, 90409, 90410, 90411, 90412, 90413, 90414, 94031, 94079, 94080, 94081, 94082, 94083, 94084, 94085, 94086, 94087, 94192, 94193, 123023, 124643, 124646, 124654, 124655, 124661]

The reason is that Rust's

For example,

I think what we need here is a newer Rust library that provides the correct Unicode category data for Unicode 17.0.0. If unic-ucd-category were still maintained, a simple update should have fixed this. This also affects stdlib
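The script behind the mismatch counts above is not shown in the thread. As a rough sketch only, a full-range comparison of two character predicates could look like the following in Python. The names `is_letter_category` and `find_mismatches` are hypothetical, and the second predicate stands in for "the other implementation"; the real A/B test presumably compared two different interpreters rather than two functions in one process.

```python
import sys
import unicodedata

# General Category values that count as letters (Lu/Ll/Lt/Lm/Lo).
LETTER_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo"})


def is_letter_category(ch: str) -> bool:
    """Hypothetical reference predicate: letter by General Category only."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def find_mismatches(pred_a, pred_b) -> list[int]:
    """Return every code point on which the two predicates disagree."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)  # 0 .. 0x10FFFF inclusive
        if pred_a(chr(cp)) != pred_b(chr(cp))
    ]


mismatches = find_mismatches(str.isalpha, is_letter_category)
print(f"isalpha: {len(mismatches)} (out of {sys.maxunicode + 1}) mismatches")
```

Within a single CPython build the two predicates draw on the same Unicode database, so few or no mismatches are expected there; differences like the ones listed above only show up when the two sides carry different Unicode versions or use different properties.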
@sunmy2019 Thank you for the good analysis! |
Rust's `char::is_alphabetic()`/`char::is_alphanumeric()` use the Unicode Alphabetic derived property, which is a superset of the Unicode letter categories and includes some non-spacing marks (Mn). CPython's `str.isalpha()` and `str.isalnum()` are defined strictly by General Category membership (letters: Lu/Ll/Lt/Lm/Lo; numbers: Nd/Nl/No), causing divergence for characters like U+0345 (COMBINING GREEK YPOGEGRAMMENI, category Mn).

Changes

- `crates/vm/src/builtins/str.rs`: Replace `char::is_alphabetic()`/`char::is_alphanumeric()` in `isalpha()` and `isalnum()` with explicit `GeneralCategory` checks (L* for `isalpha`; L* + Nd/Nl/No for `isalnum`). When `GeneralCategory` returns `Unassigned` (i.e. the character is unknown to `unic-ucd-category`'s Unicode 10.0.0 database), fall back to `char::is_alphabetic()`/`char::is_alphanumeric()` so that letters added in Unicode 11+ (e.g. U+0560, U+0588, U+05EF, U+0870+) are correctly recognised.
- `crates/sre_engine/src/string.rs`: Apply the same fix to `is_uni_alnum()` (used for `\w` in regex); remove the pre-existing `// TODO: check with cpython` note.
- `crates/sre_engine/Cargo.toml`: Add `unic-ucd-category` workspace dependency.
- `extra_tests/snippets/builtin_str_unicode.py`: Add regression assertions for U+0345 across `isalpha`, `isalnum`, and `re.match(r"\w", ...)`.
- `extra_tests/snippets/builtin_str.py`: Add regression assertions for letter codepoints from Unicode 11+ ([1376, 1416, 1519, 2160–2166]) that were previously misclassified as non-alphabetic due to the stale Unicode 10.0.0 table in `unic-ucd-category 0.9.0`.

Original prompt
This pull request was created from Copilot chat.
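To make the General Category rule from the description concrete, here is a minimal Python sketch using only the standard library. It is not the PR's Rust code or its actual test file; the helper names `is_alpha_by_category` and `is_alnum_by_category` are made up for illustration, and the Nd/Nl/No set simply mirrors the wording of the description.

```python
import re
import unicodedata

# Category sets as stated in the PR description (assumed, not copied from the diff).
LETTER_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo"})
NUMBER_CATEGORIES = frozenset({"Nd", "Nl", "No"})


def is_alpha_by_category(ch: str) -> bool:
    """Letter test based purely on General Category."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_alnum_by_category(ch: str) -> bool:
    """Letter-or-number test based purely on General Category."""
    category = unicodedata.category(ch)
    return category in LETTER_CATEGORIES or category in NUMBER_CATEGORIES


# U+0345 COMBINING GREEK YPOGEGRAMMENI has the Alphabetic property,
# but its General Category is Mn, so it is neither a letter nor a number here.
ypogegrammeni = "\u0345"
assert unicodedata.category(ypogegrammeni) == "Mn"
assert not is_alpha_by_category(ypogegrammeni)
assert not is_alnum_by_category(ypogegrammeni)

# CPython's built-ins and regex \w agree with the category-based result.
assert not ypogegrammeni.isalpha()
assert not ypogegrammeni.isalnum()
assert re.match(r"\w", ypogegrammeni) is None
```

The last three assertions are in the spirit of the regression checks the description mentions for U+0345; the checks for Unicode 11+ letters are left out here because their outcome depends on the Unicode version of the interpreter running them.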