Conversation
18ce341 to
c8587f1
Compare
|
I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks. |
|
In response to @arnt:
Absolutely, please do add unit tests (or updates to existing tests) for
It's not uncommon for E2E test timeouts to occur in this repo's automated jobs. These can be ignored. |
|
Thanks for the updated PR! The updates to This isn't a formal test report (there are lots of scenarios needed), but I ran through a few smoke tests so as to get back to you quickly on what could be addressed next. It would be great to get additional reviewers/testers involved to cover what I've missed:
|
|
Sigh. I see what you mean. My own focus is/was quite different: I wanted Contact Form 7 and similar to accept unicode email addresses, which requires some utility functions in Wordpress to do the right thing. I didn't aspire to complete support in Wordpress. The thinking behind your response seems to be that either Wordpress supports it properly or not at all, there's no inbetween. And, sadly, I think you're right and I was wrong. I'll extend the PR. It'll take a few weeks. I'll also find some testers. |
|
Hi, you asked for wider testing. Around 20 sites have now run the patch in production, the longest-running one for more than six months already. Nothing has broken. The sites are real (University of Somewhere, not someone's private five-visitor site). If you want, I can send you the email addresses of some testers. When I pinged them last week, around half said it was okay to give you their addresses, I'm sure most will say the same if I ask a few times ;) |
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Unlinked AccountsThe following contributors have not linked their GitHub and WordPress.org accounts: @peteresnick. Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases. Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
|
I liked the accounts earlier today. |
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
|
Thanks for the update, @arnt, and for seeking wider testing! There are still some outstanding questions and one previous failure to consider, but it's great to see this moving forward! 🙌🏻 A good next step would be to rebase this PR on the latest |
|
Finding testers who were willing and able to run a patch on sites that mattered to them was hard work. That took a good long while. Do you want their names? If so, please send me mail. Rebasing is not a problem, of course. The other stuff was a little difficult. I only recently found a good workaround for IntlChar's lack of support for UAX24. Can you suggest a model end-to-end test that I could emulate and write automatic tests for the desired features? Manual testing is possible, but I'd much prefer to have complete automatic test coverage. |
cef6fe4 to
28b954e
Compare
|
Hi, the remaining problems, as far as I can tell:
|
79ee473 to
3b0d8e5
Compare
|
I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split(). A bit on testing:
And… oh drat… Must retest the site setup wizard, too much code has changed. Two questions remain.
|
|
Just set up a new site from scratch using 'rød@grå.org' as admin address; it works. |
|
Hi @ironprogrammer, would you mind having a look? Also, on an unrelated matter, I'm with you. I've worked as a full-time maintainer on a widely used open source project and got a lot of flak. It hurt. Some people can be really shitty. Don't let it grind you down. |
|
Thanks a lot, @arnt, and apologies for not being involved with this the past several months 🙏🏻. I've got some other work priorities at this time, but hope to be able to dive back in soon! In the meantime, for code review purposes, I would suggest raising this Trac ticket during #core Dev Chat, or dropping in a comment on the next agenda post to add it to open floor discussion. This will help spread some awareness of the initiative. If you're available on Slack and sync during that time to answer questions, that would be ideal. Glad you could wrangle additional testing for the updates! If this info can be made public through test reports in Trac, that would be amazing -- and also prerequisite to merging as Core committers begin taking a look at the ticket. You could get additional visibility by sharing the ticket in the #core-test channel. There may also be contributors willing to help put test instructions together based on what you've got so far. This update covers a lot of territory, so the more detailed the tests for each use case, the better. |
fb001aa to
d87d101
Compare
f0456b1 to
7f3ae71
Compare
| * Verifies that an email is valid. | ||
| * | ||
| * Does not grok i18n domains. Not RFC compliant. | ||
| * The mostly matches what people think is the format of email |
There was a problem hiding this comment.
Little typo here: **The** mostly matches...
| $j = rand( 0, 1 + $hex_encoding ); | ||
|
|
||
| if ( 0 === $j ) { | ||
| if ( ord( $email_address[ $i ] ) > 127 ) { |
There was a problem hiding this comment.
Isn't more clear >= 128 ?
Maybe I would also add a little note here for reference.
There was a problem hiding this comment.
It’s not likely clearer to those who work with text at a low level. 0x7F or 0x80 might be, but it’s still a subjective call. US-ASCII is 0x00–0x7F (0–127) so that’s where the use of this value corresponds to the underlying system.
There was a problem hiding this comment.
But >127 saves a byte from the download!
There was a problem hiding this comment.
I believe we will need more than one byte to save the world 😅
There was a problem hiding this comment.
But >127 saves a byte from the download!
This claim is not true, of course. When zipped or otherwise compressed we may not be changing the size at all. That would be interesting to know, of course, but not significant, particularly because we’re talking about the server downloading code and not the browser.
It likely doesn’t even save one byte of storage space given that storage is allocated in blocks. If it were to increment the storage space at all, it would probably be 4 kilobytes, or 512 bytes, or whatever the inode/sector/block size is of the device and/or filesystem.
There was a problem hiding this comment.
I confess that I was not entirely serious about that byte. Should have added a smiley. But oh! I don't have smileys on my keymap! Maybe I ought to join the 21st century and fix that.
dmsnell
left a comment
There was a problem hiding this comment.
Sorry to leave a partial review, but I am losing network access and wanted to get what I have out the door.
I think there’s a lot we can do to either shrink or grow this change, and I’m worried about rushing it for 6.9.
Revamped support for internationalized email would be a lovely headliner for WordPress 7.0, among others. Coincidentally I pushed out #9313 not too long ago.
I’ll come back later and finish working through this, but perhaps there will be updates before then. Thanks for putting up with all the nitpicks @arnt.
src/wp-includes/formatting.php
Outdated
| // If strict, reduce to ASCII for max portability. | ||
| if ( $strict ) { | ||
| // If mixing different scripts, remove all but ASCII. | ||
| if ( ! uses_single_unicode_script( $username ) ) { |
There was a problem hiding this comment.
Can you talk about why you created this function and made the choice to strip non-ASCII, and to do it only when detecting characters from different script families?
One might assume that the goal here is to avoid mixed-script spoofing, but without any explanation it’s just guessing.
If we are trying to prevent spoofing, I might have follow-up questions about that, about the method, and about the approach in this PR. If not, I am very curious to know why it’s here.
There was a problem hiding this comment.
I tried to reconcile several partially conflictinging goals, safely:
- allow single-script logins, because the names most people want are based on the real-world names they have, which are single-script.
- maintain the greatest possible compablity with what the code did before.
- avoid FAQs
People are so quick to talk about Latin and Cyrillic letter that look like each other. It's an instant objection and tends to derail conversations about unicode and identifiers. So I tried to allow the kind of nicenames that are the goal, maintain philosophical compatibility with existing use of $strict and defuse the instant objection. The table became a little bigger than I liked, but on the other hand it's stopped growing: The unicode consortium isn't adding even more blocks.
I don't actually think spoofing is a priority here. People talk about spoofing all the time and did you knw that Latin and Cyrillic a look like each other. Meanwhile the spoofers use lower-case L and upper-case i, or just don't bother: foobanksecurity432423@hotmail.com is a workable spoof for info@foobank.com.
Finally, I also wanted to be a little safe about the way to get there. I wanted to extend the syntax to what I know is desired now, then maybe go further in a next step. It's easy to extend the set of permissible logins, difficult to restrict it when someone may already use a newly impermissible name.
There was a problem hiding this comment.
One more thing. I didn't make the choice to strip non-ASCII, that was in the code already, last changed in 2020. Rather, I tried to keep as much existing behaviour as possible.
There was a problem hiding this comment.
I thought it was about spoofing; while I would welcome additional thoughts from others, I think most evidence points to spoofing being less of a real risk than some fear it is, so I do question some of the value of taking this unique classification and presenting variant behaviors in cases that aren’t obvious.
Many valid legitimate names involve multiple scripts anyway.
In general I wouldn’t worry too much about predicted “instant objections.” If people do have objections they can share them, but I wouldn’t want anyone to feel pressure to propose things nobody’s is going to end up wanting, out of an abundance of caution…
There was a problem hiding this comment.
I'm not familiar with legitimate cross-script use other than a couple of special cases, but if you're willing to drop that function, I'll be happy to push a version without it. Simpler code is nicer code.
| $j = rand( 0, 1 + $hex_encoding ); | ||
|
|
||
| if ( 0 === $j ) { | ||
| if ( ord( $email_address[ $i ] ) > 127 ) { |
There was a problem hiding this comment.
It’s not likely clearer to those who work with text at a low level. 0x7F or 0x80 might be, but it’s still a subjective call. US-ASCII is 0x00–0x7F (0–127) so that’s where the use of this value corresponds to the underlying system.
| $j = rand( 0, 1 + $hex_encoding ); | ||
|
|
||
| if ( 0 === $j ) { | ||
| if ( ord( $email_address[ $i ] ) > 127 ) { |
There was a problem hiding this comment.
It’s possible to debate the value of this function today, but doesn’t this change effectively disable it for non-ASCII characters?
If we make such a large change it would be best to have discussion about it and do it intentionally.
There are tools we can use to work through UTF-8 code points and not have to make this choice. See my comment above on the script detection.
There was a problem hiding this comment.
It does disable the function for such addresses, and that isn't a problem right now.
ICANN gathers abuse data (primarily to forward it to registrars in real time), I have access to the statistics, and unicode addresses aren't a factor now. There are reasons for that and I'd be glad to talk about it, but whatever the reasons, a function called 'antispambot' can relax about that sort of address. It needs to deal with the effects, not with the reasons.
There was a problem hiding this comment.
Regarding antispambot. I'm at the IETF124 meeting this week, and today I presented an internet-draft whose chief goal is robust detection of bad unicode addresses, in various senses of 'bad'. That draft isn't ready to ship in WP yet.
I think you and I might discuss on Slack what approach antispambot could/should take, and I'll contribute a PHP implementation when the draft's at Last Call, if you want it.
There was a problem hiding this comment.
To be clear, I suspect this function is entirely useless and might provide as much protection as returning its input unchanged would.
my point is that if we’re going to introduce changes to it as part of this patch, we might want to follow form instead of changing its behavior for some new inputs. Or we could discuss removing all of it, but in a separate and focused discussion
There was a problem hiding this comment.
It has to be changed as part of this patch: I became aware of it because a unit test broke. A breaking unit test is a serious matter.
I too feel that its utiliy is difficult to see. If you don't want it, fine, fine! If you do want something like it, I'd like to contribute something later, without holding up this PR.
src/wp-includes/formatting.php
Outdated
|
|
||
| if ( version_compare( PHP_VERSION, '7.4.0', '<' ) ) { | ||
| // Since mb_str_split is not available in PHP < 7.4 we can only check ASCII characters. | ||
| return (bool) preg_match( '/^[a-zA-Z0-9 _.\-@]+$/i', $input ); |
There was a problem hiding this comment.
because preg_match() can return false in addition to 0 it’s generally wise to avoid casting its return to boolean. maybe we say in this case it doesn’t matter, but if, say, the PCRE engine does do something weird and fail, we get fuzzy results.
in any case isn’t the return inverted? if the PCRE pattern matches, meaning there are non-ASCII (or ASCII outside of the checked range), this returns (I saw the true. if, on the other hand, there are only ASCII letters and numbers it will return false. am I missing something?^ inside of the [] instead of where it actually is, on the outside)
There was a problem hiding this comment.
I get your point. In this case, however, I think the correct resolution is to delete this code as soon as WP requires PHP 7.4.
The cast is a lesser evil than the poor functionality before PHP 7.4 IMO.
I tried to write a unit test for unexpected failure now, and couldn't. It seems that the only failure mode is regex syntax byut the regex is constant and valid. If someone changes the regex, the existing unit tests will catch the mistake.
src/wp-includes/formatting.php
Outdated
|
|
||
| $block = 0; | ||
| // phpcs:ignore PHPCompatibility.FunctionUse.NewFunctions.mb_str_splitFound -- old versions of PHP are handled above | ||
| foreach ( mb_str_split( $input ) as $cp ) { |
There was a problem hiding this comment.
while I have questions in this function, I want to be careful about clouding the review with minutia which may or may not matter later. still, because I saw issues with the iteration twice in this proposed patch I wanted to let you know that we do have better tools within WordPress to handle it.
This is ideally a job anyway for grapheme_extract() since that doesn’t preallocate an array for each character in the string. That may not be available, but WordPress 6.9 introduces a private internal utility called _wp_scan_utf8() which provides the fallback for iterating over code points.
I’ve been thinking of polyfilling grapheme_extract() and we can do that if we need to. In the meantime, a decent fallback involves a combination pre-scanning past ASCII bytes and then calling _wp_scan_utf8() for multi-byte code-point iteration.
$at = 0;
$end = strlen( $input );
while ( $at < $end ) {
$ascii_span = strspn( $input, " -.@0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\\_abcdefghijklmnopqrstuvwxyz" );
if ( $ascii_span > 0 ) {
// has the checked ASCII chars…
}
$at += $ascii_span;
if ( $at >= $end ) {
break;
}
$next_at = $at;
$invalid_length = 0;
$found_code_points = _wp_scan_utf8(
$input,
$next_at,
$invalid_length,
/* max bytes */ null,
/* max code points */ 1
);
if ( $invalid_length > 0 ) {
// invalid UTF-8. not necessary to check this if validating the string first
}
$code_point = substr( $input, $at, $next_at - $at );
$at = $next_at + $invalid_length;
}Looks like a lot of code, but that’s because it’s intentionally a low-level interface.
There was a problem hiding this comment.
grapheme_extract() doesn't actually help here. The function that's really needed wasn't exposed by IntlChar, even though it has to exist, PCRE uses it.
A grapheme is generally a single codepoint in a block such as BLOCK_CODE_LATIN_SOMETHING followed by zero or more combining code points from e.g. COMBINING_DIACRITICAL_MARKS. This code iterates over the code points and ignores those in COMBINING_blah, then checks whether there are code points in different blocks. Splitting into graphemes and then ignoring the COMBINING_blah isn't an improvement.
It would be much better to get the script property from IntlChar. With that, this function would be ten lines or so.
It might be advisable to delete the whole thing later. Although I'm wary about that, as I mentioned people are very quick to talk about homoglyphs.
There was a problem hiding this comment.
Extremely pleasant to discuss with someone who understands unicode btw.
There was a problem hiding this comment.
grapheme_extract() is poorly named. If you read its manual page that I linked you will see that it supports iterating over code points, bytes, and grapheme clusters.
it will provide what you are doing using mb_str_split() but avoids the eager allocation into an array.
There was a problem hiding this comment.
I see. And it even seems to be available in PHP 7.2. Neat! I'll push a new revision that uses that earlyish tomorrow.
There was a problem hiding this comment.
@SergeyBiryukov in case you missed this thread, I am discouraging the use of mb_str_split() due to its needlessly bloated memory characteristics and due to the fact that we have functions which provide a closer match to the purpose of this code: to scan through code points.
src/wp-includes/formatting.php
Outdated
|
|
||
| // If strict, remove reduce to letters and numbers. | ||
| if ( $strict ) { | ||
| $username = preg_replace( '|[^a-z0-9 _.\-@\p{L}\p{N}]|iu', '', $username ); |
There was a problem hiding this comment.
there are many existing cases missing this, but we should check for UTF-8 PCRE support before calling it otherwise this is guaranteed to set $username = false
$username = _wp_can_use_pcre_u()
? preg_replace( '|…|iu', '', $username )
: preg_replace( '|[^a-z0-9 _.\-@]|', '', $username );though it would be possible still to fallback to a manual replacement method if we care that much.
this all highlights how much we could carefully revise all of the assumptions in these functions and think through the implications of what we allow or not
There was a problem hiding this comment.
I didn't bother with any test since existing code assumed that /u is supported. Here is a commit from 2019.
I looked at other usage now to emulate existing fallbacks, and didn't find any fallbacks ;) Not sure how to handle this.
There was a problem hiding this comment.
That’s why I was trying to share how it ought to be done. This check has been missed before, but it’s ideal not to add more unchecked code. There’s work in progress to remove all of the unchecked code.
There was a problem hiding this comment.
Well, OK.
I see you even added a way to mock absence. Will push something suitable.
|
I'm grateful for the nitpicks. Seriously. I also wish this could be shipped; that would make life simpler for me. I'd like to have a talk about growing/shrinking it. I'll keep an eye on Slack tomorrow. Have several meetings though, including one which I chair, so I might be pressed for time, depending on your timezone. I decided not to rebase/push now. @SirLouen's comment change clearly should go in but it's minor, and I'd rather like to improve that cast but don't see how to write a unit test. I think we can continue talking without a new push. |
Btw, also while I was reviewing this patch, I sent you a little PR with some little extra formatting tweaks for the tests part. Nothing big, just improving CS a bit and setting a wider group to test this ticket at once. |
|
@SirLouen I'll go and have some food now, and merge your PR either when I return or tomorrow morning. I really don't want to break anything due to jetlag and lack of focus. |
having just jumped ahead by nine hours, I feel something similar @arnt — hope you recover quickly. here are some things I reviewed yesterday:
I was left with a few more questions in general though:
CREATE TABLE test_l1 (t varchar(100) character set latin1 collate latin1_swedish_ci) DEFAULT CHARSET=latin1;$wpdb->query( "INSERT INTO test_latin1 (t) VALUES ('a\u{1f170}b\u{1f171}')" );
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES (_latin1'a\u{1f170}b\u{1f171}')" );MariaDB [wordpress]> SELECT * FROM test_l1;
+----------------------+
| t |
+----------------------+
| a?b? |
| a🅰b🅱 |
+----------------------+
2 rows in set (0.000 sec)In the first case, we sent UTF-8 data to the database, which knows that the table is storing php > var_dump( $wpdb->get_col( 'SELECT t FROM test_l1' ) );
array(2) {
[0]=>
string(4) "a?b?"
[1]=>
string(20) "a🅰b🅱"
}
php > var_dump( $wpdb->get_col( 'SELECT CAST(t as binary) FROM test_l1' ) );
array(2) {
[0]=>
string(4) "a?b?"
[1]=>
string(10) "a🅰b🅱"
}…unless we also do something on the read query side then the database will perform the same transparent re-encoding on read turning those UTF-8 bytes back into UTF-8 and thus double-encoding them. This seems problematic to me in a new way because after this change we are allowing people to create accounts with Unicode login names and emails, but these database queries won’t fail; they will corrupt the data and someone might lose access to the account they just created, or to an account for which they just changed their email address. If we want to ensure this doesn’t happen we would need to trace the data through all places that access these and ensure we properly manage the character encoding in the SQL session, something that doesn’t tend to go well in WordPress due to the distributed nature of code contributions and plugins. To me this highlights the need for deprecating non-UTF-8 support (Core-62172), though I guarantee we will be doing whatever we need to to support Unicode emails before we do that.
As named this PR seems very grand in purpose: what does it even mean to support RFC 6530? I feel like as a Core tracking ticket that is appropriate, but the work is likely to involve many independent PRs and changes, including some big semantic changes around usernames and emails, possibly involving some database schema changes. For now, however, if we assume that this PR should move forward, here are some things I propose, pending further security review:
Further, I think we need to talk about what email characters are allowable. Our current set of rejected ASCII character includes octets which could be allowable email addresses. If our intention is to be restrictive, say to reject non-printable characters, or ambiguous whitespace, then we probably ought to see if we can replicate that experience for the upper Unicode planes too and not just turn away from those goals because it’s not ASCII. Please enjoy your conference. There’s no rush; as this work will probably take a lot of cooperation. We’ll bring it up next week in the developer chat. |
At a glance, would a polyfill for |
|
@SergeyBiryukov I would say that a polyfill makes sense if WP is to support PHP 7.2-3 for a while and a polyfill is simple. The support table makes it look as if support might be dropped in WP 7.0? @dmsnell that's a long and very welcome message. Do you think I should attend to that now, or leave you in peace until 6.9 is released? I'd really like to merge this soon, for several reasons, but I imagine the 6.9 release needs developer focus. |
@arnt you don’t need to worry about me or my schedule. just like anticipated objections, it will likely be more advantageous to just move at your own pace and let others work at theirs, be it fast or slow. I am guessing the most important thing right now is figuring out the issue of database corruption since this creates a new avenue for that, and for potentially locking users out of their accounts because of it. we don’t necessarily have to solve it, but it’s a noteworthy impact from this change and creates a problem we didn’t previously have. |
|
I'll write several separate comments here, on separate topics, and push a new revision at the end. First, cross-script usernames. There's effectively no evidence of cross-script usernames being used, either for good or for bad. There's an exception and ha half for good: ① Q and other Latin letters appear to be used a little bit in Chinese, including the well-known name 阿Q, and ② Arabic digits (0-9) are used with many writing systems, even ones that have other digits. (Arabic itself is an example of that, in the areas east of Lebanon/Jordan.) I know about web forums etc. that allow non-ASCII usernames. Unfortunately for this, the ones I know only allow one script. There's domains, though. Domains in many TLDs are like self-service usernames, so that experience is instructive. ICANN receives abuse reports for domains and forwards them to registrars; I don't read that feed closely but I think I would hear about examples of cross-script similarity abuse. I haven't so far. Visually similar letters have been used in real attacks, but I don't know about any cross-script use. Rather, i has been used for l (PaypaI a long time ago, phiIIips recently) and 0 for o, and one clever attack used a little-known diacritic and an all-Latin domain. Impersonators mostly rely on non-letter similarity as far as I can tell. "Look, my email or web page has your bank's logo so I must be a legit representive of your bank." "My bluesky account has a photo of that person so you can trust that I really am that person." It's similarity but it's not letters. However, I suspect that if it's possible to impersonate a wordpress user on the same site and use the impersonation for a swindle, someone will eventually do it. It's more likely to be done by copying a profile photo than by leveraging cross-script similarity. So… I think it's reasonable to drop single_unicode_script, either nor or later. The only substantive reason to keep is that having it from the start is IMO easier than adding it later. |
|
About UTF8 and other database encodings. As I understand it, Wordpress 2.2 started using UTF8 for new databases and 4.2 converted old databases to UTF8, except when admins chose to override that. Overriding is possible to this day. There will be a few sites that do override. I read ticket 62172 and it makes a great deal of sense to me. Agree about the brittleness. However, that change probably shouldn't be conflated with this. It seems likely that a non-UTF8 database will suffer corruption in other tables, not just the usernames. That may be less serious. It won't lock anyone out. However, this problem could affect the core value of the site. When you publish a page, saving it to a non-UTF8 database can turn a dozen different currency symbols into question marks. Easy to overlook. Far from good. It seems… less urgent than usernames but worth solving. Having written all this, I think I've formed an opion:
Does that sound good? |
|
Next up: The scope of this PR. As @dmsnell asks:
The first version of the PR did indeed touch only addresses. I extended it by request because as contributor, I try to conform to whatever is expected. At this point I'd prefer to keep the agglomerated PR because:
|
|
Email address length: Wordpress currently supports 100 bytes reliably, more unreliably. With the change I suggest above it'll store 100 characters reliably or reject the address at entry time. Email address length limits are a bit of a mess in general. This is partly due to the specifications. The length limit for an email address differs from the length limit for an email address on the public internet. Too subtle for most people. Then there's widespread disobedience. and even worse, there's software and SaaS vendors that judge email addresses and have their own unpublished length limits. If you enter a 70-byte email address in a newsletter signup form, you may well be rejected because a SaaS vendor classifies the address as a bot. Both 100 bytes and 100 characters are enough that any human who tries to use an address near the limit will run into problems quickly. Much software has shorter limits (a well-known CRM has 80) and few people can spell reliably enough to type a really long address correctly from a business card. It's sad, I wish the situation were cleaner, but 100 is a safe length in the world as it stands, and there isn't really any number that's obviously right. FWIW, my name is in both RFC 5321 (length on net) and 5322 (length). |
I’d like to hear the thoughts from some others who have more experience dealing with text encoding issues, particularly in places where we create new ones. Perhaps @mdawaffe has some or knows someone with thoughts. A check for round-tripping seems reasonable and awkward at the same time, because making those checks everywhere would be excessive. For an email address though, maybe it’s worth the diversion.
We’re discussing a new reality which has no prior in WordPress and which can be dramatically different. The moment someone uses the flag of England in their email address, that’s 28 characters alone. Meaning that if someone uses the US flag in their email they are allowed 92 other ASCII characters but for the England flag it only leaves 72. If we only allow storing Unicode emails on database tables with UTF-8 support then we could avoid this problem. Maybe that’s a reasonable feature-gate for this: don’t enable non-ASCII email address unless the Were we to pursue that angle, it could be worth imagining a new global toggle or check like
Perhaps it’s my poor reading skills, but I re-read that comment multiple times and couldn’t find the request you’re referring to. Maybe @ironprogrammer left another comment on an older version of the code? You are likely underselling your opinions here. This is a collaborative effort and we are all working together; I know from your previous comments that there are times you read more into things than people leave when they comment. It’s okay to ask questions and engage in the back and forth — if we all instantly did whatever anyone asked us then we’d have chaos, or LLM-level-slop. I’d really love to consider this only in the context of sites with a There are a few places email addresses are read from query args: I think in the initial site installation and in some new user requests. This is not relatively important compared to the other feedback, but it will be something we want to make sure we check at some point, largely to ensure that when decoded they are valid UTF-8 and that they are properly percent-decoded. Perhaps it’s fine as-is and the email functions will verify that when passed the information (apart from the percent-encoding). Either way, it’d be great to confirm what happens when some emails are sent with invalid query args. If anyone would be interested in testing the same steps that others like @USERSATOSHI and @ironprogrammer have performed, but with a site setup with It would be valuable to test emails containing “normal” Unicode like non-latin scripts (which use only up to three bytes) as well as addresses containing characters from the supplemental planes, such as emoji (requiring four bytes). |
|
This is a cool proposal. Note that I have not made the time to fully understand the conversation above. I don't think anything I've written below is particularly new: apologies if it is obviously repetitive :)
As you'd imagine from the method name, it strips characters (and optionally truncates the string) rather than bailing. I think, though, you could do a clever I don't think that's necessary, though, since I think we should instead only allow non-ASCII email addresses in
I don't think we can ignore the difference. We should require I don't have a strong opinion about limiting strings to having characters from only one script (plus ASCII). On small sites, it likely doesn't matter at all. On large sites (like WordPress.com or sites where WordPress+bbPress is used for public forums) with user registration available to anyone, spoofing is likely a big concern. I agree that I do have one strong opinion: non-WordPress code should be able to make "raw" SELECTs to the database and retrieve the correct username/email address. That is, we should not write "corrupted" or otherwise transformed data to these columns even if WordPress correctly uncorrupts/untransforms the values when it read them. That means no charset shenanigans or hacky encodings (e.g., Base64). For usernames, another (non-?)option would be to allow any characters, but to reject new usernames that are too "close" to existing ones. I don't know how to do that efficiently without adding a new (indexed) column to the users table, which I think is a bad idea. Fly-by suggestions:
|
|
I just pushed a new revision. I discovered that wpdb already protects against not being able to store the data. The database may corrupt something, but wpdb already tries really hard to avoid getting that far. I added an extra unit test, since locking people out is worse than most failures. Side note: That function uses FWIW I would be unhappy with a toggle for email address support. Those toggles are trouble magnets and I strongly advise against going that way. A toggle for username support is another thing. I know a data set that can be used for username similarity detection… but that's an unpleasant slippery slope, with much scope for pedantry about details. Some of the decisions in the data set puzzled me. Best avoided, I'd say. I've no opinion on restricting this to UTF8 databases. If WP were my baby, I'd announce that support for other databases will be sunset in the middlish future and thereby make the smaller question go away, along with the risk caused by brittle code. I didn't have time to rebase, I'm afraid, but I wanted to get this pushed before I go on a trip tomorrow morning. |
4498e43 to
3dc24ca
Compare
dmsnell
left a comment
There was a problem hiding this comment.
my comment is mostly irrelevant to the bigger picture in this PR, but I saw a thing so I figured I’d point it out to not lose it later
697aa9f to
7b7e00c
Compare
src/wp-includes/default-filters.php
Outdated
| // Email addresses: Allow so long as the database can store them. This | ||
| // affects all addresses, including those entered into contact forms. | ||
| if ( 'utf8mb4' !== $wpdb->charset ) { | ||
| add_filter( 'sanitize_email', 'wp_ascii_without_controls' ); | ||
| } | ||
|
|
||
| // Usernames: Allow if the database can store them. This might be a | ||
| // setting instead, so that a site can restrict its own users to ASCII. | ||
| if ( 'utf8mb4' !== $wpdb->charset ) { | ||
| add_filter( 'sanitize_user', 'wp_ascii_without_controls' ); | ||
| } |
There was a problem hiding this comment.
| // Email addresses: Allow so long as the database can store them. This | |
| // affects all addresses, including those entered into contact forms. | |
| if ( 'utf8mb4' !== $wpdb->charset ) { | |
| add_filter( 'sanitize_email', 'wp_ascii_without_controls' ); | |
| } | |
| // Usernames: Allow if the database can store them. This might be a | |
| // setting instead, so that a site can restrict its own users to ASCII. | |
| if ( 'utf8mb4' !== $wpdb->charset ) { | |
| add_filter( 'sanitize_user', 'wp_ascii_without_controls' ); | |
| } | |
| if ( 'utf8mb4' !== $wpdb->charset ) { | |
| /* | |
| * Email addresses: Allow so long as the database can store them. This | |
| * affects all addresses, including those entered into contact forms. | |
| */ | |
| add_filter( 'sanitize_email', 'wp_ascii_without_controls' ); | |
| /* | |
| * Usernames: Allow if the database can store them. This might be a | |
| * setting instead, so that a site can restrict its own users to ASCII. | |
| */ | |
| add_filter( 'sanitize_user', 'wp_ascii_without_controls' ); | |
| } |
Remove the duplicate conditions and correct the multi-line comment format.
dmsnell
left a comment
There was a problem hiding this comment.
it’s been a while since I’ve reviewed this and I don’t have the time to give that it truly deserves, but I’ll try and rehash some of my thoughts the best I can.
first of all, this is great work from @arnt — someone who understands the scope of the problem and is actively championing for its resolution. I want to comment @arnt for that effort.
but I want to back up and reach for earlier comments I left, which I think are the most-crucial questions to answer here. thankfully, one of them is largely already answered.
I discovered that wpdb already protects against not being able to store the data.
but still it seems we are making a number of implicit changes that aren’t getting enough eyes on them:
- this goes beyond allowing non-ASCII email addresses and imposes new and arbitrary or subjective rules about what kinds of emails are allowed. there is reference to various concepts like Unicode similarity and security concerns, but those policies aren’t technically enforced.
- there are questions about support on database encodings which cannot store UTF-8 adequately. ironically, a
latin1table can store this stuff just fine but will not play well with collation, search, and equality.
it seems like this is stuck in the middle of being comprehensive and not enough, and I think that could be the result of very diligent and honest attempts to meet the review feedback.
for me personally I could imagine it being easier to analyze at either extreme:
- do almost nothing other than open up valid-UTF-8 emails.
- handle all of the easily-foreseeable UX issues created by non-ASCII emails.
I’ll leave a comment in the Trac ticket, but I think this could use some more discussion in the weekly Dev Chat and also a post on Make to gather wider input; there are likely many parties affected by this change.
tests/phpunit/tests/user.php
Outdated
| * A metal umlaut fails because validate_username is | ||
| * strict and n̈ is unfamiliar in every language. | ||
| */ | ||
| $this->assertFalse( validate_username( 'spın̈altap' ) ); |
There was a problem hiding this comment.
this actually seems like the kind of username we would intentionally want to support, being a username someone might choose thinking they are clever, but not the kind I would expect from malicious attacks.
There was a problem hiding this comment.
FWIW it's been used in an attack. The attacker used an accent that's never used in the victims' language, and some victims overlooked the accent. Just a few pixels, not something the eye had practised parsing, easy to overlook.
However. I'm not married to this. I added it for a reason and now you know the reason, but that doesn't make the reason persuasive in Wordpress' context.
tests/phpunit/tests/user.php
Outdated
| public function test_user_corrupted_login() { | ||
| global $wpdb; | ||
| $rows = $wpdb->update( $wpdb->users, array( 'user_login' => hex2bin( 'c0' ) ), array( 'ID' => $this->author->ID ) ); | ||
| $this->assertFalse( $rows ); |
There was a problem hiding this comment.
while I appreciate the extremely-helpful comment on this test, the test itself doesn’t seem to assert anything about the usernames; it only seems to be asserting that $wpdb return a failure on an update that MySQL is presumed to reject based on assumptions about the testing environment.
can we find a way to setup the test into the expected failure mode and make assertions about the code under test?
| 'nonascii@nonascii' => array( 'grå@grå.org', true ), | ||
| 'decomposed unicode' => array( 'gr\u{0061}\u{030a}blå@grå.org', true ), | ||
| 'weird but legal dots' => array( '..@example.com', true ), | ||
| ); |
There was a problem hiding this comment.
we seem to be making assertions about antispambot() here, but we’re also feeding it 100% valid UTF-80and then asserting that the result is valid UTF-8.
is there some case we would expect to fail? how would antispambot() be expected to return non-valid-UTF-8 given valid inputs?
There was a problem hiding this comment.
Well, it broke before. The code looked a little adhoc, so I added a few moderately different test cases and made them pass, then left them in. An antispam function might be modified in a hurry... best to leave a few tests.
| * Test for invalid characters. | ||
| */ | ||
| $local = preg_replace( '/[^a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.-]/', '', $local ); | ||
| $local = preg_replace( '/[^a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.\x80-\xff-]/', '', $local ); |
There was a problem hiding this comment.
it seems like the intent behind this change is to open up all bytes for a domain other than the C0 control characters, including space, and also adding backspace/delete.
rather than accept any invalid UTF-8, would it make sense to start by rejecting invalid UTF-8 strings, and then perform a check for the rejected controls characters?
if (
! wp_is_valid_utf8( $local ) ||
strcspn( $local, "\x00\x01\x02...\x20\x7F" ) !== strlen( $local )
) {
return apply_filters( ... );
}There was a problem hiding this comment.
I like this. I seem to remember considering something like this, but the function that was there at the time didn't charm me. wp_is_valid_utf8 is so much nicer.
171e2e3 to
172067a
Compare
…tize_email This adds support for the unicode address extensions in RFC 6530-3, adds unit tests for that, extends the documentation to explain the relationship between this code and the various specifications, and finally adds unit tests to ensure that the documentation's description of the code remains correct. During testing, it became clear that antispambot() worked only for strings using a single-byte encoding, while this uses UTF8. Fixed. Fixes #31992. Props SirLouen, dmsnell, tusharbharti, mukeshpanchal27.
172067a to
1514f31
Compare

This adds support for unicode email addresses in is_email and sanitise_email. It is intended to be sufficient to accept unicode email addresses in contact forms etc.,
it does not aspire to unicode support everywhere, in particular it doesn't deal with unicode in logins or passwords.
It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one, but don't quite understand your testing policies. It looks like you want to keep the tests fast by avoiding unnecessary tests, I'd say, but I don't understand what you consider necessary and unnecessary. Anyway, if you want a file I'm happy to add one.
Trac ticket: https://core.trac.wordpress.org/ticket/31992