Skip to content

Support rfc6530#5237

Open
arnt wants to merge 1 commit intoWordPress:trunkfrom
arnt:support-rfc6530
Open

Support rfc6530#5237
arnt wants to merge 1 commit intoWordPress:trunkfrom
arnt:support-rfc6530

Conversation

@arnt
Copy link
Copy Markdown

@arnt arnt commented Sep 18, 2023

This adds support for unicode email addresses in is_email and sanitise_email. It is intended to be sufficient to accept unicode email addresses in contact forms etc.,

it does not aspire to unicode support everywhere, in particular it doesn't deal with unicode in logins or passwords.

It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one, but don't quite understand your testing policies. It looks like you want to keep the tests fast by avoiding unnecessary tests, I'd say, but I don't understand what you consider necessary and unnecessary. Anyway, if you want a file I'm happy to add one.

Trac ticket: https://core.trac.wordpress.org/ticket/31992

@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from 18ce341 to c8587f1 Compare September 18, 2023 12:13
@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 22, 2023

I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks.

@ironprogrammer
Copy link
Copy Markdown
Member

In response to @arnt:

It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one...

Absolutely, please do add unit tests (or updates to existing tests) for sanitize_email(). While Core test code coverage is pretty low today 😓, it is a project policy that new Core merges include appropriate tests.

I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks.

It's not uncommon for E2E test timeouts to occur in this repo's automated jobs. These can be ignored.

@ironprogrammer
Copy link
Copy Markdown
Member

Thanks for the updated PR! The updates to sanitize_email() appear to have resolved the issue with storing the correct email in the database 🎉.

This isn't a formal test report (there are lots of scenarios needed), but I ran through a few smoke tests so as to get back to you quickly on what could be addressed next. It would be great to get additional reviewers/testers involved to cover what I've missed:

  • ✅ Add New User: email with unicode characters is accepted and saved to the database correctly.
  • ✅ Profile (edit user): email can be changed with unicode characters, and when saved, persists correctly.
  • ❓ Do new user creation, email change confirmation, or password change emails go to the correct address?
  • ❓ Do links in these emails work correctly?
  • ❓ Does the forgotten password workflow work correctly with a unicode address?
  • ✅ Login: logging in with a unicode email works as expected.
  • ✅ Add post comment: email with unicode is accepted for post comments.
  • ❌ Comments (approval list): emails with unicode are not displayed correctly (Figure 1).
  • ✅ In the Comments list, the Email field under Quick Edit does appear correctly.
  • ❓ Site setup wizard: can the initial admin use a unicode address, and do the emails work correctly?
  • ❓ Do various author dropdowns display unicode addresses as expected?
  • ❓ Are unicode emails in feeds presented accurately?

Figure 1: Display of commenter unicode email addresses.
comments approval screen with incorrectly displayed unicode email addresses

@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 27, 2023

Sigh. I see what you mean.

My own focus is/was quite different: I wanted Contact Form 7 and similar to accept unicode email addresses, which requires some utility functions in Wordpress to do the right thing. I didn't aspire to complete support in Wordpress. The thinking behind your response seems to be that either Wordpress supports it properly or not at all, there's no inbetween.

And, sadly, I think you're right and I was wrong.

I'll extend the PR. It'll take a few weeks. I'll also find some testers.

@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 16, 2024

Hi,

you asked for wider testing. Around 20 sites have now run the patch in production, the longest-running one for more than six months already. Nothing has broken. The sites are real (University of Somewhere, not someone's private five-visitor site).

If you want, I can send you the email addresses of some testers. When I pinged them last week, around half said it was okay to give you their addresses, I'm sure most will say the same if I ask a few times ;)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 16, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Unlinked Accounts

The following contributors have not linked their GitHub and WordPress.org accounts: @peteresnick.

Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases.

Core Committers: Use this line as a base for the props when committing in SVN:

Props agulbra, sirlouen, mukesh27, justlevine, tusharbharti, dmsnell, ironprogrammer, sergeybiryukov, mdawaffe.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 16, 2024

I liked the accounts earlier today.

@github-actions
Copy link
Copy Markdown

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@ironprogrammer
Copy link
Copy Markdown
Member

Thanks for the update, @arnt, and for seeking wider testing! There are still some outstanding questions and one previous failure to consider, but it's great to see this moving forward! 🙌🏻

A good next step would be to rebase this PR on the latest trunk, and to resolve the coding standards issues to tidy things up for an updated review and testing. When the CI checks are in a happy state, then it will help attract more attention.

@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 23, 2024

Finding testers who were willing and able to run a patch on sites that mattered to them was hard work. That took a good long while. Do you want their names? If so, please send me mail.

Rebasing is not a problem, of course. The other stuff was a little difficult. I only recently found a good workaround for IntlChar's lack of support for UAX24.

Can you suggest a model end-to-end test that I could emulate and write automatic tests for the desired features? Manual testing is possible, but I'd much prefer to have complete automatic test coverage.

@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from cef6fe4 to 28b954e Compare December 3, 2024 14:06
@arnt
Copy link
Copy Markdown
Author

arnt commented Dec 3, 2024

Hi,

the remaining problems, as far as I can tell:

  1. The code now requires PHP 7.4, which I thought was OK but now I see that 7.4 is merely recommended, not required. Oops. Not clear to me how to do mb_str_split in 7.2.
  2. I've clicked around, but don't feel sure that I have all of the "various dropdowns". The filters in default-filters.php are okay, but I don't feel confident that every third-party plugin will be okay, or that I've looked at everything in the UI. Wordpress is large, there are nooks and corners…

@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from 79ee473 to 3b0d8e5 Compare December 4, 2024 08:50
@arnt
Copy link
Copy Markdown
Author

arnt commented Dec 4, 2024

I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split().

A bit on testing:

  • ✅ Add New User: email with unicode characters is accepted and saved to the database correctly.
  • ✅ Profile (edit user): email can be changed with unicode characters, and when saved, persists correctly.
  • ✅ New user creation, email change confirmation, and password change emails go to the correct address.
  • ✅ Links in all messages work correctly AFAICT (plugins can add new kinds of course…).
  • ✅ The forgotten password workflow works correctly with a unicode address.
  • ✅ Login: logging in with a unicode email works as expected.
  • ✅ Add post comment: email with unicode is accepted for post comments.
  • ✅ Comments (approval list): emails with unicode are displayed correctly.
  • ✅ In the Comments list, the Email field under Quick Edit does appear correctly.
  • ✅ WordPress works on an IPv6-only server (oh, you weren't wondering?)

And… oh drat… Must retest the site setup wizard, too much code has changed.

Two questions remain.

  1. What author dropdowns display email addresses? I can't find anything. Lots of user names displayed but not email addresses. Maybe I misunderstand. I checked for usage of email in the code that reads the database and use of that code in display code.
  2. How can email addresses ever appear in feeds? The feed generators don't seem to reveal authors etc.

@arnt
Copy link
Copy Markdown
Author

arnt commented Dec 4, 2024

Just set up a new site from scratch using 'rød@grå.org' as admin address; it works.

@arnt
Copy link
Copy Markdown
Author

arnt commented Jan 3, 2025

Hi @ironprogrammer,

would you mind having a look?

Also, on an unrelated matter, I'm with you. I've worked as a full-time maintainer on a widely used open source project and got a lot of flak. It hurt. Some people can be really shitty. Don't let it grind you down.

@ironprogrammer
Copy link
Copy Markdown
Member

Thanks a lot, @arnt, and apologies for not being involved with this the past several months 🙏🏻. I've got some other work priorities at this time, but hope to be able to dive back in soon!

In the meantime, for code review purposes, I would suggest raising this Trac ticket during #core Dev Chat, or dropping in a comment on the next agenda post to add it to open floor discussion. This will help spread some awareness of the initiative. If you're available on Slack and sync during that time to answer questions, that would be ideal.

Glad you could wrangle additional testing for the updates! If this info can be made public through test reports in Trac, that would be amazing -- and also prerequisite to merging as Core committers begin taking a look at the ticket.

You could get additional visibility by sharing the ticket in the #core-test channel. There may also be contributors willing to help put test instructions together based on what you've got so far. This update covers a lot of territory, so the more detailed the tests for each use case, the better.

@arnt arnt force-pushed the support-rfc6530 branch 6 times, most recently from fb001aa to d87d101 Compare June 23, 2025 14:20
Copy link
Copy Markdown
Member

@SirLouen SirLouen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some little code review

@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from f0456b1 to 7f3ae71 Compare June 23, 2025 19:19
@dmsnell dmsnell self-assigned this Nov 5, 2025
@SirLouen
Copy link
Copy Markdown
Member

SirLouen commented Nov 5, 2025

@arnt I sent you a PR with some improvements.
Also good to see that @dmsnell has self-assigned this. It will get through soon.

* Verifies that an email is valid.
*
* Does not grok i18n domains. Not RFC compliant.
* The mostly matches what people think is the format of email
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little typo here: **The** mostly matches...

$j = rand( 0, 1 + $hex_encoding );

if ( 0 === $j ) {
if ( ord( $email_address[ $i ] ) > 127 ) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't more clear >= 128 ?
Maybe I would also add a little note here for reference.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s not likely clearer to those who work with text at a low level. 0x7F or 0x80 might be, but it’s still a subjective call. US-ASCII is 0x00–0x7F (0–127) so that’s where the use of this value corresponds to the underlying system.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But >127 saves a byte from the download!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we will need more than one byte to save the world 😅

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But >127 saves a byte from the download!

This claim is not true, of course. When zipped or otherwise compressed we may not be changing the size at all. That would be interesting to know, of course, but not significant, particularly because we’re talking about the server downloading code and not the browser.

It likely doesn’t even save one byte of storage space given that storage is allocated in blocks. If it were to increment the storage space at all, it would probably be 4 kilobytes, or 512 bytes, or whatever the inode/sector/block size is of the device and/or filesystem.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confess that I was not entirely serious about that byte. Should have added a smiley. But oh! I don't have smileys on my keymap! Maybe I ought to join the 21st century and fix that.

Copy link
Copy Markdown
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to leave a partial review, but I am losing network access and wanted to get what I have out the door.

I think there’s a lot we can do to either shrink or grow this change, and I’m worried about rushing it for 6.9.

Revamped support for internationalized email would be a lovely headliner for WordPress 7.0, among others. Coincidentally I pushed out #9313 not too long ago.

I’ll come back later and finish working through this, but perhaps there will be updates before then. Thanks for putting up with all the nitpicks @arnt.

// If strict, reduce to ASCII for max portability.
if ( $strict ) {
// If mixing different scripts, remove all but ASCII.
if ( ! uses_single_unicode_script( $username ) ) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you talk about why you created this function and made the choice to strip non-ASCII, and to do it only when detecting characters from different script families?

One might assume that the goal here is to avoid mixed-script spoofing, but without any explanation it’s just guessing.

If we are trying to prevent spoofing, I might have follow-up questions about that, about the method, and about the approach in this PR. If not, I am very curious to know why it’s here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to reconcile several partially conflictinging goals, safely:

  • allow single-script logins, because the names most people want are based on the real-world names they have, which are single-script.
  • maintain the greatest possible compablity with what the code did before.
  • avoid FAQs

People are so quick to talk about Latin and Cyrillic letter that look like each other. It's an instant objection and tends to derail conversations about unicode and identifiers. So I tried to allow the kind of nicenames that are the goal, maintain philosophical compatibility with existing use of $strict and defuse the instant objection. The table became a little bigger than I liked, but on the other hand it's stopped growing: The unicode consortium isn't adding even more blocks.

I don't actually think spoofing is a priority here. People talk about spoofing all the time and did you knw that Latin and Cyrillic a look like each other. Meanwhile the spoofers use lower-case L and upper-case i, or just don't bother: foobanksecurity432423@hotmail.com is a workable spoof for info@foobank.com.

Finally, I also wanted to be a little safe about the way to get there. I wanted to extend the syntax to what I know is desired now, then maybe go further in a next step. It's easy to extend the set of permissible logins, difficult to restrict it when someone may already use a newly impermissible name.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing. I didn't make the choice to strip non-ASCII, that was in the code already, last changed in 2020. Rather, I tried to keep as much existing behaviour as possible.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was about spoofing; while I would welcome additional thoughts from others, I think most evidence points to spoofing being less of a real risk than some fear it is, so I do question some of the value of taking this unique classification and presenting variant behaviors in cases that aren’t obvious.

Many valid legitimate names involve multiple scripts anyway.

In general I wouldn’t worry too much about predicted “instant objections.” If people do have objections they can share them, but I wouldn’t want anyone to feel pressure to propose things nobody’s is going to end up wanting, out of an abundance of caution…

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with legitimate cross-script use other than a couple of special cases, but if you're willing to drop that function, I'll be happy to push a version without it. Simpler code is nicer code.

$j = rand( 0, 1 + $hex_encoding );

if ( 0 === $j ) {
if ( ord( $email_address[ $i ] ) > 127 ) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s not likely clearer to those who work with text at a low level. 0x7F or 0x80 might be, but it’s still a subjective call. US-ASCII is 0x00–0x7F (0–127) so that’s where the use of this value corresponds to the underlying system.

$j = rand( 0, 1 + $hex_encoding );

if ( 0 === $j ) {
if ( ord( $email_address[ $i ] ) > 127 ) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s possible to debate the value of this function today, but doesn’t this change effectively disable it for non-ASCII characters?

If we make such a large change it would be best to have discussion about it and do it intentionally.

There are tools we can use to work through UTF-8 code points and not have to make this choice. See my comment above on the script detection.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does disable the function for such addresses, and that isn't a problem right now.

ICANN gathers abuse data (primarily to forward it to registrars in real time), I have access to the statistics, and unicode addresses aren't a factor now. There are reasons for that and I'd be glad to talk about it, but whatever the reasons, a function called 'antispambot' can relax about that sort of address. It needs to deal with the effects, not with the reasons.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding antispambot. I'm at the IETF124 meeting this week, and today I presented an internet-draft whose chief goal is robust detection of bad unicode addresses, in various senses of 'bad'. That draft isn't ready to ship in WP yet.

I think you and I might discuss on Slack what approach antispambot could/should take, and I'll contribute a PHP implementation when the draft's at Last Call, if you want it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I suspect this function is entirely useless and might provide as much protection as returning its input unchanged would.

my point is that if we’re going to introduce changes to it as part of this patch, we might want to follow form instead of changing its behavior for some new inputs. Or we could discuss removing all of it, but in a separate and focused discussion

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has to be changed as part of this patch: I became aware of it because a unit test broke. A breaking unit test is a serious matter.

I too feel that its utiliy is difficult to see. If you don't want it, fine, fine! If you do want something like it, I'd like to contribute something later, without holding up this PR.


if ( version_compare( PHP_VERSION, '7.4.0', '<' ) ) {
// Since mb_str_split is not available in PHP < 7.4 we can only check ASCII characters.
return (bool) preg_match( '/^[a-zA-Z0-9 _.\-@]+$/i', $input );
Copy link
Copy Markdown
Member

@dmsnell dmsnell Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because preg_match() can return false in addition to 0 it’s generally wise to avoid casting its return to boolean. maybe we say in this case it doesn’t matter, but if, say, the PCRE engine does do something weird and fail, we get fuzzy results.

in any case isn’t the return inverted? if the PCRE pattern matches, meaning there are non-ASCII (or ASCII outside of the checked range), this returns true. if, on the other hand, there are only ASCII letters and numbers it will return false. am I missing something? (I saw the ^ inside of the [] instead of where it actually is, on the outside)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point. In this case, however, I think the correct resolution is to delete this code as soon as WP requires PHP 7.4.

The cast is a lesser evil than the poor functionality before PHP 7.4 IMO.

I tried to write a unit test for unexpected failure now, and couldn't. It seems that the only failure mode is regex syntax byut the regex is constant and valid. If someone changes the regex, the existing unit tests will catch the mistake.


$block = 0;
// phpcs:ignore PHPCompatibility.FunctionUse.NewFunctions.mb_str_splitFound -- old versions of PHP are handled above
foreach ( mb_str_split( $input ) as $cp ) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I have questions in this function, I want to be careful about clouding the review with minutia which may or may not matter later. still, because I saw issues with the iteration twice in this proposed patch I wanted to let you know that we do have better tools within WordPress to handle it.

This is ideally a job anyway for grapheme_extract() since that doesn’t preallocate an array for each character in the string. That may not be available, but WordPress 6.9 introduces a private internal utility called _wp_scan_utf8() which provides the fallback for iterating over code points.

I’ve been thinking of polyfilling grapheme_extract() and we can do that if we need to. In the meantime, a decent fallback involves a combination pre-scanning past ASCII bytes and then calling _wp_scan_utf8() for multi-byte code-point iteration.

$at  = 0;
$end = strlen( $input );
while ( $at < $end ) {
	$ascii_span = strspn( $input, " -.@0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\\_abcdefghijklmnopqrstuvwxyz" );
	if ( $ascii_span > 0 ) {
		// has the checked ASCII chars…
	}

	$at += $ascii_span;
	if ( $at >= $end ) {
		break;
	}

	$next_at           = $at;
	$invalid_length    = 0;
	$found_code_points = _wp_scan_utf8(
		$input, 
		$next_at, 
		$invalid_length,
		/* max bytes */ null,
		/* max code points */ 1
	);

	if ( $invalid_length > 0 ) {
		// invalid UTF-8. not necessary to check this if validating the string first
	}

	$code_point = substr( $input, $at, $next_at - $at );
	$at         = $next_at + $invalid_length;
}

Looks like a lot of code, but that’s because it’s intentionally a low-level interface.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grapheme_extract() doesn't actually help here. The function that's really needed wasn't exposed by IntlChar, even though it has to exist, PCRE uses it.

A grapheme is generally a single codepoint in a block such as BLOCK_CODE_LATIN_SOMETHING followed by zero or more combining code points from e.g. COMBINING_DIACRITICAL_MARKS. This code iterates over the code points and ignores those in COMBINING_blah, then checks whether there are code points in different blocks. Splitting into graphemes and then ignoring the COMBINING_blah isn't an improvement.

It would be much better to get the script property from IntlChar. With that, this function would be ten lines or so.

It might be advisable to delete the whole thing later. Although I'm wary about that, as I mentioned people are very quick to talk about homoglyphs.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extremely pleasant to discuss with someone who understands unicode btw.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grapheme_extract() is poorly named. If you read its manual page that I linked you will see that it supports iterating over code points, bytes, and grapheme clusters.

it will provide what you are doing using mb_str_split() but avoids the eager allocation into an array.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. And it even seems to be available in PHP 7.2. Neat! I'll push a new revision that uses that earlyish tomorrow.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SergeyBiryukov in case you missed this thread, I am discouraging the use of mb_str_split() due to its needlessly bloated memory characteristics and due to the fact that we have functions which provide a closer match to the purpose of this code: to scan through code points.


// If strict, remove reduce to letters and numbers.
if ( $strict ) {
$username = preg_replace( '|[^a-z0-9 _.\-@\p{L}\p{N}]|iu', '', $username );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are many existing cases missing this, but we should check for UTF-8 PCRE support before calling it otherwise this is guaranteed to set $username = false

$username = _wp_can_use_pcre_u()
	? preg_replace( '|…|iu', '', $username )
	: preg_replace( '|[^a-z0-9 _.\-@]|', '', $username );

though it would be possible still to fallback to a manual replacement method if we care that much.

this all highlights how much we could carefully revise all of the assumptions in these functions and think through the implications of what we allow or not

Copy link
Copy Markdown
Author

@arnt arnt Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't bother with any test since existing code assumed that /u is supported. Here is a commit from 2019.

I looked at other usage now to emulate existing fallbacks, and didn't find any fallbacks ;) Not sure how to handle this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s why I was trying to share how it ought to be done. This check has been missed before, but it’s ideal not to add more unchecked code. There’s work in progress to remove all of the unchecked code.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, OK.

I see you even added a way to mock absence. Will push something suitable.

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 7, 2025

I'm grateful for the nitpicks. Seriously. I also wish this could be shipped; that would make life simpler for me.

I'd like to have a talk about growing/shrinking it. I'll keep an eye on Slack tomorrow. Have several meetings though, including one which I chair, so I might be pressed for time, depending on your timezone.

I decided not to rebase/push now. @SirLouen's comment change clearly should go in but it's minor, and I'd rather like to improve that cast but don't see how to write a unit test. I think we can continue talking without a new push.

@SirLouen
Copy link
Copy Markdown
Member

SirLouen commented Nov 7, 2025

I decided not to rebase/push now. @SirLouen's comment change clearly should go in but it's minor, and I'd rather like to improve that cast but don't see how to write a unit test. I think we can continue talking without a new push.

Btw, also while I was reviewing this patch, I sent you a little PR with some little extra formatting tweaks for the tests part. Nothing big, just improving CS a bit and setting a wider group to test this ticket at once.

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 7, 2025

@SirLouen I'll go and have some food now, and merge your PR either when I return or tomorrow morning. I really don't want to break anything due to jetlag and lack of focus.

@dmsnell
Copy link
Copy Markdown
Member

dmsnell commented Nov 7, 2025

jetlag

having just jumped ahead by nine hours, I feel something similar @arnt — hope you recover quickly.


here are some things I reviewed yesterday:

  • WordPress appears to call sanitize_user() everywhere before detecting duplicate usernames, so changing that function should not accidentally allow duplicating user logins.
  • the schema.php indicates a 60-character login and 100-character email. for utf8mb4 configurations that means people will still be able to enter emails with “100 characters”.

I was left with a few more questions in general though:

  • for what reason are we adjusting sanitize_user() if this is about Unicode email addresses? I wouldn’t assume we have to do anything to the user objects.
  • what should we be doing if the database table cannot store the email addresses? for instance, if a table is configured as latin1 we have a problem. while that allows us to store the raw bytes just fine, unless we ensure that PHP and MySQL agree on what’s transferring through the query we will end up with corruption.
  • even if we ensure that we store non-ASCII characters properly, we end up with a length restriction for some email addresses based on the fact that VARCHAR(100) means 100 bytes for a latin1 table but up to 400 bytes for a utf8mb4 table. this is something to consider otherwise people might end up with some confusing error messages when submitting a valid email that looks like it’s shorter than WordPress and the database count it.
CREATE TABLE test_l1 (t varchar(100) character set latin1 collate latin1_swedish_ci) DEFAULT CHARSET=latin1;
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES ('a\u{1f170}b\u{1f171}')" );
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES (_latin1'a\u{1f170}b\u{1f171}')" );
MariaDB [wordpress]> SELECT * FROM test_l1;
+----------------------+
| t                    |
+----------------------+
| a?b?                 |
| a🅰b🅱           |
+----------------------+
2 rows in set (0.000 sec)

In the first case, we sent UTF-8 data to the database, which knows that the table is storing latin1, so it transparently re-encoded from UTF-8 to latin1, replacing unrepresentable characters as ?. In the second case we told the database that the UTF-8 bytes should be explicitly interpreted as latin1 and so it did store the raw bytes, however…

php > var_dump( $wpdb->get_col( 'SELECT t FROM test_l1' ) );
array(2) {
  [0]=>
  string(4) "a?b?"
  [1]=>
  string(20) "a🅰b🅱"
}
php > var_dump( $wpdb->get_col( 'SELECT CAST(t as binary) FROM test_l1' ) );
array(2) {
  [0]=>
  string(4) "a?b?"
  [1]=>
  string(10) "a🅰b🅱"
}

…unless we also do something on the read query side then the database will perform the same transparent re-encoding on read turning those UTF-8 bytes back into UTF-8 and thus double-encoding them.

This seems problematic to me in a new way because after this change we are allowing people to create accounts with Unicode login names and emails, but these database queries won’t fail; they will corrupt the data and someone might lose access to the account they just created, or to an account for which they just changed their email address. If we want to ensure this doesn’t happen we would need to trace the data through all places that access these and ensure we properly manage the character encoding in the SQL session, something that doesn’t tend to go well in WordPress due to the distributed nature of code contributions and plugins.

To me this highlights the need for deprecating non-UTF-8 support (Core-62172), though I guarantee we will be doing whatever we need to to support Unicode emails before we do that.


talk about growing/shrinking it

As named this PR seems very grand in purpose: what does it even mean to support RFC 6530? I feel like as a Core tracking ticket that is appropriate, but the work is likely to involve many independent PRs and changes, including some big semantic changes around usernames and emails, possibly involving some database schema changes.

For now, however, if we assume that this PR should move forward, here are some things I propose, pending further security review:

  • Drop the detection for cross-script spoofing. I would love to see some evidence that the risk is substantial enough that it’s worth the complexity, but I suspect that having rules which change based on the content of the input is going to be confusing for users and developers.
  • Missing from all of the changes are checks that the input is valid UTF-8. If we start by rejecting invalid UTF-8 then we can make a lot of assumptions and drop safety-checks without increasing risk. From my reading of the RFC I don’t see any allowance for addresses with invalid UTF-8.
  • Maintain existing behavior in the antispambot function. That is, let’s not skip non-ASCII characters just because we don’t see them as being present as threats. I think we can probably deprecate that function entirely and skip this, but we will create another discussion for that.

Further, I think we need to talk about what email characters are allowable. Our current set of rejected ASCII character includes octets which could be allowable email addresses. If our intention is to be restrictive, say to reject non-printable characters, or ambiguous whitespace, then we probably ought to see if we can replicate that experience for the upper Unicode planes too and not just turn away from those goals because it’s not ASCII.

Please enjoy your conference. There’s no rush; as this work will probably take a lot of cooperation. We’ll bring it up next week in the developer chat.

@SergeyBiryukov
Copy link
Copy Markdown
Member

@arnt

I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split().

At a glance, would a polyfill for mb_str_split() be helpful here, so that we could remove the function_exists() checks and make the behavior consistent across all supported PHP versions?

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 12, 2025

@SergeyBiryukov I would say that a polyfill makes sense if WP is to support PHP 7.2-3 for a while and a polyfill is simple. The support table makes it look as if support might be dropped in WP 7.0?

@dmsnell that's a long and very welcome message. Do you think I should attend to that now, or leave you in peace until 6.9 is released? I'd really like to merge this soon, for several reasons, but I imagine the 6.9 release needs developer focus.

@dmsnell
Copy link
Copy Markdown
Member

dmsnell commented Nov 12, 2025

Do you think I should attend to that now, or leave you in peace until 6.9 is released? I'd really like to merge this soon, for several reasons, but I imagine the 6.9 release needs developer focus.

@arnt you don’t need to worry about me or my schedule. just like anticipated objections, it will likely be more advantageous to just move at your own pace and let others work at theirs, be it fast or slow.


I am guessing the most important thing right now is figuring out the issue of database corruption since this creates a new avenue for that, and for potentially locking users out of their accounts because of it. we don’t necessarily have to solve it, but it’s a noteworthy impact from this change and creates a problem we didn’t previously have.

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 13, 2025

I'll write several separate comments here, on separate topics, and push a new revision at the end. First, cross-script usernames.

There's effectively no evidence of cross-script usernames being used, either for good or for bad. There's an exception and ha half for good: ① Q and other Latin letters appear to be used a little bit in Chinese, including the well-known name 阿Q, and ② Arabic digits (0-9) are used with many writing systems, even ones that have other digits. (Arabic itself is an example of that, in the areas east of Lebanon/Jordan.)

I know about web forums etc. that allow non-ASCII usernames. Unfortunately for this, the ones I know only allow one script.

There's domains, though. Domains in many TLDs are like self-service usernames, so that experience is instructive. ICANN receives abuse reports for domains and forwards them to registrars; I don't read that feed closely but I think I would hear about examples of cross-script similarity abuse. I haven't so far.

Visually similar letters have been used in real attacks, but I don't know about any cross-script use. Rather, i has been used for l (PaypaI a long time ago, phiIIips recently) and 0 for o, and one clever attack used a little-known diacritic and an all-Latin domain.

Impersonators mostly rely on non-letter similarity as far as I can tell. "Look, my email or web page has your bank's logo so I must be a legit representive of your bank." "My bluesky account has a photo of that person so you can trust that I really am that person." It's similarity but it's not letters.

However, I suspect that if it's possible to impersonate a wordpress user on the same site and use the impersonation for a swindle, someone will eventually do it. It's more likely to be done by copying a profile photo than by leveraging cross-script similarity.

So… I think it's reasonable to drop single_unicode_script, either nor or later. The only substantive reason to keep is that having it from the start is IMO easier than adding it later.

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 13, 2025

About UTF8 and other database encodings.

As I understand it, Wordpress 2.2 started using UTF8 for new databases and 4.2 converted old databases to UTF8, except when admins chose to override that. Overriding is possible to this day. There will be a few sites that do override.

I read ticket 62172 and it makes a great deal of sense to me. Agree about the brittleness. However, that change probably shouldn't be conflated with this.

It seems likely that a non-UTF8 database will suffer corruption in other tables, not just the usernames. That may be less serious. It won't lock anyone out. However, this problem could affect the core value of the site. When you publish a page, saving it to a non-UTF8 database can turn a dozen different currency symbols into question marks. Easy to overlook. Far from good. It seems… less urgent than usernames but worth solving.

Having written all this, I think I've formed an opion:

  • This PR should check roundtrip conversion via DB_CHARSET and refuse to store user names, passwords and email addresses if that fails.
  • If the check fails, the error message should say something like "This name/password/address cannot be stored in the database. Note: DB_CHARSET other than UTF8 is deprecated since 7.0 and will be removed in a future version of Wordpress" without naming a version number.
  • The difference between utf8 and utf8mb4 is best ignored.
  • In theory a string may be roundtrip convertible using PHP but not using MySQL. Best ignored.

Does that sound good?

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 13, 2025

Next up: The scope of this PR.

As @dmsnell asks:

For what reason are we adjusting sanitize_user() if this is about Unicode email addresses? I wouldn’t assume we have to do anything to the user objects."

The first version of the PR did indeed touch only addresses. I extended it by request because as contributor, I try to conform to whatever is expected. At this point I'd prefer to keep the agglomerated PR because:

  • The unit test and review issues are mostly similar.
  • There are test scripts, props and commits that would either be difficult to split, or that would be bulky repetition.

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 13, 2025

Email address length:

Wordpress currently supports 100 bytes reliably, more unreliably. With the change I suggest above it'll store 100 characters reliably or reject the address at entry time.

Email address length limits are a bit of a mess in general. This is partly due to the specifications. The length limit for an email address differs from the length limit for an email address on the public internet. Too subtle for most people. Then there's widespread disobedience. and even worse, there's software and SaaS vendors that judge email addresses and have their own unpublished length limits. If you enter a 70-byte email address in a newsletter signup form, you may well be rejected because a SaaS vendor classifies the address as a bot.

Both 100 bytes and 100 characters are enough that any human who tries to use an address near the limit will run into problems quickly. Much software has shorter limits (a well-known CRM has 80) and few people can spell reliably enough to type a really long address correctly from a business card.

It's sad, I wish the situation were cleaner, but 100 is a safe length in the world as it stands, and there isn't really any number that's obviously right.

FWIW, my name is in both RFC 5321 (length on net) and 5322 (length).

@dmsnell
Copy link
Copy Markdown
Member

dmsnell commented Nov 14, 2025

Does that sound good?

I’d like to hear the thoughts from some others who have more experience dealing with text encoding issues, particularly in places where we create new ones. Perhaps @mdawaffe has some or knows someone with thoughts.

A check for round-tripping seems reasonable and awkward at the same time, because making those checks everywhere would be excessive. For an email address though, maybe it’s worth the diversion.

It's sad, I wish the situation were cleaner, but 100 is a safe length in the world as it stands, and there isn't really any number that's obviously right.

We’re discussing a new reality which has no prior in WordPress and which can be dramatically different. The moment someone uses the flag of England in their email address, that’s 28 characters alone. Meaning that if someone uses the US flag in their email they are allowed 92 other ASCII characters but for the England flag it only leaves 72.

If we only allow storing Unicode emails on database tables with UTF-8 support then we could avoid this problem. Maybe that’s a reasonable feature-gate for this: don’t enable non-ASCII email address unless the {$wpdb->prefix}users table is stored as utf8mb4.

Were we to pursue that angle, it could be worth imagining a new global toggle or check like wp_supports_unicode_email_addresses() or something table-based. I know I have discussed the idea with @adamziel about checking database encodings at startup and caching them so we can reliably work with database tables regardless of the character set or collation. 🤔

I extended it #5237 (comment) because as contributor, I try to conform to whatever is expected.

Perhaps it’s my poor reading skills, but I re-read that comment multiple times and couldn’t find the request you’re referring to. Maybe @ironprogrammer left another comment on an older version of the code?

You are likely underselling your opinions here. This is a collaborative effort and we are all working together; I know from your previous comments that there are times you read more into things than people leave when they comment. It’s okay to ask questions and engage in the back and forth — if we all instantly did whatever anyone asked us then we’d have chaos, or LLM-level-slop.


I’d really love to consider this only in the context of sites with a utf8mb4 users table. I don’t know what to recommend as a way to do that, other than one obvious and necessary-but-insufficient task which is to verify that the emails are valid UTF-8.


There are a few places email addresses are read from query args: I think in the initial site installation and in some new user requests. This is not relatively important compared to the other feedback, but it will be something we want to make sure we check at some point, largely to ensure that when decoded they are valid UTF-8 and that they are properly percent-decoded. Perhaps it’s fine as-is and the email functions will verify that when passed the information (apart from the percent-encoding).

Either way, it’d be great to confirm what happens when some emails are sent with invalid query args.


If anyone would be interested in testing the same steps that others like @USERSATOSHI and @ironprogrammer have performed, but with a site setup with latin1 configured as the blog_charset and as the default database table charset, that would be helpful.

It would be valuable to test emails containing “normal” Unicode like non-latin scripts (which use only up to three bytes) as well as addresses containing characters from the supplemental planes, such as emoji (requiring four bytes).

@mdawaffe
Copy link
Copy Markdown
Contributor

This is a cool proposal. Note that I have not made the time to fully understand the conversation above. I don't think anything I've written below is particularly new: apologies if it is obviously repetitive :)

This PR should check roundtrip conversion via DB_CHARSET and refuse to store user names, passwords and email addresses if that fails.

wpdb->strip_invalid_text() (and friends) does something similar:

  • allows anything in latin1 or binary columns/tables.
  • allows 3-byte UTF-8 in utf8(mb3) columns/tables.
  • allows all UTF-8 in utf8mb4 columns/tables.
  • Does an actual DB query for other encodings.

As you'd imagine from the method name, it strips characters (and optionally truncates the string) rather than bailing. I think, though, you could do a clever SELECT statement that would check charsets against the correct columns/tables, check if the result is different than the input, then bail as necessary before doing the UPDATE/INSERT.

I don't think that's necessary, though, since I think we should instead only allow non-ASCII email addresses in utf8mb4 tables: no roundtrip necessary. Just reject non-valid UTF-8 byte sequences in PHP and check column(/connection?) charsets. Any other DB setups are stuck with ASCII email addresses.

The difference between utf8 and utf8mb4 is best ignored.

I don't think we can ignore the difference. We should require utf8mb4 (supposing we take the direction I mentioned in my previous paragraph).

I don't have a strong opinion about limiting strings to having characters from only one script (plus ASCII). On small sites, it likely doesn't matter at all. On large sites (like WordPress.com or sites where WordPress+bbPress is used for public forums) with user registration available to anyone, spoofing is likely a big concern. I agree that uses_single_unicode_script() is easy to add now, easy to stop using later, and hard to add later, so it might be worth just doing now and reevaluating later. Note that (as you all know) domains are a special case. Many TLDs do not allow IDNs, and some only allow certain scripts. Email addresses contain domains :) I'm not sure if we should care about individual TLD limits. We probably shouldn't care.

I do have one strong opinion: non-WordPress code should be able to make "raw" SELECTs to the database and retrieve the correct username/email address. That is, we should not write "corrupted" or otherwise transformed data to these columns even if WordPress correctly uncorrupts/untransforms the values when it read them. That means no charset shenanigans or hacky encodings (e.g., Base64).

For usernames, another (non-?)option would be to allow any characters, but to reject new usernames that are too "close" to existing ones. I don't know how to do that efficiently without adding a new (indexed) column to the users table, which I think is a bad idea.

Fly-by suggestions:

  • start with email addresses only. I'm reasonably confident that non-ASCII usernames will break something, but I have no clue what :)
  • consider making non-ASCII values opt-in (with a WP filter), so we can get real world experience with sites that want it without altering the behavior of sites that don't care. If it all works: make it opt-out in some future version. (I don't know if we ever do that sort of thing. Perhaps I'm being too conservative.)

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 18, 2025

I just pushed a new revision.

I discovered that wpdb already protects against not being able to store the data. The database may corrupt something, but wpdb already tries really hard to avoid getting that far. I added an extra unit test, since locking people out is worse than most failures.

Side note: That function uses latin1 in a way that looks incompatible with selecting using ilike at first glance. I didn't look closer into that, though.

FWIW I would be unhappy with a toggle for email address support. Those toggles are trouble magnets and I strongly advise against going that way. A toggle for username support is another thing.

I know a data set that can be used for username similarity detection… but that's an unpleasant slippery slope, with much scope for pedantry about details. Some of the decisions in the data set puzzled me. Best avoided, I'd say.

I've no opinion on restricting this to UTF8 databases. If WP were my baby, I'd announce that support for other databases will be sunset in the middlish future and thereby make the smaller question go away, along with the risk caused by brittle code.

I didn't have time to rebase, I'm afraid, but I wanted to get this pushed before I go on a trip tomorrow morning.

@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from 4498e43 to 3dc24ca Compare November 19, 2025 08:30
Copy link
Copy Markdown
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my comment is mostly irrelevant to the bigger picture in this PR, but I saw a thing so I figured I’d point it out to not lose it later

Comment on lines +90 to +100
// Email addresses: Allow so long as the database can store them. This
// affects all addresses, including those entered into contact forms.
if ( 'utf8mb4' !== $wpdb->charset ) {
add_filter( 'sanitize_email', 'wp_ascii_without_controls' );
}

// Usernames: Allow if the database can store them. This might be a
// setting instead, so that a site can restrict its own users to ASCII.
if ( 'utf8mb4' !== $wpdb->charset ) {
add_filter( 'sanitize_user', 'wp_ascii_without_controls' );
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Email addresses: Allow so long as the database can store them. This
// affects all addresses, including those entered into contact forms.
if ( 'utf8mb4' !== $wpdb->charset ) {
add_filter( 'sanitize_email', 'wp_ascii_without_controls' );
}
// Usernames: Allow if the database can store them. This might be a
// setting instead, so that a site can restrict its own users to ASCII.
if ( 'utf8mb4' !== $wpdb->charset ) {
add_filter( 'sanitize_user', 'wp_ascii_without_controls' );
}
if ( 'utf8mb4' !== $wpdb->charset ) {
/*
* Email addresses: Allow so long as the database can store them. This
* affects all addresses, including those entered into contact forms.
*/
add_filter( 'sanitize_email', 'wp_ascii_without_controls' );
/*
* Usernames: Allow if the database can store them. This might be a
* setting instead, so that a site can restrict its own users to ASCII.
*/
add_filter( 'sanitize_user', 'wp_ascii_without_controls' );
}

Remove the duplicate conditions and correct the multi-line comment format.

Copy link
Copy Markdown
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it’s been a while since I’ve reviewed this and I don’t have the time to give that it truly deserves, but I’ll try and rehash some of my thoughts the best I can.

first of all, this is great work from @arnt — someone who understands the scope of the problem and is actively championing for its resolution. I want to comment @arnt for that effort.


but I want to back up and reach for earlier comments I left, which I think are the most-crucial questions to answer here. thankfully, one of them is largely already answered.

I discovered that wpdb already protects against not being able to store the data.

but still it seems we are making a number of implicit changes that aren’t getting enough eyes on them:

  • this goes beyond allowing non-ASCII email addresses and imposes new and arbitrary or subjective rules about what kinds of emails are allowed. there is reference to various concepts like Unicode similarity and security concerns, but those policies aren’t technically enforced.
  • there are questions about support on database encodings which cannot store UTF-8 adequately. ironically, a latin1 table can store this stuff just fine but will not play well with collation, search, and equality.

it seems like this is stuck in the middle of being comprehensive and not enough, and I think that could be the result of very diligent and honest attempts to meet the review feedback.

for me personally I could imagine it being easier to analyze at either extreme:

  • do almost nothing other than open up valid-UTF-8 emails.
  • handle all of the easily-foreseeable UX issues created by non-ASCII emails.

I’ll leave a comment in the Trac ticket, but I think this could use some more discussion in the weekly Dev Chat and also a post on Make to gather wider input; there are likely many parties affected by this change.

* A metal umlaut fails because validate_username is
* strict and n̈ is unfamiliar in every language.
*/
$this->assertFalse( validate_username( 'spın̈altap' ) );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually seems like the kind of username we would intentionally want to support, being a username someone might choose thinking they are clever, but not the kind I would expect from malicious attacks.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it's been used in an attack. The attacker used an accent that's never used in the victims' language, and some victims overlooked the accent. Just a few pixels, not something the eye had practised parsing, easy to overlook.

However. I'm not married to this. I added it for a reason and now you know the reason, but that doesn't make the reason persuasive in Wordpress' context.

public function test_user_corrupted_login() {
global $wpdb;
$rows = $wpdb->update( $wpdb->users, array( 'user_login' => hex2bin( 'c0' ) ), array( 'ID' => $this->author->ID ) );
$this->assertFalse( $rows );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I appreciate the extremely-helpful comment on this test, the test itself doesn’t seem to assert anything about the usernames; it only seems to be asserting that $wpdb return a failure on an update that MySQL is presumed to reject based on assumptions about the testing environment.

can we find a way to setup the test into the expected failure mode and make assertions about the code under test?

'nonascii@nonascii' => array( 'grå@grå.org', true ),
'decomposed unicode' => array( 'gr\u{0061}\u{030a}blå@grå.org', true ),
'weird but legal dots' => array( '..@example.com', true ),
);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we seem to be making assertions about antispambot() here, but we’re also feeding it 100% valid UTF-80and then asserting that the result is valid UTF-8.

is there some case we would expect to fail? how would antispambot() be expected to return non-valid-UTF-8 given valid inputs?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it broke before. The code looked a little adhoc, so I added a few moderately different test cases and made them pass, then left them in. An antispam function might be modified in a hurry... best to leave a few tests.

* Test for invalid characters.
*/
$local = preg_replace( '/[^a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.-]/', '', $local );
$local = preg_replace( '/[^a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.\x80-\xff-]/', '', $local );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like the intent behind this change is to open up all bytes for a domain other than the C0 control characters, including space, and also adding backspace/delete.

rather than accept any invalid UTF-8, would it make sense to start by rejecting invalid UTF-8 strings, and then perform a check for the rejected controls characters?

if (
	! wp_is_valid_utf8( $local ) ||
	strcspn( $local, "\x00\x01\x02...\x20\x7F" ) !== strlen( $local )
) {
	return apply_filters( ... );
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. I seem to remember considering something like this, but the function that was there at the time didn't charm me. wp_is_valid_utf8 is so much nicer.

@arnt arnt force-pushed the support-rfc6530 branch 3 times, most recently from 171e2e3 to 172067a Compare February 27, 2026 07:51
…tize_email

This adds support for the unicode address extensions in RFC 6530-3, adds
unit tests for that, extends the documentation to explain the relationship
between this code and the various specifications, and finally adds unit
tests to ensure that the documentation's description of the code remains
correct.

During testing, it became clear that antispambot() worked only for strings
using a single-byte encoding, while this uses UTF8. Fixed.

Fixes #31992.

Props SirLouen, dmsnell, tusharbharti,  mukeshpanchal27.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants