Modernizing UTF-8 support in WordPress 6.9

WordPress 6.9 brings a number of changes which modernize WordPress’ text encoding and UTF-8 handling. These improvements establish more reliable and consistent text processing across WordPress’ widely-supported environments, benefiting plugins and themes that handle international content, emoji, diacritics, and more.

TL;DR

  • UTF-8 handling in WordPress no longer depends on the running environment thanks to a new fallback pipeline written in pure PHP. This means that code which works on your development server will work everywhere it’s deployed.
  • Some legacy functions are misleading in their design and are difficult to use properly; these have been deprecated and replaced by more specific and purpose-built alternatives.
    • Prefer wp_is_valid_utf8() instead of seems_utf8().
    • Prefer wp_scrub_utf8() instead of wp_check_invalid_utf8().
    • Prefer mb_convert_encoding() instead of utf8_encode() and utf8_decode().

While humbler than some of the other exciting features in this release, this overhaul of UTF-8 support reflects WordPress’ commitment to a stable and trustworthy internationalized experience.

Read on to learn more about these changes and how to avoid common mistakes and misunderstandings around strings and text.

Feel free to skip ahead to the updates or jump to a tiny guide for working with strings.

Some historic background

It’s a special privilege to not have to think too deeply about how software represents written human language. This ease is the result of most systems standardizing around Unicode and UTF-8 (the process of representing “characters” in bytes). But it wasn’t always this simple.

In WordPress’ early days it was more common to configure computers and servers with regional single-byte encodings[1]. For example, the byte value 0xA9 (169) might display as © in Gloucester, Љ in Varna, and ฉ in Bangkok. This depended on the system’s locale, which normally defaulted in the factory to latin1[2], the dominant encoding in the English-speaking and Western European regions. How confusing!

UTF-8 is designed to represent all characters from all languages unambiguously across every platform. But there’s a catch: content created with older systems needed to be converted into UTF-8 to be interpreted correctly. Early PHP and WordPress code often assumed that text was either written in the local regional encoding or already encoded in UTF-8.

This assumption no longer holds, but it remains a source of great consternation. As WordPress and its ecosystem mature in their text-handling capabilities, understanding these historical choices is crucial to guide future improvements. This is why functions like seems_utf8() and utf8_encode() are being deprecated.

Updates in WordPress 6.9

Behavioral normalization

UTF-8 support is great on any deployment running with the latest versions of PHP, the mbstring extension, Unicode support for the PCRE engine, and on systems with modern builds of iconv. If any of these components are missing, however, WordPress’ behavior becomes more complicated and less predictable.

Historically, legacy code relied on ad-hoc checks to determine whether specific features were available, such as the following examples:

if ( 1 === @preg_match( '/^./us', $text ) )

if ( function_exists( 'iconv' ) )

These checks led to inconsistent results when full support was missing. Some would give up and return an empty string while others would resort to incomplete, inefficient, and insecure fallbacks.

The new fallback functions escape this inconsistency by relying on a single, spec-compliant UTF-8 parser called _wp_scan_utf8(). In the most-efficient way possible, this function walks through a string and reports specific byte offsets which other fallbacks then use to answer higher level questions, like “how long is this string?” or “where are the spans of invalid UTF-8 bytes in this string?”
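The scanning idea can be illustrated with a much smaller sketch. This is not the code behind _wp_scan_utf8(), whose signature and internals aren't shown in this post; example_valid_utf8_prefix_length() is a hypothetical helper that walks the bytes once and returns the offset where the first invalid sequence starts, using the well-formed byte ranges from the Unicode Standard.

```php
<?php
/**
 * Returns how many leading bytes of $bytes form valid UTF-8.
 * Hypothetical helper for illustration only; not part of WordPress.
 */
function example_valid_utf8_prefix_length( string $bytes ): int {
	$at  = 0;
	$end = strlen( $bytes );

	while ( $at < $end ) {
		$lead = ord( $bytes[ $at ] );

		// US-ASCII bytes are always complete characters.
		if ( $lead < 0x80 ) {
			++$at;
			continue;
		}

		// Choose the number of continuation bytes and the allowed range of the
		// first one. The narrowed ranges exclude overlong forms, surrogate
		// halves, and code points above U+10FFFF.
		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			[ $extra, $min, $max ] = [ 1, 0x80, 0xBF ];
		} elseif ( 0xE0 === $lead ) {
			[ $extra, $min, $max ] = [ 2, 0xA0, 0xBF ];
		} elseif ( 0xED === $lead ) {
			[ $extra, $min, $max ] = [ 2, 0x80, 0x9F ];
		} elseif ( $lead >= 0xE1 && $lead <= 0xEF ) {
			[ $extra, $min, $max ] = [ 2, 0x80, 0xBF ];
		} elseif ( 0xF0 === $lead ) {
			[ $extra, $min, $max ] = [ 3, 0x90, 0xBF ];
		} elseif ( $lead >= 0xF1 && $lead <= 0xF3 ) {
			[ $extra, $min, $max ] = [ 3, 0x80, 0xBF ];
		} elseif ( 0xF4 === $lead ) {
			[ $extra, $min, $max ] = [ 3, 0x80, 0x8F ];
		} else {
			// Stray continuation byte, or a byte that never appears in UTF-8.
			return $at;
		}

		// A sequence truncated by the end of the string is invalid.
		if ( $at + $extra >= $end ) {
			return $at;
		}

		for ( $i = 1; $i <= $extra; $i++ ) {
			$byte = ord( $bytes[ $at + $i ] );
			$low  = 1 === $i ? $min : 0x80;
			$high = 1 === $i ? $max : 0xBF;

			if ( $byte < $low || $byte > $high ) {
				return $at;
			}
		}

		$at += $extra + 1;
	}

	return $at;
}
```

A string is entirely valid when the returned offset equals strlen() of the input. Higher-level answers, such as counting characters or locating invalid spans, can be built on the same single pass.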

Inevitably, any solution implemented in pure PHP code will carry additional performance costs[3], but this new code only runs when proper UTF‑8 support is missing. In practice, this affects less than 0.5% of all installations. In those rare cases, the new implementation is more efficient than the older fallbacks and is largely non-allocating.

Fortunately, where proper support is available[4], WordPress automatically leverages the mbstring extension for optimal performance.

Function deprecation

Some functions have outlived their useful lives while others have always been too error-prone. It’s important to acknowledge this and move on toward more reliable alternatives.

seems_utf8()

This function seems simple enough, but it fails to define what “seems” means and it doesn’t actually check for proper UTF-8 encoding. It attempts to guess whether the bytes in a string are unlikely to be any other encoding than UTF-8. While it performs this task reasonably (though not perfectly[5]), code that calls it is typically asking a dubious question:

Is this text already UTF‑8, or should it be converted into UTF‑8?

The idea underpinning this question is flawed, assuming that the string’s encoding matches the server’s default or configured encoding. It’s unsound and risky to arbitrarily convert text from blog_charset or any system default encoding into UTF-8[6] (or into any other encoding for that matter).

seems_utf8() answers the wrong question and should be avoided in favor of functions answering more appropriate questions.

wp_check_invalid_utf8()

This is an unstable function with unclear goals. Its intended purpose is to return a “safe” version of its input for use in UTF-8 contexts. Unfortunately it may refuse, give up, or fail in surprising ways.

It offers to strip out invalid bytes from a UTF-8 input (which is dangerous and should never be done), but if requested to do so will accidentally return false — and it has done that since its inception. Starting with WordPress 6.9, if stripping is requested, it will scrub the spans of invalid bytes instead — see below.

wp_check_invalid_utf8() is fraught and should be avoided in favor of more specific sanitizers.

utf8_encode() and utf8_decode()

These functions have been deprecated in PHP 8.2 and will be removed in PHP 9.0. They are quintessential examples of the broken mentality that a string is either in UTF‑8 already or it should be converted into it from the locally-configured regional encoding; and for these functions, that regional encoding is assumed to be latin1 only.

Their names suggest that more might be going on, that they have some magical encoding-awareness, but offer no hints of just how limited and narrow they are.

To prevent crashes during PHP upgrades, WordPress will conditionally polyfill these functions. However, they are disappearing at large and these polyfills are temporary.

utf8_encode() and utf8_decode() should be avoided in favor of the more-explicitly-named mb_convert_encoding().
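As a minimal sketch of that migration, assuming the mbstring extension is available and that the source text is known for certain to be latin1 (ISO-8859-1), the two calls below name both encodings explicitly. The sample string is illustrative.

```php
<?php
// "café" as latin1 (ISO-8859-1) bytes: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// Explicit replacement for utf8_encode(): latin1 in, UTF-8 out.
$utf8 = mb_convert_encoding( $latin1, 'UTF-8', 'ISO-8859-1' );

// Explicit replacement for utf8_decode(): UTF-8 in, latin1 out.
$round_trip = mb_convert_encoding( $utf8, 'ISO-8859-1', 'UTF-8' );

// In UTF-8 the é takes two bytes, 0xC3 0xA9.
$utf8_hex = bin2hex( $utf8 );
```

Spelling out the source encoding makes the assumption visible in the code: if the input is not actually latin1, the mistake is at least easy to spot in review.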

New functions and fallbacks

wp_is_valid_utf8()

This function does exactly what its name suggests: it reports a string as valid only if it consists entirely of valid UTF‑8 byte sequences.

Note: A string can still validate as UTF‑8 even if it comes from a non‑UTF‑8 system, particularly if the given string contains only ASCII characters, since most encodings are compatible in the ASCII range. This distinction matters when writing strings back into those non‑UTF‑8 systems since they may not be able to read or store UTF‑8 bytes which have been added to them.
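The note above can be made concrete with a small sketch. So that it runs outside WordPress, it uses example_is_valid_utf8(), a hypothetical stand-in built on PCRE's UTF-8 mode; in WordPress 6.9 and later, wp_is_valid_utf8() answers the same question.

```php
<?php
// Hypothetical stand-in for wp_is_valid_utf8() so the sketch runs outside
// WordPress. PCRE's "u" modifier refuses subjects that aren't valid UTF-8.
function example_is_valid_utf8( string $bytes ): bool {
	return 1 === preg_match( '//u', $bytes );
}

$ascii_from_latin1  = 'Hello, world'; // Identical bytes in latin1, cp-1252, and UTF-8.
$latin1_with_accent = "Caf\xE9";      // "Café" in latin1; a lone 0xE9 is not UTF-8.
$utf8_with_accent   = "Caf\xC3\xA9";  // "Café" in UTF-8.

$results = array(
	'ascii'  => example_is_valid_utf8( $ascii_from_latin1 ),
	'latin1' => example_is_valid_utf8( $latin1_with_accent ),
	'utf8'   => example_is_valid_utf8( $utf8_with_accent ),
);
```

The ASCII string validates even though it may have come from a latin1 system, so a passing check says nothing about whether that system can accept non-ASCII UTF-8 written back to it.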

wp_scrub_utf8()

Many UTF‑8 strings contain sequences of invalid bytes. These often appear due to improper truncation or slicing[7], malicious tampering, or merging with non-UTF‑8 content (e.g. a latin1 blog post inside an RSS feed).

In many cases, these invalid sequences are harmless. For instance, browsers display them as the funny question mark �[8]. But in other cases, such as with XML, a single invalid byte can break the entire document.

The key understanding is that software cannot reliably guess what those invalid bytes should have been, and that removing them invites more risk than leaving them untouched. It’s never safe to remove invalid sequences because doing so risks merging previously separate valid segments into dangerous combinations.

Therefore, when validity is required, the sequences of invalid bytes should be replaced — not removed — by the funny question mark or some other safe and suitable placeholder.

wp_scrub_utf8() performs this substitution securely. However, note that, due to obscure security reasons, the result may contain multiple � characters in a row.
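The risk of removal can be sketched as follows. The input is a contrived example, the stripping step is simulated with str_replace(), and example_scrub_utf8() is a hypothetical stand-in for wp_scrub_utf8() that assumes the mbstring extension is available.

```php
<?php
// Hypothetical stand-in for wp_scrub_utf8(), assuming the mbstring extension.
// mb_scrub() uses mbstring's substitute character, which defaults to "?", so
// it is switched to U+FFFD for the call and restored afterwards.
function example_scrub_utf8( string $bytes ): string {
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $bytes, 'UTF-8' );
	mb_substitute_character( $previous );

	return $scrubbed;
}

// A filter searching for "<script" finds nothing here, because an invalid
// byte splits the tag name in two.
$input = "<scr\xFFipt>alert(1)</scr\xFFipt>";

// Simulated naive stripping: deleting the invalid byte joins the two halves.
$stripped = str_replace( "\xFF", '', $input );

// Scrubbing: the halves stay separated by the replacement character.
$scrubbed = example_scrub_utf8( $input );

$stripped_has_script = false !== strpos( $stripped, '<script' );
$scrubbed_has_script = false !== strpos( $scrubbed, '<script' );
```

After stripping, the string contains a script tag that was not present in the bytes a filter inspected. After scrubbing, the string is valid UTF-8 and the two halves remain apart.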

Working with strings in various encodings

In general, the strings in PHP, WordPress, and its plugins are all UTF-8 and it’s reasonable to assume this. There are times, however, when it’s necessary to acknowledge that a string might come from some other encoding. This is particularly relevant when working with legacy systems or in sites with a non-UTF-8 value for DB_CHARSET. Knowing whether the assumption is sound can depend on the situation so it’s helpful to always remember that strings in PHP do not contain characters, but rather bytes which encode particular characters.
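The difference between bytes and characters shows up as soon as a string contains anything beyond US-ASCII. The sketch below uses an illustrative word and assumes the mb_* functions are available, either from the mbstring extension or from the fallbacks WordPress provides.

```php
<?php
$word = "na\u{EF}ve"; // "naïve": the ï occupies two bytes in UTF-8.

$byte_count = strlen( $word );             // Counts bytes.
$char_count = mb_strlen( $word, 'UTF-8' ); // Counts code points.

// Byte-oriented slicing can cut a character in half, leaving invalid UTF-8.
$byte_slice = substr( $word, 0, 3 );

// Character-aware slicing keeps the whole character.
$char_slice = mb_substr( $word, 0, 3, 'UTF-8' );

$byte_slice_is_valid = 1 === preg_match( '//u', $byte_slice );
$char_slice_is_valid = 1 === preg_match( '//u', $char_slice );
```

Here the byte slice ends with the first half of the ï, which is exactly the kind of truncation that produces invalid sequences later on.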

How do I determine if a provided string is encoded as UTF-8?

While it’s theoretically possible for non-UTF-8 strings to validate as UTF-8, it’s extremely unlikely. The check is therefore easy.

$is_valid_utf8 = wp_is_valid_utf8( $string );

How do I determine whether to encode a string into UTF-8?

Never guess a string’s encoding. Unfortunately, the value of blog_charset doesn’t reveal the encoding of any particular string[9]. Unless everything is using UTF‑8, assuming an actual encoding is almost certainly wrong.

If the encoding of a string is unknown, reject it and request proper UTF‑8 input or request that the known encoding is transmitted alongside the text.

if ( ! wp_is_valid_utf8( $submitted_comment ) ) {
    return new WP_Error( 'bad_request', 'content MUST be encoded in UTF-8' );
}

This is no small matter. Even with <meta charset="UTF-8"> on a web page, a browser might submit forms in different encodings[10], which is why it’s important to add the accept-charset="UTF-8" HTML attribute to every <form> tag.

Content originates from countless sources: legitimate readers and visitors, various databases, files on different systems, source code, HTTP calls, and more.

If it’s not UTF-8 and its encoding isn’t 100% certain, don’t touch it!

How do I sanitize UTF-8 strings?

“Sanitizing” is ambiguous and it’s helpful to avoid using the term, just as “safe” is ambiguous. Without knowing the role a string plays and in what context, there is no universal method for sanitization.

In cases where a valid UTF‑8 string is required, it’s best to scrub it of any spans of invalid bytes. For example, when “smart quotes” aren’t properly converted from cp-1252 into UTF-8, they leave invalid bytes, resulting in can�t instead of can’t. This is the result of scrubbing[11].
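The smart-quote scenario can be reproduced in a few lines. This is a sketch assuming the mbstring extension and a source string that is known to be cp-1252; the variable names are illustrative. It also shows the opposite mistake, where text that is already UTF-8 is converted a second time.

```php
<?php
$cp1252 = "can\x92t"; // 0x92 is the right single quotation mark in cp-1252.

// The raw cp-1252 bytes are not valid UTF-8 on their own.
$raw_is_valid = 1 === preg_match( '//u', $cp1252 );

// Correct: the source encoding is known, so convert from it explicitly.
$converted = mb_convert_encoding( $cp1252, 'UTF-8', 'Windows-1252' );

// Mistake: treating text that is already UTF-8 as cp-1252 and converting it
// again. The result is valid UTF-8, but the text is damaged.
$double_encoded = mb_convert_encoding( $converted, 'UTF-8', 'Windows-1252' );

$double_is_valid = 1 === preg_match( '//u', $double_encoded );
```

The double-encoded string passes validation, so no validity check or scrubbing step can detect it. Only knowing the true source encoding prevents this kind of damage.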

It’s best to avoid working with invalid UTF-8, but if you must, pass it through wp_scrub_utf8() to replace the invalid sequences with safe placeholders.

//              latin1 bytes   surrogate pairs              truncated bytes          
$invalid = "the \x93gig\x94 is \xED\xA9\xBD\xED\xB1\x86 up. \xE2\x80";
echo wp_scrub_utf8( $invalid, 'UTF-8' );
// the �gig� is ������ up. �

Keep in mind too that ensuring a safe encoding of a string doesn’t make it safe in all contexts and doesn’t replace the need to ensure data is safe for the context in which it’s used.

Changelog

Enhancements

  • The default DB_CHARSET in config-sample.php is now utf8mb4. [Core-48285]
  • wp_is_valid_utf8() indicates if a string is valid UTF‑8. [Core-38044, #9317, #9716]
  • sanitize_file_name() now calls _wp_can_use_pcre_u() directly instead of duplicating most of its logic. [Core-63863, #9678]
  • _wp_can_use_pcre_u() no longer relies on the error-suppression operator. [Core-63865]
  • _wp_scan_utf8() provides a spec-compliant fallback for UTF‑8 functionality. [Core-63863, #9830]
  • wp_scrub_utf8() replaces wp_check_invalid_utf8() by replacing sequences of invalid UTF‑8 bytes with the Unicode Replacement character. [Core-63863, #9498]
  • wp_has_noncharacters() indicates if a string contains Unicode noncharacters, which are valid code points but forbidden from certain protocols. [Core-63863, #9827]

Bug fixes

  • A vendored JavaScript file and a PHP unit test file were stored as cp1252 and have been re-encoded into UTF‑8. [Core-63815, #9425]
  • The fallback for mb_strlen() now returns the correct count when given invalid UTF‑8 inputs. It is now more efficient and non-allocating. [Core-63863, #9828]
  • The fallback for mb_substr() now returns the correct substrings when given invalid UTF‑8 inputs. It is now more efficient and minimally allocates. [Core-63863, #9829]

Other

  • Fallback support for UTF‑8 loads early before compat.php in its own compat-utf8.php module. [Core-63863, #9825]
  • seems_utf8() is now deprecated. Consider using wp_is_valid_utf8() instead and rejecting non-UTF-8 input if the source encoding is not certain.
  • The HTML API’s set_attribute() method now calls wp_has_noncharacters() instead of running a custom PCRE match. [Core-63863, #9798]
  • When missing, WordPress provides deprecated polyfills for utf8_encode() and utf8_decode(). Refactor code calling these functions which are disappearing from PHP. [Core-55603, Core-63863, #10011]

Acknowledgements

Props to @jorbin and @jonsurrell for reviewing this post.

  1. It’s a simplification to say that early computers used single-byte encodings. Many systems, particularly those supporting hundreds or thousands of characters, relied on shifting encoding schemes. These were still single-byte encodings, but each byte could represent different characters depending on which set of characters was active. ↩︎
  2. Most computers were sold with Microsoft’s cp-1252 encoding, which is nearly identical to latin1. It claimed 27 additional characters from an unassigned region which was historically used for non-printable control codes. Today, ISO-8859-1, latin1, and cp-1252 are often used interchangeably as synonyms despite their nuanced differences. Alas, this is the source of the smart quotes gremlins that took over the Internet. ↩︎
  3. Surprisingly, the new UTF-8 pipeline runs at near-native speeds for inputs which are mostly US-ASCII characters. ↩︎
  4. The mbstring extension has had its own problems with UTF-8. Where necessary, WordPress runs its fallback code, even when the extension is available, to prevent these issues from causing unexpected behavior. ↩︎
  5. For strings containing spans of invalid UTF-8 bytes, but which are nominally UTF-8, seems_utf8() rejects some while accepting others depending on the kind of invalid bytes. It properly accepts all valid UTF-8 strings. ↩︎
  6. Conversion of text should ideally occur only when the source encoding is known through some explicit indicator. Guessing an encoding will lead to data corruption and may introduce security exploits. ↩︎
  7. Since UTF-8 is a multi-byte encoding format, it’s possible to split a string inside a character. When this happens, the character is corrupted and cannot be recovered. This is the fundamental meaning of invalid in the context of UTF‑8 strings. ↩︎
  8. The “Unicode Replacement Character” is a real character, U+FFFD. It serves as a placeholder for sequences of invalid text. ↩︎
  9. The blog_charset was most useful in the past when a database stored its text using the server’s configured regional encoding and PHP used the same system encoding when handling strings. Today, databases often perform on-the-fly character-set conversion, so the bytes of text coming out of it may not match what’s stored internally. ↩︎
  10. Common reasons browsers send the “wrong encoding” include: the text encoding for the page is set as a manual override; an HTTP header specifies a different charset; the <meta> tag ends after the first 1,024 bytes in the HTML document, whereas browsers must limit their search before that offset; and maliciously-crafted query parameters or POST bodies containing non‑UTF‑8 or invalid UTF‑8 bytes. ↩︎
  11. Failure to convert into UTF‑8 produces a different kind of corruption than when a UTF‑8 string is mistakenly considered cp-1252 and re-encoded or double-encoded into UTF‑8. This produces a valid UTF‑8 string but with damaged text: can’t becomes can’t. ↩︎

#6-9, #dev-notes, #dev-notes-6-9, #formatting, #utf8