A number of changes in WordPress 6.9 are coming which modernize WordPress’ text encoding and UTF-8 handling. These improvements establish more reliable and consistent text processing across WordPress’ widely-supported environments, benefiting plugins and themes that handle international content, emoji, diacritics, and more.
TL;DR
- UTF-8 handling in WordPress no longer depends on the running environment thanks to a new fallback pipeline written in pure PHP. This means that code which works on your development server will work everywhere it’s deployed.
- Some legacy functions are misleading in their design and are difficult to use properly; these have been deprecated and replaced by more specific and purpose-built alternatives.
- Prefer wp_is_valid_utf8() instead of seems_utf8().
- Prefer wp_scrub_utf8() instead of wp_check_invalid_utf8().
- Prefer mb_convert_encoding() instead of utf8_encode() and utf8_decode().
While humbler than some of the other exciting features in this release, this overhaul of UTF-8 support reflects WordPress’ commitment to a stable and trustworthy internationalized experience.
Read on to learn more about these changes and how to avoid common mistakes and misunderstandings around strings and text.
Feel free to skip ahead to the updates or jump to a tiny guide for working with strings.
Some historic background
It’s a special privilege to not have to think too deeply about how software represents written human language. This ease is the result of most systems standardizing around Unicode and UTF-8 (the process of representing “characters” in bytes). But it wasn’t always this simple.
In WordPress’ early days it was more common to configure computers and servers with regional single-byte encodings1. For example, the byte value 0xA9 (169) might display as © in Gloucester, Љ in Varna, and ฉ in Bangkok. This depended on the system’s locale, which normally defaulted in the factory to latin12, the dominant encoding in the English-speaking and Western European regions. How confusing!
UTF-8 is designed to represent all characters from all languages unambiguously across every platform. But there’s a catch: content created with older systems needed to be converted into UTF-8 to be interpreted correctly. Early PHP and WordPress code often assumed that text was either written in the local regional encoding or already encoded in UTF-8.
This assumption no longer holds, but it remains a source of great consternation. As WordPress and its ecosystem mature in their text-handling capabilities, understanding these historical choices is crucial to guide future improvements. This is why functions like seems_utf8() and utf8_encode() are being deprecated.
Updates in WordPress 6.9
Behavioral normalization
UTF-8 support is great on any deployment running with the latest versions of PHP, the mbstring extension, Unicode support for the PCRE engine, and on systems with modern builds of iconv. If any of these components are missing, however, WordPress’ behavior becomes more complicated and less predictable.
Historically, legacy code relied on ad-hoc checks to determine whether specific features were available, such as the following examples:
if ( 1 === @preg_match( '/^./us', $text ) )
if ( function_exists( 'iconv' ) )
These checks led to inconsistent results when full support was missing. Some would give up and return an empty string while others would resort to incomplete, inefficient, and insecure fallbacks.
The new fallback functions escape this inconsistency by relying on a single, spec-compliant UTF-8 parser called _wp_scan_utf8(). In the most-efficient way possible, this function walks through a string and reports specific byte offsets which other fallbacks then use to answer higher level questions, like how long is this string? or where are the spans of invalid UTF-8 bytes in this string?
Inevitably, any solution implemented in pure PHP code will carry additional performance costs3; but this new code only runs when proper UTF‑8 support is missing. In practice, this affects less than 0.5% of all installations. In those rare cases, the new implementation is more efficient than the older fallbacks and is largely non-allocating.
Fortunately, where proper support is available4, WordPress automatically leverages the mbstring extension for optimal performance.
Function deprecation
Some functions have outlived their useful lives while others have always been too error-prone. It’s important to acknowledge this and move on toward more reliable alternatives.
seems_utf8()
This function seems simple enough, but it fails to define what “seems” means and it doesn’t actually check for proper UTF-8 encoding. It attempts to guess whether the bytes in a string are unlikely to be any other encoding than UTF-8. While it performs this task reasonably (though not perfectly5), code that calls it is typically asking a dubious question:
Is this text already UTF‑8, or should it be converted into UTF‑8?
The idea underpinning this question is flawed, assuming that the string’s encoding matches the server’s default or configured encoding. It’s unsound and risky to arbitrarily convert text from blog_charset or any system default encoding into UTF-86 (or into any other encoding for that matter).
seems_utf8() answers the wrong question and should be avoided in favor of functions answering more appropriate questions.
wp_check_invalid_utf8()
This is an unstable function with unclear goals. Its intended purpose is to return a “safe” version of its input for use in UTF-8 contexts. Unfortunately it may refuse, give up, or fail in surprising ways.
It offers to strip out invalid bytes from a UTF-8 input (which is dangerous and should never be done), but if requested to do so will accidentally return false — and it has done that since its inception. Starting with WordPress 6.9, if stripping is requested, it will scrub the spans of invalid bytes instead — see below.
wp_check_invalid_utf8() is fraught and should be avoided in favor of more specific sanitizers.
utf8_encode() and utf8_decode()
These functions have been deprecated in PHP 8.2 and will be removed in PHP 9.0. They are quintessential examples of the broken mentality that a string is either in UTF‑8 already or it should be converted into it from the locally-configured regional encoding; and for these functions, that regional encoding is assumed to be latin1 only.
Their names suggest that more might be going on, that they have some magical encoding-awareness, but offer no hints of just how limited and narrow they are.
To prevent crashes during PHP upgrades, WordPress will conditionally polyfill these functions. However, they are disappearing at large and these polyfills are temporary.
utf8_encode() and utf8_decode() should be avoided in favor of the more-explicitly-named mb_convert_encoding().
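As a rough illustration of that replacement, the sketch below converts between latin1 and UTF-8 with the encodings named explicitly. It assumes the mbstring extension is available; the variable names and the sample string are made up for this example and do not come from WordPress itself.

```php
<?php
// "café" as ISO-8859-1 (latin1) bytes: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// Instead of the deprecated utf8_encode( $latin1 ):
// name both the target and the source encoding explicitly.
$utf8 = mb_convert_encoding( $latin1, 'UTF-8', 'ISO-8859-1' );

// Instead of the deprecated utf8_decode( $utf8 ):
$back = mb_convert_encoding( $utf8, 'ISO-8859-1', 'UTF-8' );

// In UTF-8 the é becomes the two bytes 0xC3 0xA9.
var_dump( "caf\xC3\xA9" === $utf8 );
var_dump( $latin1 === $back );
```

Spelling out 'ISO-8859-1' makes the latin1 assumption visible at the call site, which is exactly what the old function names concealed.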
New functions and fallbacks
wp_is_valid_utf8()
This function does exactly what its name suggests: it indicates that a string is valid only if it consists entirely of valid UTF‑8 byte sequences.
Note: A string can still validate as UTF‑8 even if it comes from a non‑UTF‑8 system, particularly if the given string contains only ASCII characters, since most encodings are compatible in the ASCII range. This distinction matters when writing strings back into those non‑UTF‑8 systems since they may not be able to read or store UTF‑8 bytes which have been added to them.
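The note above can be seen in a small sketch. The helper example_is_valid_utf8() is hypothetical: it calls wp_is_valid_utf8() when WordPress 6.9 is loaded and otherwise falls back to mb_check_encoding() as a stand-in, so the snippet can run outside WordPress, assuming mbstring is present.

```php
<?php
/**
 * Hypothetical helper for this example only: prefers wp_is_valid_utf8()
 * when it exists, otherwise uses mbstring as a stand-in.
 */
function example_is_valid_utf8( string $bytes ): bool {
	if ( function_exists( 'wp_is_valid_utf8' ) ) {
		return wp_is_valid_utf8( $bytes );
	}

	return mb_check_encoding( $bytes, 'UTF-8' );
}

// ASCII-only text from a latin1 system is byte-for-byte valid UTF-8.
$ascii_only = 'cafe';

// The same system storing "café" produces the lone byte 0xE9.
$latin1 = "caf\xE9";

// "café" properly encoded as UTF-8.
$utf8 = "caf\xC3\xA9";

var_dump( example_is_valid_utf8( $ascii_only ) );
var_dump( example_is_valid_utf8( $latin1 ) );
var_dump( example_is_valid_utf8( $utf8 ) );
```

The first string validates even though it may have originated in a non-UTF-8 system, so a passing check says nothing about whether that system can accept UTF-8 written back to it.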
wp_scrub_utf8()
Many UTF‑8 strings contain sequences of invalid bytes. These often appear due to improper truncation or slicing7, malicious tampering, or merging with non-UTF‑8 content (e.g. a latin1 blog post inside an RSS feed).
In many cases, these invalid sequences are harmless. For instance, browsers display them as the funny question mark �8. But in other cases, such as with XML, a single invalid byte can break the entire document.
The key understanding is that software cannot reliably guess what those invalid bytes should have been, and that removing them invites more risk than leaving them untouched. It’s never safe to remove invalid sequences because doing so risks merging previously separate valid segments into dangerous combinations.
Therefore, when validity is required, the sequences of invalid bytes should be replaced — not removed — by the funny question mark or some other safe and suitable placeholder.
wp_scrub_utf8() performs this substitution securely. However, note that, due to obscure security reasons, the result may contain multiple � characters in a row.
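To make the difference between removing and replacing concrete, here is a minimal sketch using PHP's own mb_scrub() as a stand-in for wp_scrub_utf8(), so that it runs outside WordPress. It assumes mbstring is available with UTF-8 as the internal encoding (PHP's default); the input string is invented for the example, and the number of placeholders wp_scrub_utf8() emits for longer invalid spans may differ from mbstring's.

```php
<?php
// A single invalid byte (0xFF never appears in UTF-8) hides inside a tag name.
$input = "<scr\xFFipt>";

// Remember the current substitute character so it can be restored.
$previous = mb_substitute_character();

// Removing invalid bytes joins the two harmless halves into "<script>".
mb_substitute_character( 'none' );
$stripped = mb_scrub( $input, 'UTF-8' );

// Replacing them with U+FFFD keeps the halves apart.
mb_substitute_character( 0xFFFD );
$replaced = mb_scrub( $input, 'UTF-8' );

mb_substitute_character( $previous );

var_dump( $stripped );
var_dump( $replaced );
```

A filter that inspected the original bytes would not have seen a script tag, yet the stripped output contains one. The replaced output does not.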
Working with strings in various encodings
In general, the strings in PHP, WordPress, and its plugins are all UTF-8 and it’s reasonable to assume this. There are times, however, when it’s necessary to acknowledge that a string might come from some other encoding. This is particularly relevant when working with legacy systems or in sites with a non-UTF-8 value for DB_CHARSET. Knowing whether the assumption is sound can depend on the situation so it’s helpful to always remember that strings in PHP do not contain characters, but rather bytes which encode particular characters.
How do I determine if a provided string is encoded as UTF-8?
While it’s theoretically possible for non-UTF-8 strings to validate as UTF-8, it’s extremely unlikely. The check is therefore easy.
$is_valid_utf8 = wp_is_valid_utf8( $string );
How do I determine whether to encode a string into UTF-8?
Never guess a string’s encoding. Unfortunately, the value of blog_charset doesn’t reveal the encoding of any particular string9. Unless everything is using UTF‑8, assuming an actual encoding is almost certainly wrong.
If the encoding of a string is unknown, reject it and request proper UTF‑8 input or request that the known encoding is transmitted alongside the text.
if ( ! wp_is_valid_utf8( $submitted_comment ) ) {
return new WP_Error( 'bad_request', 'content MUST be encoded in UTF-8' );
}
This is no small matter. Even with <meta charset="UTF-8"> on a web page, a browser might submit forms in different encodings10, which is why it’s important to add the accept-charset="UTF-8" HTML attribute to every <form> tag.
Content originates from countless sources: legitimate readers and visitors, various databases, files on different systems, source code, HTTP calls, and more.
If it’s not UTF-8 and its encoding isn’t 100% certain, don’t touch it!
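One way to follow this rule is sketched below: text is converted only when the sender has declared its encoding, and is otherwise accepted only if it is already valid UTF-8. The function example_to_utf8() and its parameters are hypothetical and exist only for this illustration. It assumes mbstring is available and matches encoding names exactly against mb_list_encodings(), so aliases would be rejected.

```php
<?php
/**
 * Hypothetical helper: returns UTF-8 text, or null when the input
 * cannot be trusted. It never guesses an encoding.
 */
function example_to_utf8( string $bytes, ?string $declared_encoding = null ): ?string {
	// No declared encoding: accept only what is already valid UTF-8.
	if ( null === $declared_encoding ) {
		return mb_check_encoding( $bytes, 'UTF-8' ) ? $bytes : null;
	}

	// Unknown encoding name: reject rather than guess.
	if ( ! in_array( $declared_encoding, mb_list_encodings(), true ) ) {
		return null;
	}

	// The bytes must actually be valid in the declared encoding.
	if ( ! mb_check_encoding( $bytes, $declared_encoding ) ) {
		return null;
	}

	return mb_convert_encoding( $bytes, 'UTF-8', $declared_encoding );
}

var_dump( example_to_utf8( 'plain' ) );                 // Already valid UTF-8.
var_dump( example_to_utf8( "caf\xE9" ) );               // Undeclared and invalid: rejected.
var_dump( example_to_utf8( "caf\xE9", 'ISO-8859-1' ) ); // Declared: converted.
```

Returning null leaves the decision with the caller, who can respond with an error such as the WP_Error shown earlier instead of storing text of uncertain origin.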
How do I sanitize UTF-8 strings?
“Sanitizing” is ambiguous and it’s helpful to avoid using the term, just as “safe” is ambiguous. Without knowing the role a string plays and in what context, there is no universal method for sanitization.
In cases where a valid UTF‑8 string is required, it’s best to scrub it of any spans of invalid bytes. For example, when “smart quotes” aren’t properly converted from cp-1252 into UTF-8, they leave invalid bytes, resulting in can�t instead of can’t. This is the result of scrubbing11.
It’s best to avoid working with invalid UTF-8, but if you must, pass it through wp_scrub_utf8() to replace the invalid sequences with safe placeholders.
// latin1 bytes surrogate pairs truncated bytes
$invalid = "the \x93gig\x94 is \xED\xA9\xBD\xED\xB1\x86 up. \xE2\x80";
echo wp_scrub_utf8( $invalid, 'UTF-8' );
// the �gig� is ������ up. �
Keep in mind too that ensuring a safe encoding of a string doesn’t make it safe in all contexts and doesn’t replace the need to ensure data is safe for the context in which it’s used.
Changelog
Enhancements
- The default DB_CHARSET in config-sample.php is now utf8mb4. [Core-48285]
- wp_is_valid_utf8() indicates if a string is valid UTF‑8. [Core-38044, #9317, #9716]
- sanitize_file_name() now calls _wp_can_use_pcre_u() directly instead of duplicating most of its logic. [Core-63863, #9678]
- _wp_can_use_pcre_u() no longer relies on the error-suppression operator. [Core-63865]
- _wp_scan_utf8() provides a spec-compliant fallback for UTF‑8 functionality. [Core-63863, #9830]
- wp_scrub_utf8() replaces wp_check_invalid_utf8() by replacing sequences of invalid UTF‑8 bytes with the Unicode Replacement character. [Core-63863, #9498]
- wp_has_noncharacters() indicates if a string contains Unicode noncharacters, which are valid code points but forbidden from certain protocols. [Core-63863, #9827]
Bug fixes
- A vendored JavaScript file and a PHP unit test file were stored as cp1252 and have been re-encoded into UTF‑8. [Core-63815, #9425]
- The fallback for mb_strlen() now returns the correct count when given invalid UTF‑8 inputs. It is now more efficient and non-allocating. [Core-63863, #9828]
- The fallback for mb_substr() now returns the correct substrings when given invalid UTF‑8 inputs. It is now more efficient and minimally allocates. [Core-63863, #9829]
Other
- Fallback support for UTF‑8 loads early before compat.php in its own compat-utf8.php module. [Core-63863, #9825]
- seems_utf8() is now deprecated. Consider using wp_is_valid_utf8() instead and rejecting non-UTF-8 input if the source encoding is not certain.
- The HTML API’s set_attribute() method now calls wp_has_noncharacters() instead of running a custom PCRE match. [Core-63863, #9798]
- When missing, WordPress provides deprecated polyfills for utf8_encode() and utf8_decode(). Refactor code calling these functions which are disappearing from PHP. [Core-55603, Core-63863, #10011]
Acknowledgements
Props to @jorbin and @jonsurrell for reviewing this post.