wp_check_invalid_utf8( string $text, bool $strip = false ): string

Checks for invalid UTF-8 in a string.

Description

Note! This function only performs its work if the blog_charset is set to UTF-8. For all other values it returns the input text unchanged.

Note! Unless requested, this returns an empty string if the input contains any sequences of invalid UTF-8. To replace invalid byte sequences, pass true as the optional $strip parameter.

Consider using wp_scrub_utf8() instead, which does not depend on the value of blog_charset.

Example:

// The `blog_charset` is `latin1`, so this returns the input unchanged.
$every_possible_input === wp_check_invalid_utf8( $every_possible_input );

// Valid strings come through unchanged.
'test' === wp_check_invalid_utf8( 'test' );

$invalid = "the byte \xC0 is never allowed in a UTF-8 string.";

// Invalid strings are rejected outright.
'' === wp_check_invalid_utf8( $invalid );

// “Stripping” invalid sequences produces the replacement character instead.
"the byte \u{FFFD} is never allowed in a UTF-8 string." === wp_check_invalid_utf8( $invalid, true );
'the byte � is never allowed in a UTF-8 string.' === wp_check_invalid_utf8( $invalid, true );

Parameters

$text string required: String which is expected to be encoded as UTF-8 unless blog_charset is another encoding.
$strip bool optional: Whether to replace invalid sequences of bytes with the Unicode replacement character (U+FFFD �). The default, false, returns an empty string for invalid UTF-8 inputs.
Default: false

Return

string The checked text.
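
With $strip left at false, an empty return value is ambiguous: it is what the function returns both for empty input and for rejected input. A caller that needs to tell the two apart can compare the result against the original value. The helper below is a minimal sketch of that comparison; `my_plugin_was_rejected()` is a hypothetical name for illustration, not a WordPress function.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): reports whether a
 * non-stripping check turned non-empty input into an empty string.
 *
 * @param mixed  $input   The value originally passed to the check.
 * @param string $checked The value the check returned.
 */
function my_plugin_was_rejected( $input, string $checked ): bool {
	// The check casts its input to a string, so compare the same way.
	return '' !== (string) $input && '' === $checked;
}

// Intended use inside WordPress (shown as a comment, not executed here):
// $checked = wp_check_invalid_utf8( $raw );
// if ( my_plugin_was_rejected( $raw, $checked ) ) {
//     // Handle the rejected input, for example by logging it.
// }
```

This only makes sense when $strip is false; with $strip set to true, invalid input comes back non-empty with replacement characters in place of the invalid bytes.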

Source

function wp_check_invalid_utf8( $text, $strip = false ) {
	$text = (string) $text;

	if ( 0 === strlen( $text ) ) {
		return '';
	}

	// Store the site charset as a static to avoid multiple calls to get_option().
	static $is_utf8 = null;
	if ( ! isset( $is_utf8 ) ) {
		$is_utf8 = is_utf8_charset();
	}

	if ( ! $is_utf8 || wp_is_valid_utf8( $text ) ) {
		return $text;
	}

	return $strip
		? wp_scrub_utf8( $text )
		: '';
}
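
To experiment with the same control flow outside WordPress, the following is a rough standalone approximation. It is a sketch under stated assumptions, not the WordPress implementation: `sketch_check_invalid_utf8()` and its `$is_utf8` parameter are hypothetical, the mbstring extension is assumed to be loaded, and `mb_check_encoding()` and `mb_scrub()` stand in for wp_is_valid_utf8() and wp_scrub_utf8().

```php
<?php
/**
 * Hypothetical stand-in for illustration only; not part of WordPress.
 * The $is_utf8 flag replaces the site charset lookup done by is_utf8_charset().
 */
function sketch_check_invalid_utf8( string $text, bool $strip = false, bool $is_utf8 = true ): string {
	if ( '' === $text ) {
		return '';
	}

	// Non-UTF-8 sites and valid UTF-8 input both pass through unchanged.
	if ( ! $is_utf8 || mb_check_encoding( $text, 'UTF-8' ) ) {
		return $text;
	}

	// Without stripping, invalid input is rejected outright.
	if ( ! $strip ) {
		return '';
	}

	// Substitute U+FFFD for invalid bytes, then restore the previous setting.
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $text, 'UTF-8' );
	mb_substitute_character( $previous );

	return $scrubbed;
}
```

For longer ill-formed sequences, `mb_scrub()` may emit a different number of replacement characters than wp_scrub_utf8() does, so treat this as a way to explore the branching rather than as a drop-in replacement.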

Uses	Description
wp_is_valid_utf8() `wp-includes/utf8.php`	Determines if a given byte string represents a valid UTF-8 encoding.
wp_scrub_utf8() `wp-includes/utf8.php`	Replaces ill-formed UTF-8 byte sequences with the Unicode Replacement Character.
is_utf8_charset() `wp-includes/functions.php`	Indicates if a given slug for a character set represents the UTF-8 text encoding. If not provided, examines the current blog’s charset.

Used by	Description
esc_xml() `wp-includes/formatting.php`	Escaping for XML blocks.
_sanitize_text_fields() `wp-includes/formatting.php`	Internal helper function to sanitize a string from user input or from the database.
esc_js() `wp-includes/formatting.php`	Escapes single quotes, `"`, `<`, `>`, `&`, and fixes line endings.
esc_html() `wp-includes/formatting.php`	Escaping for HTML blocks.
esc_attr() `wp-includes/formatting.php`	Escaping for HTML attributes.

Changelog

Version	Description
6.9.0	Stripping replaces invalid byte sequences with the Unicode replacement character U+FFFD (�).
2.8.0	Introduced.
