Updates to the HTML API in 6.9

WordPress 6.9 brings an abundance of quiet improvements to the HTML API. Updates in this release mostly represent applications of the HTML API to existing code in Core; these updates increase WordPress’ reliability, improve its security hardening, and reduce the maintenance burden on the project.

Major Updates

WP_HTML_Processor::serialize_token() is now public.

The HTML Processor’s serialize_token() method returns a fully-normalized and well-formed representation of the currently-matched token. It was introduced in #62036 for WordPress 6.7 as a private method that does the heavy lifting when the HTML API turns “junk” inputs into equivalent well-formed outputs. For example:

$html = '5 < 8 & <tag a=v a="dup"id=di></3>bl&#97rg';
echo WP_HTML_Processor::normalize( $html );
// 5 &lt; 8 &amp; <tag a="v" id="di"><!--3-->blarg</tag>

Its value outside of WP_HTML_Processor::normalize() became evident, however, particularly in the creation of “serialization builders” [1], which make it possible to modify more of the HTML structure than the HTML Processor itself does. In typical HTML API loops, this method can be used to safely extract portions of the document:

// Extract the outerHTML of every paragraph element.
$processor = WP_HTML_Processor::create_fragment( $html );
$content   = '';
while ( $processor->next_tag( 'P' ) ) {
    $content .= $processor->serialize_token();
    $depth    = $processor->get_current_depth();
    while (
        $processor->next_token() &&
        $processor->get_current_depth() > $depth
    ) {
        $content .= $processor->serialize_token();
    }
    $content .= $processor->serialize_token();
    $content .= "\n\n";
}

WordPress understands JavaScript .dataset properties.

HTML provides a convenient mechanism tying HTML and JavaScript together through the custom data attributes on a tag. These are the attributes starting with data-, like data-wp-interactive or data-post-id, and their values are available on the corresponding Element object in JavaScript through the .dataset property:

<span data-order="Carrots please!">
    What should we order?
</span>
<script>
document.body.addEventListener(
    'click',
    event => alert( event.target.dataset.order )
);
</script>

There are endless ways this integration can be used to add a level of dynamism to a site. Unfortunately, the way these attribute names are transformed is less simple than it looks. For example, the data-wp-bind--class HTML attribute corresponds to the wpBind-Class dataset property.

To prevent confusion, WordPress 6.9 includes two new functions to map between the HTML and JavaScript names: wp_js_dataset_name() indicates what would appear on the .dataset property in a browser while wp_html_custom_data_attribute_name() indicates what name should be used in HTML to produce the .dataset property of a given name. For example:

// What would this HTML attribute name correspond to in JavaScript?
echo wp_js_dataset_name( 'data-one-two--three---four' );
// oneTwo-Three--Four

// What HTML attribute name is necessary to produce the given JavaScript name?
echo wp_html_custom_data_attribute_name( 'postId.guid' );
// data-post-id.guid
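
The examples above are consistent with the name-mapping rules the HTML specification gives for custom data attributes. The JavaScript sketch below is one way to make those rules concrete. The function names datasetName() and customDataAttributeName() are hypothetical stand-ins rather than the WordPress functions, and returning null for unsupported input is an assumption made for this illustration.

```javascript
// Illustrative sketch only: hypothetical helpers, not the WordPress implementation.

// Map an HTML attribute name to the name seen on the .dataset property.
function datasetName(attributeName) {
  // HTML attribute names are ASCII case-insensitive.
  const lower = attributeName.toLowerCase();
  if (!lower.startsWith('data-')) {
    return null; // Assumption: signal "not a custom data attribute" with null.
  }
  // A hyphen followed by an ASCII lowercase letter is removed and the letter
  // is uppercased. Any other hyphen is left in place.
  return lower
    .slice('data-'.length)
    .replace(/-([a-z])/g, (match, letter) => letter.toUpperCase());
}

// Map a .dataset property name to the HTML attribute name that produces it.
function customDataAttributeName(propertyName) {
  // A hyphen followed by a lowercase letter cannot survive the round trip,
  // so browsers refuse to set such a property.
  if (/-[a-z]/.test(propertyName)) {
    return null; // Assumption: signal the unsupported name with null.
  }
  // Each ASCII uppercase letter becomes a hyphen plus its lowercase form.
  return 'data-' + propertyName.replace(/[A-Z]/g, (letter) => '-' + letter.toLowerCase());
}

console.log(datasetName('data-wp-bind--class'));
console.log(customDataAttributeName('postId.guid'));
```

Only a hyphen immediately followed by a lowercase letter is consumed, which is why runs of hyphens lose exactly one hyphen each.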

No more hard-coding HTML string assertions in unit tests.

WordPress is full of unit tests asserting specific HTML transformations. The expected outputs for these tests are usually hard-coded and sent to $this->assertSame() to compare against the actual outputs from the code under test. Unfortunately this tends to produce a high rate of false positives because of trivialities like adding an attribute in a different order than was expected, using single-quotes around an attribute value rather than double-quotes, leaving extra whitespace or not enough, or using the mistaken self-closer on an <img> or <br> tag.

When two HTML strings produce the same result in a browser they should pass regardless of their insignificant differences. To ease the development of these kinds of tests and to reduce their false-positive rates, WordPress 6.9 introduces a new method on the WP_UnitTestCase base class: $this->assertEqualHTML().

This new test assertion verifies that two strings are equivalent representations of the same normative HTML. It compares HTML strings semantically, provides more useful output than a string comparison when the assertion fails, and is even aware of block semantics.

$this->assertEqualHTML(
  "<img src='puppy&period;jpg'   loading=lazy>",
  '<img loading="l&#97zy"src="puppy.jpg"/>'
);
 ✔︎ Is equivalent html

Time: 00:00.038, Memory: 40.00 MB

OK (1 test, 1 assertion)

This test case would pass since the arguments are two equivalent constructions of the same IMG element. With a few small changes, however, the assertion fails and succinctly highlights the differences. The addition of the block comment delimiter is for illustrative purposes only.

$this->assertEqualHTML(
    "<!-- wp:image {\"id\":5} --><img src='puppy.jpg' loading=lazy>",
    '<!-- wp:img {"id":6} --><img loading="lazy" data-priority=5 src=puppy.jpg/>'
);
  Is equivalent html
   
    HTML markup was not equivalent.
    Failed asserting that two strings are identical.
    ---·Expected
    +++·Actual
    @@ @@
    -'BLOCK["core/image"]
    +'BLOCK["core/img"]
       {
    -····"id": 5
    +····"id": 6
       }
       <img>
    +····data-priority="5"
         loading="lazy"
    -····src="puppy.jpg"
    +····src="puppy.jpg/"
     '
   
    /WordPress-develop/tests/phpunit/includes/abstract-testcase.php:1235
    /WordPress-develop/tests/phpunit/tests/html/equivalentHtmlTest.php:10
   

Time: 00:00.038, Memory: 40.00 MB

The HTML API received minor updates.

  • The Tag Processor’s constructor will now cast null to an empty string. Similarly, the static creator methods on the HTML Processor will return null instead of an instance of the WP_HTML_Processor class when they are passed null rather than a string. In each case a _doing_it_wrong() notice will alert developers that these classes expect a string input. This change stops type errors from being buried, which would otherwise lead to unexpected crashes later on, such as when calling get_updated_html().
  • When calling set_modifiable_text() on a SCRIPT element, updates are rejected if they contain <script or </script. This is a conservative measure to avoid entering the script data double escaped state, which is prone to misinterpretation.
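
As a rough illustration of the kind of check described in the second item, the JavaScript sketch below flags text containing either sequence. The function name isSafeScriptText() is hypothetical, and the case-insensitive substring match is an assumption rather than a description of how set_modifiable_text() is implemented.

```javascript
// Illustrative sketch only: a hypothetical helper, not WordPress code.

// Report whether text could be placed inside a SCRIPT element without
// containing a sequence that opens or closes a script tag.
function isSafeScriptText(text) {
  // Assumption: match case-insensitively, since HTML tag names are
  // ASCII case-insensitive.
  const lower = text.toLowerCase();
  return !lower.includes('<script') && !lower.includes('</script');
}

console.log(isSafeScriptText('console.log("hello");'));
console.log(isSafeScriptText('document.write("<SCRIPT>");'));
```

A check like this is deliberately conservative: it also rejects harmless text that merely mentions one of the sequences, trading some flexibility for predictable parsing.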

Full Changelog

Enhancements

  • wp_js_dataset_name() and wp_html_custom_data_attribute_name() map between HTML attributes and the .dataset property in JavaScript. [#61501, PR#9953]
  • The WP_UnitTestCase class now contains an assertEqualHTML() method which determines if two strings represent the same normative HTML. [#63527, PR#8882]
  • Multiple length checks are safely skipped when processing SCRIPT content due to an early minimum-length check. [#63738, PR#9230]
  • Encoding detection in META tags is simplified, leading to a minor performance lift. [#63738, PR#9231]
  • WP_HTML_Processor::serialize_token() is now public, making it easier to mix the full safety of the HTML API with outside code modifying and combining HTML. [#63823, PR#9456]
  • The Tag Processor and HTML Processor handle invalid null inputs safely. [#63854, PR#9545]
  • set_modifiable_text() rejects additional contents inside a SCRIPT element when the contents could disturb its normal closing. [#63738, PR#9560]

Bug Fixes

  • Attempting to avoid the HTTP Referer problem, quirks mode is referred to as indicated_compatibility_mode. [#63391, PR#9401]
  • wp_kses() no longer unescapes escaped numeric character references for users without unfiltered_html, preserving more of the actual entered content in a post or comment. [#63630, PR#9099]
  • SCRIPT tags are properly closed in the presence of abruptly-closed HTML comments within the contents, and when the closing SCRIPT tag’s tag name is delimited by a form-feed. [#63738, PR#9397]
  • wp_kses() now allows some previously-missing HTML5 semantic tags and their attributes. [#63786, PR#9379]
  • set_attribute() directly escapes syntax characters into HTML character references to avoid problems with double-escaping logic. This ensures that all values are represented accurately in the resulting HTML. [#64054, PR#10143]

Core refactors

A number of places in Core were updated to benefit from the HTML API.

  • Several of the unit tests now rely on assertEqualHTML(), including those for block supports, wp_rel_nofollow(), wp_rel_ugc(), wp_kses(), post-filtering, media, and oEmbed filtering. [#59622, #63694, PR#5486, PR#9251, PR#9255, PR#9257, PR#9258, PR#9259, PR#9264]
  • get_url_in_content() relies on the Tag Processor to more reliably detect links. Besides improving general HTML parsing, this new version always returns the decoded href attribute, preventing confusion in downstream code. [#63694, PR#9272]
  • Processing for image blocks in classic themes is now performed via the HTML API rather than with PCREs. [#63694, PR#10218]

Acknowledgements

Props to @jonsurrell and @westonruter for reviewing this post.

  1. Methods to replace innerHTML and outerHTML, wrap an element, unwrap an element, insert elements, and more are possible by scanning through a document and conditionally copying the normalized tokens into an output string.
