Introducing the streaming block parser in WordPress 6.9

WordPress 6.9 introduces the WP_Block_Processor class, a new tool inspired by the HTML API and designed for efficiently scanning, understanding, and modifying block structure in HTML documents.

Continue on to learn about this new class, its use-cases, and how you can take advantage of it for more efficient server-side processing.

Introduction

The Block Processor, introduced in #61401, presents a streaming interface to walk through a document and analyze or modify the block structure, providing a structural view of the post while retaining its textual nature.

Consider, for example, counting the types of blocks in a post using the existing parse_blocks() interface.

$blocks = parse_blocks( $post_content );
$counts = array();

while ( null !== ( $block = array_pop( $blocks ) ) ) {
  $type = $block['blockName'];
  $counts[ $type ] = 1 + ( $counts[ $type ] ?? 0 );

  foreach ( $block['innerBlocks'] as $block ) {
    $blocks[] = $block;
  }
}

Calling parse_blocks() converts the entire post into a large nested array of block information with parsed JSON attributes, and that array contains two copies of every span of HTML from the input (one for the innerHTML property and the other for the innerContent array). All of this work can lead to some surprisingly heavy parsed documents![1]
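To make the duplication concrete, here is a hand-written approximation of what parse_blocks() returns for a single paragraph block. The array literal is typed out for illustration rather than captured from a real call, and the variable names are placeholders.

```php
// Hand-written approximation of the parse_blocks() result for:
//   <!-- wp:paragraph --><p>Hello</p><!-- /wp:paragraph -->
$parsed = array(
    array(
        'blockName'    => 'core/paragraph',
        'attrs'        => array(),
        'innerBlocks'  => array(),
        'innerHTML'    => '<p>Hello</p>',
        'innerContent' => array( '<p>Hello</p>' ),
    ),
);

// The same HTML span appears twice in the structure:
// once as the innerHTML string and once inside the innerContent array.
$html_bytes = strlen( $parsed[0]['innerHTML'] )
    + strlen( implode( '', $parsed[0]['innerContent'] ) );
```

Even in this tiny case the structure references twice as many bytes of HTML as the source contained, before counting the array overhead itself.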

Further, because block structure is inherently nested, any code wanting to traverse all of the blocks in a document needs to create its own form of recursion to iterate through the output of parse_blocks(). This code, for instance, uses an array as a stack because it’s easy to write. The result is a depth-first traversal that starts from the end of the list, a decision which results in visiting the blocks out of order.
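The ordering problem can be seen on a small hand-built tree. The sketch below assumes a simplified stand-in for parse_blocks() output that only carries the blockName and innerBlocks keys; stack_order() and document_order() are hypothetical helper names, not WordPress functions.

```php
// Hand-built stand-in for parse_blocks() output; only the keys used here are included.
$tree = array(
    array(
        'blockName'   => 'core/group',
        'innerBlocks' => array(
            array( 'blockName' => 'core/heading', 'innerBlocks' => array() ),
            array( 'blockName' => 'core/paragraph', 'innerBlocks' => array() ),
        ),
    ),
    array( 'blockName' => 'core/image', 'innerBlocks' => array() ),
);

// Stack-based walk, as in the example above: always pops from the end of the array.
function stack_order( array $blocks ): array {
    $visited = array();
    while ( null !== ( $block = array_pop( $blocks ) ) ) {
        $visited[] = $block['blockName'];
        foreach ( $block['innerBlocks'] as $inner ) {
            $blocks[] = $inner;
        }
    }
    return $visited;
}

// Recursive pre-order walk: visits blocks in the order they appear in the document.
function document_order( array $blocks ): array {
    $visited = array();
    foreach ( $blocks as $block ) {
        $visited[] = $block['blockName'];
        foreach ( document_order( $block['innerBlocks'] ) as $name ) {
            $visited[] = $name;
        }
    }
    return $visited;
}

$stack_result    = stack_order( $tree );
// core/image, core/group, core/paragraph, core/heading

$document_result = document_order( $tree );
// core/group, core/heading, core/paragraph, core/image
```

For counting block types the order does not matter, but any logic that depends on position (such as "the first image in the post") would need the recursive form or an equivalent.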

Lastly, none of this block-type counting can even begin until the entire post has been parsed. It’s not possible to count the types of blocks within the first ten blocks in any given post: it’s all or nothing. What a hassle in terms of memory overhead, runtime, and code complexity just to perform computation that won’t even be used…

The Block Processor was designed for cases where parse_blocks() would otherwise perform unnecessary work.

Here is a version of the above example written with the new Block Processor.

$processor = new WP_Block_Processor( $post_content );
$counts    = array();

while ( $processor->next_block() ) {
  $type = $processor->get_block_type();
  $counts[ $type ] = 1 + ( $counts[ $type ] ?? 0 );
}

There’s no more recursion because the processor visits all of the blocks in a post in a single forward pass, in the order they appear in the document. That is, the Block Processor visits the start of a block, then all of its inner blocks, and their inner blocks, before moving on to the next block.[2]

What’s less visible in the code is that the processor skipped creating all of the nested arrays, skipped splitting and duplicating the HTML strings, skipped parsing and allocating all of the JSON attributes. It even skipped allocating and normalizing the block names until calling get_block_type().

As with the HTML API, the Block Processor can pause and stop parsing at any point. This makes it possible to analyze only the first ten blocks, or only the blocks which start within the first kilobyte of a post, or (if paired with the HTML API) to analyze blocks that exist within the first 1,000 words of a post.
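As a minimal sketch of stopping early, the counting example can be limited to the first few blocks. This assumes WordPress 6.9 or later is loaded so that WP_Block_Processor exists; count_leading_block_types() and its $limit parameter are hypothetical names chosen for illustration.

```php
/**
 * Counts block types among the first $limit blocks of a post, then stops.
 *
 * Hypothetical helper: only the WP_Block_Processor calls come from the API.
 */
function count_leading_block_types( string $post_content, int $limit = 10 ): array {
    $processor = new WP_Block_Processor( $post_content );
    $counts    = array();
    $seen      = 0;

    // Check the limit first so the processor never scans past the last block needed.
    while ( $seen < $limit && $processor->next_block() ) {
        $type            = $processor->get_block_type();
        $counts[ $type ] = 1 + ( $counts[ $type ] ?? 0 );
        ++$seen;
    }

    // The rest of the document is left unscanned.
    return $counts;
}
```

Because next_block() also visits inner blocks, the limit here applies to blocks in document order, nested ones included.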

This API opens up new ways to interact with block content.

API overview

General concepts

The Block Processor bridges the textual world of post_content and the structural world of parse_blocks(). It scans through a given input document and stops when it matches spans of text that it recognizes as meaningful block structure or syntax. The spans are called tokens and represent block comment delimiters and the HTML between them.

There are three kinds of block delimiters: opening delimiters open a block, closing delimiters close a block, and void delimiters represent the opening and closing of a block which contains no content. These distinctions are not represented in the output of parse_blocks(), since it returns the result of fully parsing a document; in fact, it’s not possible to determine whether a block was a void block when using that function.

HTML tokens are special and come in two varieties: when they appear outside block comment delimiters they are top-level freeform HTML content and represent a special core/freeform block[3]; when they appear inside block structure they become part of a given block’s innerHTML.

Several of the class methods accept a block type parameter. The block type comprises a namespace and a block name separated by a slash, like core/paragraph. The core namespace is special because it’s implicit when no namespace is provided. For example, a block type of paragraph always represents the fully-qualified block type core/paragraph. Anywhere a block type is passed into the Block Processor, it will represent the fully-qualified form. Anywhere the Block Processor returns a block type it will return the fully-qualified form.
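The naming rule can be sketched in a few lines of plain PHP. The qualify_block_type() helper below is hypothetical and only illustrates the rule described above; it is not how the Block Processor implements it, and it ignores the wildcard.

```php
// Hypothetical helper illustrating the naming rule; not part of WP_Block_Processor.
function qualify_block_type( string $block_type ): string {
    // A name without a slash has no namespace, so the implicit "core" applies.
    if ( false === strpos( $block_type, '/' ) ) {
        return 'core/' . $block_type;
    }

    // Already namespaced: return unchanged.
    return $block_type;
}

$short     = qualify_block_type( 'paragraph' );             // core/paragraph
$qualified = qualify_block_type( 'core/paragraph' );        // core/paragraph
$plugin    = qualify_block_type( 'jetpack/tiled-gallery' ); // jetpack/tiled-gallery
```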

The wildcard * is a special block type. It matches any block type, including the implicit core/freeform block that emerges with top-level freeform HTML content. These parts of the document are normally skipped by the methods seeking block delimiters because there are no explicit block delimiters for them. Passing the wildcard causes the Block Processor to treat them as if real delimiters did exist.[4]

Code using this new API will generally follow a basic flow:

  • Create a new Block Processor instance.
  • Scan through the document until finding a token of interest.
  • Query the processor for information about that token.
  • Repeat.

In WordPress 6.9, the Block Processor is read-only. While it provides an indirect mechanism for modifying blocks, its current functionality is focused on analyzing documents and extracting block structure. Expect to see new capabilities in future releases.

Method reference

Working with the Block Processor requires creating a new instance by passing in the content containing blocks to the class constructor function.

$processor = new WP_Block_Processor( $html );

Scanning through the document occurs through one of three related methods:

  • next_block() finds the next block comment delimiter which opens a block, or a block of a specific type if one is provided.
  • next_delimiter() finds the next block comment delimiter even if it closes a block, or the next block comment delimiter of a specific type if one is provided.
  • next_token() walks through every syntax token, which will be one of a block delimiter, inner HTML, or top-level freeform HTML content.

These methods all return a boolean value indicating whether they matched. When they do match, it’s possible to query the processor for more information:

  • get_delimiter_type() indicates what kind of delimiter it is, whether it opens or closes a block, represents a void block, or matches HTML.
  • get_block_type() reports the fully-qualified name of the block corresponding to the delimiter. For HTML and freeform content this will be null, matching the null blockName that parse_blocks() reports for such content.
  • get_printable_block_type() is like get_block_type() but always returns a string.[5] Inner HTML is identifiable as #innerHTML whereas top-level freeform content is reported as core/freeform.
  • is_block_type() indicates if the delimiter matches the given block type.
  • opens_block() is like is_block_type() but only returns true for opening and void delimiters. As a convenience it accepts multiple block types and indicates if the delimiter opens any one of them, e.g. $processor->opens_block( 'image', 'gallery', 'jetpack/tiled-gallery' ).
  • allocate_and_return_parsed_attributes() returns the block’s JSON attributes. Its name is deliberately verbose: this call carries the heaviest overhead within the Block Processor[6], and the name makes that cost explicit.
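As a sketch of how the scanning and querying methods combine, the following collects the attributes of every image block while leaving the attributes of all other blocks undecoded. It assumes WordPress 6.9 or later is loaded; get_image_block_attributes() is a hypothetical helper name.

```php
/**
 * Collects the parsed attributes of every image block in a post.
 *
 * Hypothetical helper: only the WP_Block_Processor calls come from the API.
 */
function get_image_block_attributes( string $post_content ): array {
    $processor = new WP_Block_Processor( $post_content );
    $found     = array();

    // Passing a block type makes next_block() skip every other kind of block.
    while ( $processor->next_block( 'image' ) ) {
        // JSON decoding happens only here, and only for the matched blocks.
        $attributes = $processor->allocate_and_return_parsed_attributes();

        // Anything that is not an array (for example null) is skipped.
        if ( is_array( $attributes ) ) {
            $found[] = $attributes;
        }
    }

    return $found;
}
```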

The querying methods above are mostly relevant when matched on a block delimiter. These methods provide information when matched on HTML:

  • is_html() indicates whether the matched token is HTML (instead of a block delimiter). This HTML could be part of a block’s inner HTML or it could be top-level freeform content.
  • is_non_whitespace_html() addresses a longstanding inconvenience with parse_blocks(): block serialization adds newlines between blocks and these parse as empty blocks with a null block name. Use this function to know when to skip these pseudo-blocks.
  • get_html_content() returns the actual HTML comprising the token. It returns the raw HTML (meaning that it might contain escaped characters) which can be fed into the HTML API for further processing.
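The HTML-related methods can be combined to gather every meaningful span of HTML in a post while ignoring the whitespace between blocks. This is a minimal sketch assuming WordPress 6.9 or later is loaded; collect_html_tokens() is a hypothetical helper name.

```php
/**
 * Returns the raw HTML of every non-whitespace HTML token in a post.
 *
 * Hypothetical helper: only the WP_Block_Processor calls come from the API.
 */
function collect_html_tokens( string $post_content ): array {
    $processor = new WP_Block_Processor( $post_content );
    $chunks    = array();

    // next_token() pauses on every token: delimiters, inner HTML, and freeform HTML.
    while ( $processor->next_token() ) {
        // Skip block delimiters and the whitespace-only HTML that separates blocks.
        if ( $processor->is_html() && $processor->is_non_whitespace_html() ) {
            $chunks[] = $processor->get_html_content();
        }
    }

    return $chunks;
}
```

Each returned string is raw HTML and can be handed to the HTML API for further inspection.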

There are also two methods that work together to serve a number of common needs: legacy code often expects a parsed block as input, even if it only works with a single block within a larger post.

  • extract_block() starts at the currently-matched block opener, parses until that block is closed, and returns the same output that parse_blocks() would have produced if it had run on the same span of the input as the processor traversed. This method provides a powerful tool for efficiently manipulating a single block (including all of its inner blocks) within a much larger document.
  • get_span() returns an object representing the part of the input comprising the currently-matched token. For example, when modifying an extracted block, get_span() provides the byte offsets necessary to stitch the serialized update back into the original post — see the example below. Since this method exposes raw string offsets it should be used with caution.

Basic recipes

Determine if a post contains a block of a given type

function contains_block( $html, ...$block_types ) {
    $processor = new WP_Block_Processor( $html );
    while ( $processor->next_block() ) {
        if ( $processor->opens_block( ...$block_types ) ) {
            return true;
        }
    }

    return false;
}

Find the first image block within the first ten blocks

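function get_top_image( $html ) {
    $processor = new WP_Block_Processor( $html );
    $remaining = 10;
    // Stop once ten blocks have been examined, even if the post continues.
    while ( $processor->next_block() && --$remaining >= 0 ) {
        if ( $processor->is_block_type( 'image' ) ) {
            // Parse only the matched image block, not the whole post.
            return $processor->extract_block();
        }
    }

    // No image block among the first ten blocks.
    return null;
}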

Add a CSS class to the element wrapping inner blocks

This is a more complicated example, but demonstrates how the Block Processor and HTML Processors can work together to safely and reliably perform common operations which are often relegated to less-reliable and less-comprehensible regular expressions.

$block_processor     = new WP_Block_Processor( $html );
$wrapping_html_token = null;

// Find the block, then capture the HTML that follows its opener,
// up to the next block delimiter.
if ( $block_processor->next_block( 'next/gallery' ) ) {
    while ( $block_processor->next_token() && $block_processor->is_html() ) {
        $wrapping_html_token = $block_processor->get_span();
    }
}

// Nothing to modify if the block or its wrapping HTML was not found.
if ( null === $wrapping_html_token ) {
    return $html;
}

// Done with the block processor, work with HTML next.
$wrapping_html  = substr( $html, $wrapping_html_token->start, $wrapping_html_token->length );
$html_processor = new WP_HTML_Tag_Processor( $wrapping_html );

// Move the bookmark forward on every tag so it ends up on the last one.
while ( $html_processor->next_tag() ) {
    $html_processor->set_bookmark( 'last-tag' );
}
if ( $html_processor->seek( 'last-tag' ) ) {
    $html_processor->add_class( 'wp-is-bleeding-edge' );
}

// Stitch the updated HTML back into the original post at the same byte offsets.
return (
    substr( $html, 0, $wrapping_html_token->start ) .
    $html_processor->get_updated_html() .
    substr( $html, $wrapping_html_token->start + $wrapping_html_token->length )
);

Future developments

The Block Processor is brand new, but keep watch for updates to its functionality and where it’s used. There are plenty of places in WordPress where taking advantage of the low-overhead and streaming nature of this new API will improve site performance, and where replacing regular-expression and string-based transformations will result in more reliable site behavior.

To share your feedback and insight, please post in the #core-editor channel on Slack or create a new ticket in Trac.

Acknowledgements

Props to @amykamala, @jorbin and @miguelcf for reviewing this post.

  1. Fun fact! The Block Processor’s inspiration was a 3 MB block post which required 14 GB to parse. That document was corrupted in a specific way, but the memory requirements meant that any web server attempting to unpack it would certainly crash out-of-memory. ↩︎
  2. The Block Processor visits blocks in a pre-order traversal, which matches the order in which do_blocks() calls render_block(). ↩︎
  3. “Top-level freeform HTML content” is a mouthful, but also a specific description of non-block content. A block post consists of a tree of blocks, but that tree starts as a list of top-level blocks containing inner blocks. Any content inside of a block is part of that block’s inner content, but it’s still possible to find spans of text or HTML before or after any block has been opened. These non-block spans were initially designed to support backwards-compatibility with legacy non-block posts but are also used for formatting the document when serializing a post. The block editor silently discards top-level freeform HTML content consisting only of whitespace, so most people never see these spans. However, if they contain actual content then the editor will show them as their own fallback block. ↩︎
  4. The next_token() method will always pause on every token, including top-level freeform HTML content. ↩︎
  5. It’s often surprising when calling parse_blocks() to find a block tree full of blocks with a null name and whose innerHTML is all whitespace, specifically two newlines. These spans of HTML appear because of the default serialization in the block editor, which puts block comment delimiters on their own lines. There are plenty of latent bugs in legacy code assuming that $block['blockName'] will always be a string: get_printable_block_type() was designed for these cases where calling code doesn’t need or want to check for the nullity of the block name, while get_block_type() was designed to maintain compatibility with parse_blocks(). ↩︎
  6. Until allocate_and_return_parsed_attributes() is called, no JSON attributes are parsed; the processor simply takes note of the range of text the attributes span. Once called, the span of text is extracted and parsed as JSON (which could also result in a null value if the span doesn’t decode properly). This design leaves room for a future expansion toward more efficient, lazy parsing of the JSON attributes: if all that’s needed is knowing whether a key exists, or whether a specific value is present, the processor could avoid some of the allocations and costs of parsing the entire JSON document. While it is normally necessary to parse JSON to know where it starts and ends, the block parser has always applied additional constraints on the delimiter syntax so that full JSON parsing isn’t required to properly understand a block post. ↩︎
