Make WordPress Core

Opened 17 months ago

Last modified 5 weeks ago

#61501 assigned enhancement

KSES Allow more custom data attributes to align with browsers

Reported by: jonsurrell's profile jonsurrell Owned by: dmsnell's profile dmsnell
Milestone: 7.0 Priority: normal
Severity: normal Version: 6.6
Component: Formatting Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

KSES data attribute handling is unnecessarily restrictive. There are a number of valid data attribute that will be handled well and predictably across all browsers that are not allowed by KSES.

KSES rules should be relaxed to allow more data attributes and better aligning with browsers.

Follow up to #61052.

Change History (26)

#1 @sabernhardt
17 months ago

  • Component changed from General to Formatting

#2 @jonsurrell
13 months ago

PR 6492 has a patch for this.

#3 @jonsurrell
13 months ago

#62131 and Gutenberg PR 65803 demonstrate some motivation to allow more data attributes.

The Interactivity API has a "class" directive like this:

<div data-wp-class--highlight="state.highlight"></div>

This adds or removes the "highlight" class on the element depending on the value of the Interactivity API store's state.highlight value.

Tailwind is a popular CSS framework that can make use of the class directive. However, some of the class names that would need to appear as class directives are not allowed by kses:

<?php
echo wp_kses( '<div data-wp-class--top-[3px]="state">', 'post' );
// <div>

#4 @gziolo
13 months ago

  • Milestone changed from Awaiting Review to 6.8
  • Version set to 6.6

This ticket was mentioned in PR #6429 on WordPress/wordpress-develop by @dmsnell.


13 months ago
#5

  • Keywords has-patch has-unit-tests added

Trac ticket: Core-61501.
_(was Core-61052)_
Alternative in #6598

Allow for additional custom data attributes in wp_kses_attr_check().

In this patch the set of allowable custom data attribute names is expanded to permit those including dashes in the dataset name. The term dataset name here means the name appearing to JavaScript in a browser. This is found by stripping away the data- prefix, lowercasing, and then turning every dash-letter combination into a capital letter.

The Interactivity API relies on data attributes of the form data-wp-bind--enabled. The corresponding dataset name here is wpBind-Enabled, and is currently blocked by wp_kses_attr_check(). This patch allows that and any other data attribute through which would include those dashes in the translated name.

This patch is structured in two parts:

  1. Ensure that Core can recognize whether a given attribute represents a custom data attribute.
  2. Once recognizing the custom data attributes, apply a WordPress-specific set of constraints to determine whether to allow it.

The effect of this change is allowing a double-dash when previously only a single dash after the initial data- prefix was allowed. It's more complicated than this, because, as evidenced in the test suite, double dashes translate into different names based on where they appear and what follows them in the attribute name.

<details><summary>Code used to produce this table</summary>

<style>
td, th {
        margin-right: 1em;
}
</style>
<?php

require_once __PATH_TO_REPO__ . '/wordpress-develop/src/wp-load.php';

$variations = [
        'data-',
        'post-id',
        'data-post-id',
        'data-Post-ID',
        'data-uPPer-cAsE',
        "data-\u{2003}",
        'data-🐄-pasture',
        'data-wp-bind--enabled',
        'data-over_under',
        'data--before',
        'data-after-',
        'data-after-after--',
        'data--one--two--',
        'data---one---two---',
        'data-[wish:granted]',
];

echo '<table><tr><th>Attribute Name<th>Dataset Name<th>Before<th>After</tr>';
foreach ( $variations as $name ) {
        $_name    = $name;
        $_value   = 'some-value';
        $_whole   = "{$_name}=\"{$_value}\"";
        $_vless   = 'n';
        $_element = 'div';
        $_allowed_html = [ 'div' => [ 'data-*' => true ] ];

        $was_allowed = wp_old_kses_check( $_name, $_value, $_whole, $_vless, $_element, $_allowed_html );
        $was_allowed = $was_allowed ? '✅' : '🚫';

        $_name    = $name;
        $_value   = 'some-value';
        $_whole   = "{$_name}=\"{$_value}\"";
        $_vless   = 'n';
        $_element = 'div';
        $_allowed_html = [ 'div' => [ 'data-*' => true ] ];

        $is_allowed = wp_kses_attr_check( $_name, $_value, $_whole, $_vless, $_element, $_allowed_html );
        $is_allowed = $is_allowed ? '✅' : '🚫';

        echo <<<HTML
        <tr>
        <th><code>{$name}</code>
        <td style="font-family: monospace;" is-data {$name}>{$name}
        <td>{$was_allowed}
        <td>{$is_allowed}
        HTML;
}
echo '</table>';

?>
<script>
document.querySelectorAll( '[is-data]' ).forEach( node => {
        node.innerText = Object.keys( node.dataset ).join( ', ' );
} );
</script>
<?php

function wp_old_kses_check( &$name, &$value, &$whole, $vless, $element, $allowed_html ) {
        $name_low    = strtolower( $name );
        $element_low = strtolower( $element );

        if ( ! isset( $allowed_html[ $element_low ] ) ) {
                $name  = '';
                $value = '';
                $whole = '';
                return false;
        }

        $allowed_attr = $allowed_html[ $element_low ];

        if ( ! isset( $allowed_attr[ $name_low ] ) || '' === $allowed_attr[ $name_low ] ) {
                /*
                 * Allow `data-*` attributes.
                 *
                 * When specifying `$allowed_html`, the attribute name should be set as
                 * `data-*` (not to be mixed with the HTML 4.0 `data` attribute, see
                 * https://www.w3.org/TR/html40/struct/objects.html#adef-data).
                 *
                 * Note: the attribute name should only contain `A-Za-z0-9_-` chars,
                 * double hyphens `--` are not accepted by WordPress.
                 */
                if ( str_starts_with( $name_low, 'data-' ) && ! empty( $allowed_attr['data-*'] )
                        && preg_match( '/^data(?:-[a-z0-9_]+)+$/', $name_low, $match )
                ) {
                        /*
                         * Add the whole attribute name to the allowed attributes and set any restrictions
                         * for the `data-*` attribute values for the current element.
                         */
                        $allowed_attr[ $match[0] ] = $allowed_attr['data-*'];
                } else {
                        $name  = '';
                        $value = '';
                        $whole = '';
                        return false;
                }
        }

        if ( 'style' === $name_low ) {
                $new_value = safecss_filter_attr( $value );

                if ( empty( $new_value ) ) {
                        $name  = '';
                        $value = '';
                        $whole = '';
                        return false;
                }

                $whole = str_replace( $value, $new_value, $whole );
                $value = $new_value;
        }

        if ( is_array( $allowed_attr[ $name_low ] ) ) {
                // There are some checks.
                foreach ( $allowed_attr[ $name_low ] as $currkey => $currval ) {
                        if ( ! wp_kses_check_attr_val( $value, $vless, $currkey, $currval ) ) {
                                $name  = '';
                                $value = '';
                                $whole = '';
                                return false;
                        }
                }
        }

        return true;
}

</details>

https://github.com/WordPress/wordpress-develop/assets/5431237/15caab7f-1c69-4fe4-9094-f497cfcaf5e9

These are recommendations. If these naming recommendations are not followed, no errors will occur. The attributes will still be matched using CSS attribute selectors, with the attribute being case insensitive and any attribute value being case-sensitive. Attributes not conforming to these three recommendations will also still be recognized by the JavaScript HTMLElement.dataset property and user-agents will include the attribute in the DOMStringMap containing all the custom data attributes for an HTMLElement.

https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/data-*

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


9 months ago

@peterwilsoncc commented on PR #6429:


9 months ago
#7

@dmsnell I took the liberty of merging in trunk to your PR as I know you are short on time at the moment. Are you able to take a quick look to ensure I resolved the merge conflicts correctly?

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


9 months ago

#9 @audrasjb
9 months ago

  • Milestone changed from 6.8 to 6.9

As per today's bug scrub: It appears this ticket is still needs review. As 6.8 beta 1 is very close, I'm moving it to 6.9. Feel free to move it back to 6.8 if it can be committed by Monday.

#10 @jorbin
3 months ago

@dmsnell is this something you are willing to own to help get it in during 6.9?

#11 @dmsnell
3 months ago

@jorbin absolutely.

@audrasjb please accept my apology for not following up after your merge and comment.

since @jonsurrell’s changes are merged, are we talking about extending more broadly all custom data attributes? I think that would be a nice direction to take it as I’m not aware of any concerns with allowing those — browsers do not interpret them.

#12 @dmsnell
2 months ago

I did start to look for ways to audit existing usage and it’s hard 😅

Part of this is still me not fully understanding the threat models about which we are concerned. There’s a kind of paradox here: any custom data attributes which are currently used are already allowed by Core, so auditing existing usage may not reveal much.

At some point I remember discussion about trying to resolve sensitivity issues in the code where they exist. For example, the Interactivity API should be careful to avoid introducing exploits based on its usage of data attributes. Is it reasonable to apply that logic here and open up the data attributes, given that the kind of universal understanding of them is that they are neutral attributes with no predefined meaning or behavior?

Maybe we can draw a parallel to post meta. There could be dangerous interactions with plugins that read post meta, but it’s the responsibility of the plugin generating the content to be safe with it.

---

After having reviewed the linked PR I realize this is different than my memory has it. Why don’t I separate it into one patch which adds the dataset name transformation and another which incorporates that into KSES code?

This reminds of #63804 and I wonder if we have room for a js-compat.php module where we could add functions whose purpose is to harmonize understanding of JavaScript from PHP.

<?php
/**
 * Functions providing PHP-based understanding of JavaScript semantics.
 *
 * @group js-interop
 */

function wp_js_dataset_name( string $html_attribute_name ): ?string {
        
}

function wp_js_trim( string $text ): string {
        
}

function wp_js_strlen( string $text ): int {
        
}

This ticket was mentioned in PR #9953 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#13

Trac ticket: Core-61501
See: #6429

#14 @dmsnell
2 months ago

In #9953 I have proposed adding the new transformer functions as a standalone commit. This separates the act of understanding the custom data attributes from the act of allowing or rejecting them.

Last edited 2 months ago by dmsnell (previous) (diff)

This ticket was mentioned in Slack in #core by welcher. View the logs.


7 weeks ago

#16 @welcher
7 weeks ago

@dmsnell this came up in the 6.9 bug scrub today. Is this something that you'll be able to get in place for 6.9 beta 1?

@dmsnell commented on PR #9953:


5 weeks ago
#17

@peterwilsoncc @westonruter do either of you have thoughts about the structure of this proposed new module? It serves the work opening up custom data attributes by providing the necessary functions to parse them.

But it introduces another module in the top-level of wp-includes too. I think there could be plenty of new JS-interop functions, and there are some Trac tickets open for such things, so I reasoned that it might make sense to collect them in one place.

@westonruter commented on PR #9953:


5 weeks ago
#18

@dmsnell I love this.

In regards to where the functions should be located, to me it seems they may make sense being added to src/wp-includes/script-loader.php. There are other related functions in there.

@dmsnell commented on PR #9953:


5 weeks ago
#19

@westonruter I like this idea, but once I started implementing it I started liking it less.

  • script-loader is full of larger and more integrated functionality for generating HTML on the rendered page.
  • it’s a huge module with over three thousand lines.

my idea with js-interop.php is that it could be full of relatively isolated and pure functions whose purpose is purely to translate between PHP and JavaScript.

I added your changes in bb63d3c, so I would like to know your thoughts after the change and if you think it’s worth the tradeoffs to group it in that existing file.

#20 @westonruter
5 weeks ago

  • Owner set to dmsnell
  • Status changed from new to assigned

#21 @wildworks
5 weeks ago

  • Milestone changed from 6.9 to 7.0

The 6.9 Beta1 release is coming soon, but looking at the PR it seems to still be under discussion. Let's punt it to 7.0.

#22 @dmsnell
5 weeks ago

@wildworks I think we’re going to get at least half of this in. we may adjust it some during the beta, but it’s mostly small edges that are in discussion

@dmsnell commented on PR #9953:


5 weeks ago
#23

@westonruter I’m going to merge as-is, but I would really love some additional thought on this: if we leave it in script-loader.php then it’s not available until a lot of other modules such as post.php and temp†e.php. If those are only providing functions I guess it’s all the same, but I want to double-check that we aren’t inviting issues with calling these functions in theme code before it’s been loaded.

I expect the if we identify problems during the Beta period we can move these back into their own new module.

#24 @dmsnell
5 weeks ago

In 61007:

JS Interop: Add custom data attribute name converter pair.

This patch introduces two new functions: wp_js_dataset_name() and wp_html_custom_data_attribute_name(). Together, these provide reliable mapping between the HTML attribute names for custom data attributes, and the properties found in JavaScript for a given HTMLElement’s .dataset property.

These are to be used where matching names is important, such as in the Interactivity API and when WordPress is deciding whether or not to allow an attribute as a custom data attribute.

Developed in https://github.com/WordPress/wordpress-develop/pull/9953
Discussed in https://core.trac.wordpress.org/ticket/61501

Props dmsnell, westonruter.

See #61501.

#26 @dmsnell
5 weeks ago

@wildworks the rest of this is good for 7.0 — thank you. @peterwilsoncc @jorbin sorry I wasn’t able to get the audit in in time.

one idea I have for this is whether we should consider something a bit coarse as a first step:

  • reject storing URL values, including those with javascript: or data: schemes unless explicitly allowed by the filter.
  • allow everything else.

I did do some brief scanning for existing JS code using custom data attributes to set properties but realized it’s a much harder search than I expected. I’m really tempted to propose again that we simply allow all of them because ultimately it’s the responsibility of the JS code to escape content it writes into JS attributes.

when I try to think of potential abuses I’m struggling to figure out how we could tell the difference between intention or accident or malice and I may just not be in tune enough with this avenue. it seems like we either have complete risk allowing any custom data attributes on save, or the only way to eliminate that risk is to disallow all custom data attributes on save.

but it’s really save vs. rendering code I guess, because, for example, the Interactivity API is going to want to set these and I’m guessing that dynamically-rendered code will pass through wp_kses_post() at some point, probably accidentally, and things will break if not referencing one of the explicitly-allowed custom data attributes.

would appreciate anyone’s help to think this through. either way, the patch is much smaller now with the merge of [61007]

Note: See TracTickets for help on using tickets.