Make WordPress Core

Changeset 60665


Timestamp:
08/26/2025 06:21:00 PM
Author:
nerrad
Message:

HTML API: Reliably parse HTML in get_url_in_content()

As part of a larger effort in #63694, this utilizes WP_HTML_Tag_Processor instead of a regular expression to parse the string passed into get_url_in_content().

As a side benefit, this also decodes the URL, which the previous code did not do, so a string containing HTML character references, such as http&#58;//, will be properly decoded as http://.

Developed in: https://github.com/WordPress/wordpress-develop/pull/9272
Discussed in: https://core.trac.wordpress.org/ticket/63694

Props dmsnell, jonsurrell, nerrad.
Fixes #63694.

File:
1 edited

  • trunk/src/wp-includes/formatting.php

    r60630 r60665
    @@ -5979,6 +5979,6 @@
      * @since 3.6.0
      *
    - * @param string $content A string which might contain a URL.
    - * @return string|false The found URL.
    + * @param string $content A string which might contain an `A` element with a non-empty `href` attribute.
    + * @return string|false Database-escaped URL via {@see esc_url()} if found, otherwise `false`.
      */
     function get_url_in_content( $content ) {
    @@ -5987,6 +5987,10 @@
         }
     
    -    if ( preg_match( '/<a\s[^>]*?href=([\'"])(.+?)\1/is', $content, $matches ) ) {
    -        return sanitize_url( $matches[2] );
    +    $processor = new WP_HTML_Tag_Processor( $content );
    +    while ( $processor->next_tag( 'A' ) ) {
    +        $href = $processor->get_attribute( 'href' );
    +        if ( is_string( $href ) && ! empty( $href ) ) {
    +            return sanitize_url( $href );
    +        }
         }
     
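The difference described in the commit message can be illustrated outside WordPress with a minimal sketch. It runs the removed regular expression against two inputs: one whose `href` contains a character reference, and one whose `href` is unquoted (which is valid HTML). The helper `old_regex_href()` and the sample URLs are hypothetical, and `html_entity_decode()` is only a stand-in for the decoding an HTML parser performs on attribute values; it is not what the changeset itself calls.

```php
<?php
// The pattern from the removed line of the diff.
$old_pattern = '/<a\s[^>]*?href=([\'"])(.+?)\1/is';

/**
 * Hypothetical helper: returns the raw href text captured by the old
 * pattern, or false when the pattern does not match.
 */
function old_regex_href( string $content, string $pattern ) {
	return preg_match( $pattern, $content, $matches ) ? $matches[2] : false;
}

// Case 1: the colon is written as a character reference.
$encoded = '<a href="http&#58;//example.com/">link</a>';
$raw     = old_regex_href( $encoded, $old_pattern );

// The regex captures the attribute text as written, reference included.
// Decoding it (here with html_entity_decode(), as an assumption-level
// stand-in for a parser's attribute decoding) recovers the real URL.
$decoded = html_entity_decode( $raw, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// Case 2: an unquoted attribute value. The old pattern requires a quote
// character after "href=", so it finds nothing here.
$unquoted        = '<a href=http://example.com/>link</a>';
$unquoted_result = old_regex_href( $unquoted, $old_pattern );

var_dump( $raw, $decoded, $unquoted_result );
```

Inside WordPress, the new loop reads the attribute through `get_attribute( 'href' )`, so the value is already decoded when it reaches `sanitize_url()`.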