Skip to content

Conversation

@longwave
Copy link
Contributor

Q A
Branch? 7.4
Bug fix? yes
New feature? no
Deprecations? no
Issues Fix #62236
License MIT

Followup to #62240.

Instead of bailing out immediately on error, we now try harder to create an invalid element by passing any malformed tag into DOMDocument::loadHTML() for it to parse.

} catch (\DOMException) {
continue;
$dom = new \DOMDocument('1.0', $target->encoding);
$dom->loadHTML('<'.$source->tagName.'>', \LIBXML_HTML_NOIMPLIED | \LIBXML_HTML_NODEFDTD);
Copy link

@ndossche ndossche Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand what this code is supposed to do.
Are there more of these workarounds? I find this concerning. There are a lot of subtle rules that apply here in the DOM spec. For example, there is the rule about namespaces, there are rules about parser context (which you don't take into account here). What about the namespace of the created tag? What about the uppercase/lowercase wrt the namespace rules of this tag?
TLDR: this looks dangerous. I guess at one point I should make my slides / content public about parser differentials and how they almost always lead to XSS or other types of injection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I had also this feeling in #62240 (comment)

OK to close @longwave?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes let's close - we can't hope to emulate the exact behaviour of the previous parser in all edge cases, we will work around it in the Drupal test suite instead.

@longwave longwave closed this Nov 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DomCrawler] Native HTML5 parser throws new DOMException on HTML tag containing ampersand

4 participants