Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 5 additions & 4 deletions src/Symfony/Component/DomCrawler/Crawler.php
Original file line number Diff line number Diff line change
Expand Up @@ -1111,12 +1111,11 @@ private function parseXhtml(string $htmlContent, string $charset = 'UTF-8'): \DO
$internalErrors = libxml_use_internal_errors(true);

$document = \Dom\HTMLDocument::createFromString($htmlContent, \Dom\HTML_NO_DEFAULT_NS, $charset);

libxml_use_internal_errors($internalErrors);

$dom = new \DOMDocument('1.0', $document->inputEncoding);
$this->copyFromHtml5ToDom($document->documentElement, $dom);

libxml_use_internal_errors($internalErrors);

return $dom;
}

Expand Down Expand Up @@ -1275,7 +1274,9 @@ private function copyFromHtml5ToDom(\Dom\Node $source, \DOMDocument $target): vo
try {
$element = $target->createElement($source->tagName);
} catch (\DOMException) {
continue;
$dom = new \DOMDocument('1.0', $target->encoding);
$dom->loadHTML('<'.$source->tagName.'>', \LIBXML_HTML_NOIMPLIED | \LIBXML_HTML_NODEFDTD);
Copy link

@ndossche ndossche Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand what this code is supposed to do.
Are there more of these workarounds? I find this concerning. There are a lot of subtle rules that apply here in the DOM spec. For example, there is the rule about namespaces, there are rules about parser context (which you don't take into account here). What about the namespace of the created tag? What about the uppercase/lowercase wrt the namespace rules of this tag?
TLDR: this looks dangerous. I guess at one point I should make my slides / content public about parser differentials and how they almost always lead to XSS or other types of injection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I had also this feeling in #62240 (comment)

OK to close @longwave?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes let's close - we can't hope to emulate the exact behaviour of the previous parser in all edge cases, we will work around it in the Drupal test suite instead.

$element = $target->importNode($dom->firstChild);
}

foreach ($source->attributes as $attr) {
Expand Down
4 changes: 2 additions & 2 deletions src/Symfony/Component/DomCrawler/Tests/CrawlerTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -1383,8 +1383,8 @@ public static function html5Provider(): iterable
public function testHtml5MalformedContent()
{
$crawler = $this->createCrawler();
$crawler->addHtmlContent('<script&>');
self::assertEquals('<head></head><body></body>', $crawler->html());
$crawler->addHtmlContent('<html><head></head><body><script&>');
self::assertEquals('<head></head><body><script></script></body>', $crawler->html());
}

public function testAlpineJs()
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -72,13 +72,6 @@ public static function invalidHtml5Provider(): iterable
yield 'Text between comments' => ['<!--c--> test <!--cc-->'.$html];
}

public function testHtml5MalformedContent()
{
$crawler = $this->createCrawler();
$crawler->addHtmlContent('<script&>');
self::assertEquals('<head><script></script></head>', $crawler->html());
}

protected function createCrawler($node = null, ?string $uri = null, ?string $baseHref = null)
{
return new Crawler($node, $uri, $baseHref, true);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -95,11 +95,4 @@ public function testAddHtml5()
public function testHtml5ParserParseContentStartingWithValidHeading(string $content)
{
}

public function testHtml5MalformedContent()
{
$crawler = $this->createCrawler();
$crawler->addHtmlContent('<script&>');
self::assertEquals('<head><script></script></head>', $crawler->html());
}
}
Loading