php encoding_error ai_generated true

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 42 in /var/www/app/src/Parser/HtmlSanitizer.php:18

ID: php/domdocument-load-html-entity-warning

Fix Rate: 78%
Confidence: 83%
Evidence: 1
First Seen: 2023-07-22

Version Compatibility

Version      Status    Introduced    Deprecated    Notes
php:8.1.0    active
php:8.2.0    active
php:8.3.0    active

Root Cause

The HTML string passed to DOMDocument::loadHTML() contains a malformed HTML entity (e.g., &nbsp instead of &nbsp;, with the terminating semicolon missing). The HTML parser emits a warning for each such entity and may parse the surrounding content incompletely.
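
A minimal reproduction may help confirm that this is the cause. The sketch below is an illustration, not part of the original report: the sample markup is made up, and it uses libxml_use_internal_errors() so that the parser messages can be printed instead of being raised as PHP warnings. The exact message text can vary with the libxml2 version PHP is built against.

```php
<?php
// Collect libxml messages internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Made-up sample input: "&nbsp" is missing its terminating ";".
$html = '<div>Fish&nbsp chips &amp; peas</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with message and line properties.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();
```

The document is still built despite the reported problem, which is why the warning is easy to overlook until the parsed text turns out to differ from what was expected.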

generic

Official Documentation

https://www.php.net/manual/en/domdocument.loadhtml.php

Workarounds

  1. 85% success Pre-process the HTML with a regex that escapes every ampersand that does not start a well-formed entity: $html = preg_replace('/&(?![a-zA-Z0-9#]+;)/', '&amp;', $html); $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); Note that a truncated entity such as &nbsp is then kept as literal text rather than repaired into the intended character.
  2. 80% success Use the LIBXML_NOERROR flag to suppress the warning while still parsing the document: $dom->loadHTML($html, LIBXML_NOERROR); Be aware that this may hide other parsing issues.
  3. 90% success Use a more forgiving HTML parser such as html5-php (masterminds/html5-php), which handles malformed entities gracefully: $html5 = new Masterminds\HTML5(); $dom = $html5->loadHTML($html);
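
The first workaround above can be sketched as a small helper. The function name escapeStrayAmpersands and the sample input are hypothetical and chosen only for illustration; the regex and the loadHTML() flags are the ones given in the workaround.

```php
<?php
/**
 * Hypothetical helper: escape every "&" that does not begin a
 * well-formed entity such as "&amp;" or "&#169;".
 */
function escapeStrayAmpersands(string $html): string
{
    // The negative lookahead leaves "&name;" and "&#123;" untouched.
    // preg_replace() returns null on a regex engine error, so fall back
    // to the unmodified input in that case.
    return preg_replace('/&(?![a-zA-Z0-9#]+;)/', '&amp;', $html) ?? $html;
}

// Made-up sample input mixing stray, truncated, and valid entities.
$html  = '<p>a & b &amp; c &nbsp d &#169;</p>';
$clean = escapeStrayAmpersands($html);

$dom = new DOMDocument();
$dom->loadHTML($clean, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

echo $dom->saveHTML();
```

With this approach the truncated &nbsp survives as the literal text "&nbsp" in the parsed document. If the intended character matters, the input has to be repaired at its source instead.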

Dead Ends

Common approaches that don't work:

  1. 90% fail

    Suppressing the warning with @ (e.g., @$dom->loadHTML($html)) hides the error but does not fix the malformed entity, which can lead to corrupted DOM trees and unexpected behavior when traversing or querying the document.

  2. 80% fail

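    Using htmlspecialchars() on the entire HTML input encodes all ampersands, including those that are part of valid entities (e.g., &amp; becomes &amp;amp;). It also encodes the angle brackets of every tag, so the markup reaches the parser as plain text and the HTML structure is lost entirely.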

  3. 90% fail

    Switching to loadXML() instead of loadHTML() fails because HTML5 documents with unclosed tags or other non-well-formed structures are not valid XML. In that case loadXML() emits warnings and returns false rather than producing a usable document.
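
The second dead end is easy to verify directly. The following sketch uses a made-up input string and shows that after htmlspecialchars() the parser no longer sees any elements from the original markup.

```php
<?php
// Made-up sample input containing one valid entity.
$html = '<div>Fish &amp; chips</div>';

// Escapes "&", "<", and ">" (and quotes), including the "&" of "&amp;".
$escaped = htmlspecialchars($html);

$dom = new DOMDocument();
$dom->loadHTML($escaped);

// The original <div> is now plain text, so no div element exists.
$divCount = $dom->getElementsByTagName('div')->length;

echo $escaped, "\n";
echo "div elements found: ", $divCount, "\n";
```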