php encoding_error ai_generated true

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 42 in /var/www/app/src/Parser/HtmlSanitizer.php:18

ID: php/domdocument-load-html-entity-warning

Fix rate: 78%
Confidence: 83%
Evidence count: 1
First seen: 2023-07-22

Version compatibility

Version     Status   Introduced   Deprecated   Notes
php:8.1.0   active
php:8.2.0   active
php:8.3.0   active

Root cause analysis

The HTML string passed to DOMDocument::loadHTML() contains a malformed HTML entity reference (e.g., &nbsp instead of &nbsp;, or a bare & in text or in a URL query string). The HTML parser emits a warning for each such reference and recovers as best it can, which may result in incomplete parsing.
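
A minimal sketch for reproducing the problem and inspecting what the parser reports, assuming a hypothetical input string. It uses libxml_use_internal_errors() to collect parser messages instead of emitting PHP warnings. The exact messages and line numbers depend on the installed libxml2 version, so none are shown here.

```php
<?php
// Hypothetical input: "&nbsp" is missing its terminating semicolon.
$html = '<p>Fish &nbsp chips</p>';

// Collect libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Turn each LibXMLError into a short, readable line.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Clear the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($messages);
```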

generic

Official documentation

https://www.php.net/manual/en/domdocument.loadhtml.php

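Solutions

  1. Preprocess the HTML with a regular expression that escapes ampersands not belonging to a well-formed entity: $html = preg_replace('/&(?![a-zA-Z0-9#]+;)/', '&amp;', $html); $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
  2. Pass the LIBXML_NOERROR flag to suppress the warnings while still parsing the document: $dom->loadHTML($html, LIBXML_NOERROR); Note that this may also hide other parsing problems.
  3. Use a more lenient HTML parser such as html5-php (Composer package masterminds/html5), which handles malformed entities gracefully: $html5 = new Masterminds\HTML5(); $dom = $html5->loadHTML($html);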
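
Solution 1 can be sketched as follows. The helper name escapeBareAmpersands() and the sample input are hypothetical; the pattern is the one given in solution 1. Parser errors are collected with libxml_use_internal_errors() so that anything the regular expression did not fix can still be inspected.

```php
<?php
// Hypothetical helper name; the pattern is the one from solution 1.
function escapeBareAmpersands(string $html): string
{
    // Escape any "&" that is not followed by a run of letters, digits
    // or "#" ending in ";". preg_replace() returns null on a PCRE error,
    // so fall back to the unmodified input in that case.
    return preg_replace('/&(?![a-zA-Z0-9#]+;)/', '&amp;', $html) ?? $html;
}

// Hypothetical input: an unterminated entity, a valid entity, a bare "&".
$html  = '<p>Fish &nbsp chips &amp; peas, a=1&b=2</p>';
$clean = escapeBareAmpersands($html);

// Collect parser errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($clean, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$remaining = libxml_get_errors(); // problems the regex did not address
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $clean, PHP_EOL;
// <p>Fish &amp;nbsp chips &amp; peas, a=1&amp;b=2</p>
```

Note that the pattern only checks the shape of a reference, not whether the entity name exists, and that an unterminated &nbsp ends up as the literal text "&nbsp" rather than a non-breaking space. If the intent is to repair such entities instead of escaping them, a different rule is needed.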

Ineffective attempts

Common but ineffective approaches:

  1. 90% failure rate

    Suppressing the warning with @ (e.g., @$dom->loadHTML($html)) hides the error but does not fix the malformed entity, which can lead to corrupted DOM trees and unexpected behavior when traversing or querying the document.

  2. 80% failure rate

    Using htmlspecialchars() on the entire HTML input encodes every ampersand, including those that are part of valid entities (e.g., &amp; becomes &amp;amp;), and also escapes the angle brackets of the tags, breaking the HTML structure further.

  3. 90% failure rate

    Switching from loadHTML() to loadXML() fails because HTML5 documents with unclosed tags or other non-well-formed structures are not valid XML. loadXML() reports parse errors and returns false instead of producing a usable document.
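
The htmlspecialchars() failure from item 2 can be illustrated with a short sketch, assuming a hypothetical input that is already valid HTML:

```php
<?php
// Hypothetical input that contains a valid entity.
$html = '<p>Tom &amp; Jerry</p>';

// Default behaviour: tags and the valid entity are both escaped.
$escaped = htmlspecialchars($html);
echo $escaped, PHP_EOL;
// &lt;p&gt;Tom &amp;amp; Jerry&lt;/p&gt;

// With double_encode disabled the entity survives,
// but the tags are still turned into plain text.
$noDouble = htmlspecialchars($html, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8', false);
echo $noDouble, PHP_EOL;
// &lt;p&gt;Tom &amp; Jerry&lt;/p&gt;
```

In both cases the string handed to loadHTML() no longer contains any markup, so the resulting DOM holds a single text node rather than the intended elements.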