csv.Error: field larger than field limit (131072) data encoding_error ai_generated true

CSV parser fails to recognize quoted fields when file starts with UTF-8 BOM

ID: data/csv-utf8-bom-quote-mismatch

Fix rate: 90%

Confidence: 87%

Evidence count: 1

First seen: 2024-02-14

Version compatibility

Version	Status	Introduced	Deprecated	Notes
Python 3.11 csv module	active	—	—	—
Apache Commons CSV 1.10.0	active	—	—	—
Pandas 2.1.0	active	—	—	—

Root cause analysis

Many CSV parsers do not strip a UTF-8 BOM (0xEF 0xBB 0xBF) at the start of a file. The first field then begins with the BOM character instead of the quote character, so a quoted first field is no longer detected as quoted.
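The effect can be reproduced with Python's standard csv module. This is a minimal sketch rather than a definitive test case: the sample bytes and variable names are made up for illustration, and other parsers may behave differently.

```python
import csv
import io

# Made-up sample: a BOM-prefixed file whose quoted first field contains a comma.
raw = b'\xef\xbb\xbf"last, first",age\r\n"Doe, Jane",42\r\n'

# Decoding as plain UTF-8 keeps the BOM as the character U+FEFF, so the first
# field no longer starts with a quote and the comma inside it splits the field.
rows_with_bom = list(csv.reader(io.StringIO(raw.decode("utf-8"))))

# Decoding as utf-8-sig drops the leading BOM, so the quoted field is recognized.
rows_clean = list(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(rows_with_bom[0])  # three fields, the first still carrying U+FEFF and a quote
print(rows_clean[0])     # two fields
```

Only the first row is affected; later rows parse the same way in both cases.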

generic

Official documentation

https://docs.python.org/3/library/csv.html

Solutions

Strip the BOM before CSV parsing (needs import csv and from io import StringIO): with open('file.csv', 'r', encoding='utf-8-sig', newline='') as f: content = f.read().lstrip('\ufeff'); reader = csv.reader(StringIO(content))
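The one-liner above can be expanded into a small reusable function. The following is a sketch under assumptions: the helper name read_csv_without_bom and the sample data are hypothetical, and the temporary file exists only to make the example self-contained.

```python
import csv
import io
import os
import tempfile


def read_csv_without_bom(path):
    """Hypothetical helper: read every row of a CSV file, ignoring leading BOMs."""
    # 'utf-8-sig' drops one leading BOM while decoding; newline='' leaves line
    # endings untouched so the csv module can handle them itself.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        # lstrip removes any further U+FEFF characters, e.g. from a doubled BOM.
        content = f.read().lstrip("\ufeff")
    return list(csv.reader(io.StringIO(content, newline="")))


# Build a throwaway BOM-prefixed sample file (made-up data) and parse it.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbf"name, full",age\r\n"Doe, Jane",42\r\n')
    sample_path = tmp.name
try:
    rows = read_csv_without_bom(sample_path)
finally:
    os.remove(sample_path)

print(rows)
```

Reading the whole file into memory keeps the sketch short; for large files, passing the open file object straight to csv.reader is likely the better choice.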

Use pandas with encoding='utf-8-sig': pd.read_csv('file.csv', encoding='utf-8-sig')

Ineffective attempts

Common but ineffective approaches:

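Opening the file with the 'utf-8-sig' encoding in Python, as the only measure (50% failure)
The 'utf-8-sig' codec strips a single BOM at the very start of the stream. A BOM still reaches csv.reader as part of the first field if the file carries a doubled BOM or if the data was decoded as plain 'utf-8' somewhere before parsing.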
Manually skipping the first three bytes with file.seek(3) (60% failure)
This works only if the file starts with exactly one 3-byte UTF-8 BOM. It drops real data from files without a BOM, and it leaves stray bytes behind if an editor wrote additional leading bytes.
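A safer variant of the byte-skipping idea checks for the BOM before removing it, so files without a BOM are left untouched. This is a minimal sketch: the helper name parse_csv_bytes and the sample inputs are hypothetical, and it assumes the payload is UTF-8.

```python
import codecs
import csv
import io


def parse_csv_bytes(data):
    """Hypothetical helper: parse UTF-8 CSV bytes, skipping any leading BOMs."""
    # codecs.BOM_UTF8 is the 3-byte sequence EF BB BF. Remove it only when it
    # is actually present, repeating in case the BOM was written more than once.
    while data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))


# Made-up inputs: with a BOM, without a BOM, and with a doubled BOM.
with_bom = parse_csv_bytes(b'\xef\xbb\xbf"x, y",z\n')
without_bom = parse_csv_bytes(b'"x, y",z\n')
doubled_bom = parse_csv_bytes(b'\xef\xbb\xbf\xef\xbb\xbf"x, y",z\n')

print(with_bom, without_bom, doubled_bom)
```

Unlike an unconditional seek(3), the check means the same code path handles all three inputs.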