data encoding ai_generated true

CSV file with UTF-8 BOM causes first column name to include \ufeff prefix

ID: data/csv-encoding-bom-misdetection

Fix rate: 90%

Confidence: 85%

Evidence count: 1

First seen: 2023-09-01

Version compatibility

Version	Status	Introduced	Deprecated	Notes
Python 3.11	active	—	—	—
pandas 2.1.0	active	—	—	—
Spark 3.5.0	active	—	—	—
Microsoft Excel 365	active	—	—	—

Root cause analysis

CSV files saved with a UTF-8 BOM (Byte Order Mark) begin with the bytes EF BB BF. When such a file is decoded as plain UTF-8, those bytes become the character U+FEFF (`\ufeff`). Some parsers (e.g., the Python csv module reading a file opened with `encoding='utf-8'`, or Spark) treat that character as part of the first column name instead of stripping it.
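The behavior described above can be reproduced with the standard library alone. The sketch below is illustrative only: the file name `bom_demo.csv` and its contents are made up for the demonstration.

```python
import codecs
import csv
import os
import tempfile

# Write a small CSV that starts with the UTF-8 BOM (bytes EF BB BF),
# mimicking a file exported by a tool that adds a BOM.
path = os.path.join(tempfile.mkdtemp(), "bom_demo.csv")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"id,name\r\n1,alice\r\n")

# Decoding as plain UTF-8 keeps the BOM as the character U+FEFF,
# so it ends up attached to the first header field.
with open(path, encoding="utf-8", newline="") as f:
    header = next(csv.reader(f))

print(header)  # ['\ufeffid', 'name']
```

A lookup such as `row['id']` would then fail, because the actual key is `'\ufeffid'`.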

generic

Official documentation

https://docs.python.org/3/library/csv.html

Solutions

In pandas, pass `encoding='utf-8-sig'` to `pd.read_csv` to strip the BOM on read: `df = pd.read_csv('data.csv', encoding='utf-8-sig')`

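In Spark, `spark.read.option('encoding', 'UTF-8-BOM').csv('path')` is sometimes suggested, but `UTF-8-BOM` is not a standard Java charset name, so verify that your Spark version accepts it before relying on it. Alternatively, strip the BOM before loading, e.g. with GNU sed: `sed '1s/^\xEF\xBB\xBF//' file.csv > clean.csv`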
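Where GNU sed is not available, the same preprocessing step can be sketched in Python. The helper name `strip_utf8_bom` is hypothetical, not part of any library; it copies the file in binary mode and drops the three BOM bytes only when they are actually present.

```python
import codecs
import shutil


def strip_utf8_bom(src_path, dst_path):
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM if present."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        head = src.read(len(codecs.BOM_UTF8))
        if head != codecs.BOM_UTF8:
            # No BOM: keep the first bytes unchanged.
            dst.write(head)
        # Stream the remainder so large files are not loaded into memory.
        shutil.copyfileobj(src, dst)
```

For example, `strip_utf8_bom('file.csv', 'clean.csv')` produces a file that can then be handed to `spark.read.csv` with the default encoding.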

In plain Python, open the file with `encoding='utf-8-sig'` and pass the file handle to `csv.reader`: `with open('file.csv', encoding='utf-8-sig', newline='') as f: reader = csv.reader(f)`. The `utf-8-sig` codec skips a leading BOM if one is present and reads BOM-less files unchanged.
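A minimal sketch of that approach using `csv.DictReader`, so the effect on column names is visible. The function name `read_rows` is an illustrative assumption.

```python
import csv


def read_rows(path):
    """Read a CSV into a list of dicts, ignoring a leading UTF-8 BOM if any."""
    # utf-8-sig strips the BOM when present; newline="" is what the csv
    # module documentation recommends for files passed to a reader.
    with open(path, encoding="utf-8-sig", newline="") as f:
        return list(csv.DictReader(f))
```

With this, the first column is keyed by its plain name (for instance `'id'`) whether or not the file was saved with a BOM.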

Failed attempts

Common but ineffective approaches:

60% failure rate
Reading with utf-8-sig strips the BOM, but if the file is re-written without specifying encoding, the BOM may reappear or be lost, causing inconsistency.
70% failure rate
This is not scalable for large datasets; the BOM will reappear if the file is re-saved from Excel or other tools that add BOM by default.
90% failure rate
Python's csv module does not strip the BOM itself; if the file was decoded as plain UTF-8, the first column name will still have the \ufeff prefix.
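The inconsistency on re-write noted in the first item can be avoided by always stating the output encoding explicitly. The following is a sketch under that assumption; `write_csv` and its `with_bom` flag are hypothetical names chosen for illustration.

```python
import csv


def write_csv(path, rows, with_bom=False):
    """Write rows to a CSV, emitting a UTF-8 BOM only when asked to."""
    # "utf-8-sig" writes the BOM once at the start of the file;
    # "utf-8" never writes one.
    encoding = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)
```

Choosing `with_bom=True` only for files meant to be opened in tools that expect a BOM, and `False` everywhere else, keeps the pipeline's files consistent.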