data
encoding
ai_generated
true
CSV file with UTF-8 BOM causes first column name to include \ufeff prefix
ID: data/csv-encoding-bom-misdetection
Fix Rate: 90%
Confidence: 85%
Evidence: 1
First Seen: 2023-09-01
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Python 3.11 | active | — | — | — |
| pandas 2.1.0 | active | — | — | — |
| Spark 3.5.0 | active | — | — | — |
| Microsoft Excel 365 | active | — | — | — |
Root Cause
CSV files saved with a UTF-8 BOM (byte order mark) begin with the bytes EF BB BF. Some parsers (e.g., Python's csv module reading a file opened as plain `utf-8`, Spark) treat the decoded BOM character (`\ufeff`) as part of the first column name instead of stripping it.
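The effect can be reproduced with the standard library alone. The following is a minimal sketch, not taken from the original entry: the sample bytes and the `header` helper are hypothetical and exist only to show how the choice of codec changes the first column name.

```python
import csv
import io

# Hypothetical sample: a two-column CSV that starts with the UTF-8 BOM (EF BB BF).
raw = b"\xef\xbb\xbfid,name\r\n1,alice\r\n"

def header(data: bytes, encoding: str) -> list[str]:
    """Decode the bytes with the given codec and return the CSV header row."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return next(csv.reader(text))

print(header(raw, "utf-8"))      # BOM kept in first name: ['\ufeffid', 'name']
print(header(raw, "utf-8-sig"))  # BOM stripped: ['id', 'name']
```

A lookup such as `row["id"]` fails in the first case because the actual key is `"\ufeffid"`, which looks identical when printed in most terminals.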
Official Documentation
https://docs.python.org/3/library/csv.html
Workarounds
- 95% success: In pandas, pass `encoding='utf-8-sig'` to `pd.read_csv` to strip the BOM on read. Example: `df = pd.read_csv('data.csv', encoding='utf-8-sig')`
- 90% success: In Spark, use `spark.read.option('encoding', 'UTF-8-BOM').csv('path')`, or preprocess the file with GNU sed: `sed '1s/^\xEF\xBB\xBF//' file.csv > clean.csv`. Note that `UTF-8-BOM` is not a standard Java charset name, so confirm that your Spark version accepts it before relying on it; the sed preprocessing does not depend on Spark's charset handling.
- 95% success: In Python, open the file with `encoding='utf-8-sig'` and pass the file handle to csv.reader: `with open('file.csv', encoding='utf-8-sig', newline='') as f: reader = csv.reader(f)`
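Where GNU sed is not available, the preprocessing step can be approximated in Python. This is a sketch under stated assumptions rather than part of the original workaround: `strip_utf8_bom` is a hypothetical helper name, and it only handles a UTF-8 BOM at the very start of the file.

```python
import codecs
import shutil

def strip_utf8_bom(src_path: str, dst_path: str) -> bool:
    """Copy src to dst without a leading UTF-8 BOM. Returns True if one was removed."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        head = src.read(len(codecs.BOM_UTF8))  # peek at the first three bytes
        had_bom = head == codecs.BOM_UTF8
        if not had_bom:
            dst.write(head)  # no BOM: keep the bytes we peeked at
        shutil.copyfileobj(src, dst)  # stream the rest without loading it into memory
    return had_bom
```

Calling `strip_utf8_bom('file.csv', 'clean.csv')` produces a file that a reader with no BOM handling can load; the copy is streamed, so file size should not matter much.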
Dead Ends
Common approaches that don't work:
- 60% fail: Reading with `utf-8-sig` strips the BOM, but if the file is then re-written without an explicit encoding, the BOM may reappear or be lost, causing inconsistency.
- 70% fail: This does not scale to large datasets, and the BOM reappears whenever the file is re-saved from Excel or another tool that adds a BOM by default.
- 90% fail: Python's csv module does not strip the BOM on its own; unless the file is opened with `utf-8-sig`, the first column name still has the `\ufeff` prefix.
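One way to avoid the read/write inconsistency described in the first item is to make the BOM policy explicit on both sides. The sketch below assumes a hypothetical helper, `rewrite_csv`, with a `keep_bom` flag; neither comes from the original entry.

```python
import csv

def rewrite_csv(src_path: str, dst_path: str, keep_bom: bool) -> None:
    """Read a CSV (with or without a BOM) and write it back with an explicit BOM policy."""
    out_encoding = "utf-8-sig" if keep_bom else "utf-8"
    # utf-8-sig on read accepts both BOM and non-BOM input.
    with open(src_path, encoding="utf-8-sig", newline="") as src, \
         open(dst_path, "w", encoding=out_encoding, newline="") as dst:
        csv.writer(dst).writerows(csv.reader(src))
```

Passing `keep_bom=True` suits files destined for tools that expect a BOM, while `keep_bom=False` suits parsers that would otherwise fold it into the first column name. Note that `csv.writer` uses `\r\n` line endings by default, so the output may differ from the input in line terminators.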