# CSV file with a UTF-8 BOM leaves a \ufeff prefix on the first column name

- **ID:** `data/csv-encoding-bom-misdetection`
- **Domain:** data
- **Category:** encoding
- **Verification level:** ai_generated
- **Fix rate:** 90%

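## Root cause

A CSV file saved with a UTF-8 BOM (byte order mark) starts with the bytes `EF BB BF`, which decode to the character U+FEFF. Some parsers (for example Python's csv module when the file is opened as plain `utf-8`, or Spark) treat that character as part of the first column name instead of stripping it, so the name becomes `\ufeff` followed by the real header text.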
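The effect can be reproduced without any file on disk. The following is a minimal sketch using only the standard library; the sample bytes and the column names `id` and `name` are hypothetical.

```python
import csv
import io

# Hypothetical sample: a UTF-8 BOM (EF BB BF) followed by a two-column CSV.
raw = b"\xef\xbb\xbfid,name\n1,Alice\n"

# Decoding as plain UTF-8 keeps the BOM as the character U+FEFF.
text = raw.decode("utf-8")
header = next(csv.reader(io.StringIO(text)))

# The first column name now starts with U+FEFF, so it no longer equals "id".
print(repr(header[0]))
print(header[0] == "id")
```

Because U+FEFF is invisible in most terminals and editors, printing the name with `repr` is usually the quickest way to confirm the problem.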

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| Python 3.11 | active | — | — |
| pandas 2.1.0 | active | — | — |
| Spark 3.5.0 | active | — | — |
| Microsoft Excel 365 | active | — | — |

## Solutions

1. In pandas, pass `encoding='utf-8-sig'` so the BOM is stripped on read: `df = pd.read_csv('data.csv', encoding='utf-8-sig')`.
2. In Spark, use `spark.read.option('encoding', 'UTF-8-BOM').csv('path')` (check that your Spark version accepts this charset name, since it is not one of the standard Java charset names), or preprocess the file with `sed '1s/^\xEF\xBB\xBF//' file.csv > clean.csv` (the `\xHH` escapes require GNU sed).
3. In Python, open the file with `encoding='utf-8-sig'` and pass the file handle to `csv.reader`: `with open('file.csv', encoding='utf-8-sig') as f: reader = csv.reader(f)`.
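The Python and preprocessing approaches above can be sketched with the standard library alone. The helper names `strip_utf8_bom` and `read_header` and the sample bytes are hypothetical, chosen for illustration. `strip_utf8_bom` is a byte-level counterpart to the `sed` command, which may be handy when the cleaned file is later handed to another tool such as Spark.

```python
import codecs
import csv
import io


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def read_header(data: bytes) -> list[str]:
    """Decode with utf-8-sig (drops a leading BOM) and return the header row."""
    text = data.decode("utf-8-sig")
    return next(csv.reader(io.StringIO(text)))


# Hypothetical sample inputs, with and without a BOM.
with_bom = codecs.BOM_UTF8 + b"id,name\n1,Alice\n"
without_bom = b"id,name\n1,Alice\n"

print(read_header(with_bom))
print(read_header(without_bom))
print(strip_utf8_bom(with_bom) == without_bom)
```

`utf-8-sig` also decodes input that has no BOM, so it can be used as the default for files of unknown origin.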

## Failed attempts

- Reading with `utf-8-sig` strips the BOM, but if the file is then re-written without specifying an encoding, the BOM may reappear or be dropped, leaving files inconsistent. (60% failure rate)
- This does not scale to large datasets, and the BOM reappears whenever the file is re-saved from Excel or another tool that adds a BOM by default. (70% failure rate)
- Python's csv module does not strip the BOM on its own; the first column name still carries the `\ufeff` prefix. (90% failure rate)
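The inconsistency on re-write can be avoided by always choosing the output encoding explicitly. This is a minimal sketch; the helpers `write_csv` and `has_bom`, the file names, and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def write_csv(path, rows, with_bom):
    """Write rows as UTF-8, adding a BOM only when with_bom is True."""
    encoding = "utf-8-sig" if with_bom else "utf-8"
    # newline="" is the setting the csv module documentation recommends.
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


def has_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"


rows = [["id", "name"], ["1", "Alice"]]
tmp_dir = tempfile.mkdtemp()
bom_path = os.path.join(tmp_dir, "with_bom.csv")
plain_path = os.path.join(tmp_dir, "plain.csv")

write_csv(bom_path, rows, with_bom=True)
write_csv(plain_path, rows, with_bom=False)

print(has_bom(bom_path), has_bom(plain_path))
```

Whether to keep the BOM depends on the consumer: the source notes that Excel and similar tools add one by default, while the parsers discussed here are simpler to handle without it.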
