csv.Error: field larger than field limit (131072)

data encoding_error ai_generated true

CSV parser fails to recognize quoted fields when file starts with UTF-8 BOM

ID: data/csv-utf8-bom-quote-mismatch

90% Fix Rate

87% Confidence

1 Evidence

2024-02-14 First Seen

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
Python 3.11 csv module	active	—	—	—
Apache Commons CSV 1.10.0	active	—	—	—
Pandas 2.1.0	active	—	—	—

Root Cause

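Many CSV parsers do not strip a UTF-8 BOM (bytes 0xEF 0xBB 0xBF) at the start of a file. The BOM then becomes part of the first field, and if that field is quoted, the opening quote is no longer the first character of the field, so the parser does not treat the field as quoted.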
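
The effect can be reproduced with Python's csv module. This is a minimal sketch: the sample bytes and the helper name `first_row` are made up for illustration, and other parsers may behave differently.

```python
import csv
import io

# Hypothetical sample: a UTF-8 BOM followed by a quoted first field containing a comma.
raw = b'\xef\xbb\xbf"name, full","age"\r\n"Doe, Jane","41"\r\n'

def first_row(data: bytes, encoding: str) -> list[str]:
    """Decode the bytes with the given codec and return the first parsed CSV row."""
    text = io.StringIO(data.decode(encoding), newline='')
    return next(csv.reader(text))

# Plain 'utf-8' keeps the BOM as '\ufeff', so the first field no longer starts
# with a quote and the comma inside it is treated as a delimiter.
broken = first_row(raw, 'utf-8')

# 'utf-8-sig' drops the leading BOM, so the quoted field is parsed as one value.
clean = first_row(raw, 'utf-8-sig')

print(broken)
print(clean)
```

With plain 'utf-8' the header row splits into three fields and the first one starts with '\ufeff'; with 'utf-8-sig' it parses into the two intended fields.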

generic

Official Documentation

https://docs.python.org/3/library/csv.html

Workarounds

95% success Strip the BOM before CSV parsing by opening the file with encoding='utf-8-sig' and removing any remaining BOM characters from the decoded text:
```
import csv
from io import StringIO

# 'utf-8-sig' drops one leading BOM while decoding; lstrip removes any further BOMs
with open('file.csv', 'r', encoding='utf-8-sig', newline='') as f:
    content = f.read().lstrip('\ufeff')
reader = csv.reader(StringIO(content, newline=''))
```
90% success Use pandas with encoding='utf-8-sig':
```
import pandas as pd

# pandas decodes the file with 'utf-8-sig', so a leading BOM is dropped
df = pd.read_csv('file.csv', encoding='utf-8-sig')
```
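
A quick way to check whether a workaround took effect is to inspect the header names that csv.DictReader reports. The sketch below assumes a small throwaway file written to a temporary directory; the file contents and the helper name `header_names` are illustrative only.

```python
import csv
import os
import tempfile

def header_names(path: str, encoding: str) -> list[str]:
    """Return the header row as seen by csv.DictReader under the given encoding."""
    with open(path, 'r', encoding=encoding, newline='') as f:
        return csv.DictReader(f).fieldnames

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    # Hypothetical file: a UTF-8 BOM followed by a two-column CSV.
    with open(path, 'wb') as f:
        f.write(b'\xef\xbb\xbfid,name\r\n1,Ada\r\n')

    with_bom = header_names(path, 'utf-8')         # BOM leaks into the first header
    without_bom = header_names(path, 'utf-8-sig')  # BOM removed during decoding

print(with_bom)
print(without_bom)
```

If the first header name starts with '\ufeff', the BOM is still reaching the parser.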

Dead Ends

Common approaches that don't work:

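Relying only on opening the file with 'utf-8-sig' encoding in Python 50% fail
The 'utf-8-sig' codec removes one leading BOM during decoding, so csv.reader does not see it when it reads from that file object. The approach fails when the data reaches csv.reader by another route, for example as a string that was already decoded with plain 'utf-8', or when the file carries more than one BOM, since only the first is removed.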
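Manually skipping the first three bytes with file.seek(3) 60% fail
A UTF-8 BOM is always three bytes, but seek(3) skips them unconditionally: it discards the first three bytes of real data in files without a BOM, and it skips the wrong number of bytes when the file uses a different encoding (a UTF-16 BOM is two bytes).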
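
Rather than seeking past a fixed offset, one option is to check for the BOM bytes and drop them only when they are present. This is a sketch under the assumption that the input is UTF-8 and already loaded as bytes; `read_csv_bytes` is a hypothetical helper name, while `codecs.BOM_UTF8` is the standard-library constant for the three BOM bytes.

```python
import codecs
import csv
import io

def read_csv_bytes(data: bytes) -> list[list[str]]:
    """Parse UTF-8 CSV bytes, removing a leading BOM only if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    text = io.StringIO(data.decode('utf-8'), newline='')
    return list(csv.reader(text))

# Hypothetical inputs: the same content with and without a BOM.
payload = b'"id","name"\r\n"1","Ada"\r\n'
rows_with_bom = read_csv_bytes(codecs.BOM_UTF8 + payload)
rows_without_bom = read_csv_bytes(payload)

print(rows_with_bom)
print(rows_without_bom)
```

Both calls return the same rows, so files without a BOM lose no data.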