Tags: UnicodeDecodeError, data, encoding_error (ai_generated: true)

CSV parsing error: UnicodeDecodeError with 'utf-8' codec when reading an ISO-8859-1 encoded file as UTF-8

ID: data/csv-encoding-iso-8859-1-vs-utf-8

Fix Rate: 95%

Confidence: 90%

Evidence: 1

First Seen: 2023-08-22

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
Python 3.10+	active	—	—	—
pandas 2.0+	active	—	—	—
Python csv module	active	—	—	—

Root Cause

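A CSV file encoded in ISO-8859-1 (Latin-1) stores each accented character as a single byte at or above 0x80, for example 'é' as 0xE9 and 'ñ' as 0xF1. In UTF-8 such bytes are only valid as part of a multi-byte sequence, so a lone 0xE9 followed by an ASCII byte is invalid, and a UTF-8 decoder (the default in `pandas.read_csv`) raises a UnicodeDecodeError.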
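
The failure can be reproduced without any file at all. The following is a minimal sketch using only the standard library; the sample row `José,Muñoz` is made up for illustration.

```python
# 'é' is the single byte 0xE9 in ISO-8859-1 but two bytes (0xC3 0xA9) in UTF-8.
latin1_bytes = "José,Muñoz\n".encode("iso-8859-1")

try:
    latin1_bytes.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # raised at the first byte that is not valid UTF-8

if error is not None:
    offending = latin1_bytes[error.start]
    print(f"offending byte 0x{offending:02X} at offset {error.start}")

# Decoding with the encoding the data was written in recovers the text.
text = latin1_bytes.decode("iso-8859-1")
```

The `start` attribute of the exception gives the byte offset of the first problem, which helps locate the offending row in a large file.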

Official Documentation

https://docs.python.org/3/library/csv.html#csv.reader

Workarounds

90% success Detect the encoding, then pass it explicitly. `chardet` can guess it from a sample of the raw bytes: after `import chardet`, run `raw = open('file.csv', 'rb').read(10000); encoding = chardet.detect(raw)['encoding']`. Then read with `pandas.read_csv('file.csv', encoding=encoding)`. Detection is a statistical guess, so check the result when the sample is short.
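
The detection step could be wrapped in a small helper along these lines. This is a sketch, not a definitive implementation: `guess_encoding` and its `fallback` parameter are hypothetical names, the UTF-8 check before calling `chardet` is an added assumption, and the block falls back to ISO-8859-1 when `chardet` is not installed or its guess cannot decode the sample.

```python
import codecs

try:
    import chardet  # third-party package: pip install chardet
except ImportError:  # keep the sketch usable without it
    chardet = None


def guess_encoding(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Return the name of a codec that can decode `raw` (hypothetical helper)."""
    # Check UTF-8 first. The incremental decoder tolerates a multi-byte
    # character that was cut in half at the end of the sample.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(raw)
        return "utf-8"
    except UnicodeDecodeError:
        pass

    if chardet is not None:
        guess = chardet.detect(raw).get("encoding")
        if guess:
            try:
                raw.decode(guess)  # accept the guess only if it really decodes
                return guess
            except (LookupError, UnicodeDecodeError):
                pass

    # ISO-8859-1 maps every byte to a character, so it never fails to decode.
    return fallback


# Intended use, mirroring the workaround above:
#     with open("file.csv", "rb") as f:
#         encoding = guess_encoding(f.read(10000))
#     df = pandas.read_csv("file.csv", encoding=encoding)
```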
95% success Convert the file to UTF-8 with the `iconv` command: `iconv -f ISO-8859-1 -t UTF-8 original.csv > converted.csv`. Then read the converted file with the default UTF-8 encoding.
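
Where `iconv` is not available (for example on a stock Windows machine), the same conversion can be sketched in Python with the standard library. The function name `convert_to_utf8` is hypothetical, and the sketch assumes the source file really is ISO-8859-1.

```python
def convert_to_utf8(src: str, dst: str, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode the text file `src` as UTF-8 and write it to `dst`."""
    # newline="" turns off newline translation, so line endings are preserved.
    with open(src, "r", encoding=source_encoding, newline="") as reader, \
         open(dst, "w", encoding="utf-8", newline="") as writer:
        for line in reader:  # stream line by line to keep memory use low
            writer.write(line)
```

Calling `convert_to_utf8("original.csv", "converted.csv")` corresponds to the `iconv` command above.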
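100% success Read with `encoding='ISO-8859-1'` in pandas: `df = pd.read_csv('file.csv', encoding='ISO-8859-1')`. ISO-8859-1 maps every byte value to a character, so decoding never raises; if the file is actually in a different encoding, the read still succeeds but some characters come out wrong.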
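
The same explicit-encoding approach works with the standard `csv` module listed in the compatibility table. A minimal sketch, with `read_latin1_csv` as a hypothetical helper name:

```python
import csv


def read_latin1_csv(path: str) -> list[list[str]]:
    """Read every row of an ISO-8859-1 encoded CSV file into a list."""
    # The csv module documentation recommends newline="" for file objects.
    with open(path, encoding="iso-8859-1", newline="") as f:
        return list(csv.reader(f))
```

Passing `encoding` to `open()` matters here because, without it, Python uses a platform-dependent default rather than the encoding the file was written in.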

Dead Ends

Common approaches that don't work:

90% fail
Decoding with `errors='ignore'`: invalid bytes are dropped silently, which corrupts the data. For example, 'José' becomes 'Jos'.
50% fail
Converting the file in Notepad++: if auto-detect guesses the original encoding wrong, the characters are misinterpreted or double-encoded, producing mojibake.
70% fail
Opening and re-saving the file in Excel: Excel may add a BOM, change the delimiter to a semicolon based on locale, or strip leading zeros from numeric fields.
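
The first two dead ends above are easy to see in a Python session. The snippet below is an illustrative sketch using only built-in string methods; the name 'José' is the example from the list.

```python
latin1_bytes = "José".encode("iso-8859-1")  # the 'é' is the single byte 0xE9

# errors="ignore" drops the byte that is not valid UTF-8, with no warning.
ignored = latin1_bytes.decode("utf-8", errors="ignore")

# errors="replace" at least leaves a visible U+FFFD marker where data was lost.
replaced = latin1_bytes.decode("utf-8", errors="replace")

# Mojibake: UTF-8 bytes decoded as ISO-8859-1, as after a wrong auto-detect.
mojibake = "José".encode("utf-8").decode("iso-8859-1")

print(ignored, replaced, mojibake)
```

In both the `ignore` and `replace` cases the original character is gone for good, which is why fixing the encoding at read time is preferable to suppressing the error.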