UnicodeDecodeError data encoding_error ai_generated true

CSV parsing error: UnicodeDecodeError with 'charmap' codec when reading ISO-8859-1 encoded file as UTF-8

ID: data/csv-encoding-iso-8859-1-vs-utf-8

Fix rate: 95%
Confidence: 90%
Evidence count: 1
First seen: 2023-08-22

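Version compatibility

| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Python 3.10+ | active | | | |
| pandas 2.0+ | active | | | |
| Python csv module | active | | | |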

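Root cause analysis

A CSV file encoded in ISO-8859-1 (Latin-1) stores each accented character such as 'é' or 'ñ' as a single byte above 0x7F (0xE9 and 0xF1 respectively). In UTF-8 those byte values are lead bytes that must be followed by continuation bytes, so a UTF-8 decoder raises UnicodeDecodeError when the bytes that follow do not fit. When the decoder really is UTF-8, Python's message names the 'utf-8' codec. A message naming the 'charmap' codec means the file was decoded with a single-byte code page instead, typically the locale default (for example cp1252 on Windows) when no `encoding` argument was given.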
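The failure can be reproduced without any file at all. The snippet below is a minimal sketch using only built-in string methods; the sample text `José,Málaga` is made up for illustration.

```python
# 'é' is the single byte 0xE9 in ISO-8859-1, but the two bytes 0xC3 0xA9 in UTF-8.
latin1_bytes = "José,Málaga\n".encode("iso-8859-1")

try:
    latin1_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    # Keep a reference: the name bound by "as" is cleared after the except block.
    decode_error = exc

if decode_error is not None:
    bad_byte = decode_error.object[decode_error.start]
    print(f"{decode_error.encoding} decoder failed at offset "
          f"{decode_error.start} on byte 0x{bad_byte:02X}: {decode_error.reason}")

# Decoding with the encoding the data was actually written in succeeds.
decoded = latin1_bytes.decode("iso-8859-1")
print(decoded, end="")
```

The exception's `encoding`, `start`, and `reason` attributes are often more useful than the formatted message, because they identify which decoder was in effect and where the first offending byte sits.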

generic

Official documentation

https://docs.python.org/3/library/csv.html#csv.reader

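Solutions

  1. Detect the encoding and pass it explicitly. `chardet` can guess it from a sample of the raw bytes: open the file in binary mode with `open('file.csv', 'rb')`, call `chardet.detect(f.read(10000))` on the sample, and take `result['encoding']`. Then read the file with `pandas.read_csv('file.csv', encoding=encoding)`. The result is a statistical guess, so check `result['confidence']` before relying on it.
  2. Convert the file to UTF-8 with the `iconv` command: `iconv -f ISO-8859-1 -t UTF-8 original.csv > converted.csv`. Then read the converted file with the default UTF-8 encoding.
  3. Read the file in pandas with `encoding='ISO-8859-1'`: `df = pd.read_csv('file.csv', encoding='ISO-8859-1')`.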
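For code that uses the standard `csv` module rather than pandas, the same ideas can be sketched with the standard library alone. This is not the `chardet` approach from step 1: it swaps in a simpler "try UTF-8 first, fall back to ISO-8859-1" strategy. The helper names `read_csv_rows` and `transcode_to_utf8` and the sample file contents are hypothetical. Note that ISO-8859-1 maps every possible byte, so it never fails and must come last; a file in some other single-byte encoding would be decoded without error but with wrong characters.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def read_csv_rows(path, encodings=("utf-8", "iso-8859-1")):
    """Return (rows, encoding_used), trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            # newline="" lets the csv module handle line endings itself.
            with open(path, newline="", encoding=encoding) as f:
                return list(csv.reader(f)), encoding
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError(f"none of {encodings} could decode {path}")


def transcode_to_utf8(src, dst, src_encoding="iso-8859-1"):
    """Rough Python equivalent of: iconv -f ISO-8859-1 -t UTF-8 src > dst."""
    with open(src, newline="", encoding=src_encoding) as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        shutil.copyfileobj(fin, fout)  # streams the text in chunks


# Demo on a throwaway file written in ISO-8859-1.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.csv"
    converted = Path(tmp) / "converted.csv"
    original.write_bytes("name,city\r\nJosé,Málaga\r\n".encode("iso-8859-1"))

    rows, used_encoding = read_csv_rows(original)
    transcode_to_utf8(original, converted)
    converted_rows, converted_encoding = read_csv_rows(converted)

print(used_encoding, rows)
print(converted_encoding, converted_rows)
```

Converting once and storing the UTF-8 copy is usually preferable to guessing on every read, since the guess only has to be verified a single time.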

Failed attempts

Common approaches that do not work:

  1. 90% failure rate

    Ignoring decode errors silently drops characters, which corrupts the data. For example, 'José' becomes 'Jos'.

  2. 50% failure rate

    Notepad++ may misinterpret the original encoding if its auto-detection guesses wrong, or it may double-encode characters, producing mojibake.

  3. 70% failure rate

    Excel may add a BOM, change the delimiter to a semicolon based on the locale, or strip leading zeros from numeric fields.
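The data loss from ignoring errors, and the BOM that a spreadsheet re-save may introduce, can both be observed on in-memory bytes. The snippet below is a small sketch with made-up sample values; it assumes the BOM in question is the UTF-8 one, which Python's `utf-8-sig` codec strips.

```python
# A Latin-1 row decoded as UTF-8 with errors="ignore": the 0xE9 byte vanishes.
raw = "José,00123\n".encode("iso-8859-1")
lossy = raw.decode("utf-8", errors="ignore")
correct = raw.decode("iso-8859-1")
print(repr(lossy))    # the accented character is gone, with no warning
print(repr(correct))

# A UTF-8 file that starts with a BOM: plain "utf-8" keeps U+FEFF glued to
# the first header name, while "utf-8-sig" removes it.
with_bom = "name,city\n".encode("utf-8-sig")
header_plain = with_bom.decode("utf-8").split(",")[0]
header_sig = with_bom.decode("utf-8-sig").split(",")[0]
print(repr(header_plain), repr(header_sig))
```

A stray U+FEFF in the first column name is easy to miss because it is invisible when printed, yet it makes lookups such as `row["name"]` fail.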