data data_error ai_generated true

CSV parser silently trims leading/trailing whitespace from quoted fields

ID: data/csv-whitespace-trimming

Also available as: JSON · Markdown · 中文
85%Fix Rate
86%Confidence
1Evidence
2024-01-12First Seen

Version Compatibility

VersionStatusIntroducedDeprecatedNotes
pandas 2.0.0 active
Python csv module 3.11 active
Apache Spark 3.4.0 active

Root Cause

Many CSV parsers (e.g., pandas read_csv, Excel) trim whitespace from quoted fields by default, but some do not, causing data inconsistency between systems.

generic

中文

许多CSV解析器(例如pandas read_csv、Excel)默认从带引号的字段中删除空白,但有些不会,导致系统间数据不一致。

Official Documentation

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Workarounds

  1. 95% success Use pandas with skipinitialspace=False: df = pd.read_csv('file.csv', skipinitialspace=False)
    Use pandas with skipinitialspace=False: df = pd.read_csv('file.csv', skipinitialspace=False)
  2. 90% success Wrap fields in quotes and use a parser that preserves whitespace: csv.reader(csvfile, skipinitialspace=False)
    Wrap fields in quotes and use a parser that preserves whitespace: csv.reader(csvfile, skipinitialspace=False)

中文步骤

  1. Use pandas with skipinitialspace=False: df = pd.read_csv('file.csv', skipinitialspace=False)
  2. Wrap fields in quotes and use a parser that preserves whitespace: csv.reader(csvfile, skipinitialspace=False)

Dead Ends

Common approaches that don't work:

  1. Setting quoting=csv.QUOTE_NONE in Python's csv module 85% fail

    This disables all quoting and may break fields containing commas.

  2. Adding a post-processing step to re-add whitespace based on original file 70% fail

    Does not affect how the CSV is parsed, only how data is validated.