ArrowNotImplementedError data type_error ai_generated true

Parquet decimal precision overflow when reading into pandas

ID: data/parquet-decimal-overflow

Fix Rate: 85%
Confidence: 83%
Evidence: 1
First Seen: 2024-01-10

Version Compatibility

| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| pyarrow 12.0.0 | active | | | |
| pyarrow 14.0.1 | active | | | |
| pandas 2.2.0 | active | | | |

Root Cause

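Parquet files store decimals with a declared precision and scale (e.g., decimal(38,10)) that can far exceed what float64 represents, which is only about 15 to 17 significant decimal digits. Converting such a column to float64 on the way into pandas causes overflow or precision loss for any value beyond that capacity.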
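The precision loss itself can be reproduced without Parquet at all. The following is a minimal sketch using only the standard library; the sample value is made up for illustration and stands in for a cell from a decimal(38,10) column.

```python
from decimal import Decimal

# Hypothetical sample value: 20 integer digits and 10 fractional digits,
# well within what a decimal(38,10) column can hold.
original = Decimal("12345678901234567890.1234567890")

# Simulate the lossy step: the exact decimal is squeezed into a float64.
as_float = float(original)

# Converting the float back to Decimal exposes the exact binary value stored.
round_tripped = Decimal(as_float)

print("original     :", original)
print("as float64   :", as_float)
print("round-tripped:", round_tripped)
print("digits lost  :", round_tripped != original)
```

At this magnitude adjacent float64 values are 2048 apart, so the fractional digits and several low-order integer digits cannot survive the conversion.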

Official Documentation

https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

Workarounds

  1. 90% success Read with pyarrow and map the decimal type to an Arrow-backed pandas dtype. `types_mapper` expects a callable that returns a pandas extension dtype (or None), so pass the dict's `.get` method rather than the dict itself: `import pandas as pd; import pyarrow as pa; import pyarrow.parquet as pq; table = pq.read_table('data.parquet'); dec = pa.decimal128(38, 10); df = table.to_pandas(types_mapper={dec: pd.ArrowDtype(dec)}.get)`
  2. 82% success Use pandas read_parquet with dtype_backend='pyarrow': `import pandas as pd; df = pd.read_parquet('data.parquet', dtype_backend='pyarrow')`

Dead Ends

Common approaches that don't work:

  1. 70% fail

    This only preserves pandas-specific metadata like index names; it does not change the decimal-to-float conversion behavior.

  2. 85% fail

    The overflow already occurred during reading; the string representation will show the truncated/rounded value.

  3. 75% fail

    Fastparquet has the same limitation; it also converts decimals to float64 by default.