ArrowNotImplementedError data type_error ai_generated true

Parquet decimal precision overflow when reading into pandas

ID: data/parquet-decimal-overflow

Fix rate: 85%
Confidence: 83%
Evidence count: 1
First seen: 2024-01-10

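Version compatibility

| Version | Status | Introduced | Deprecated | Notes |
| --- | --- | --- | --- | --- |
| pyarrow 12.0.0 | active | | | |
| pyarrow 14.0.1 | active | | | |
| pandas 2.2.0 | active | | | |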

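Root cause analysis

Parquet files store decimals with arbitrary precision (e.g., decimal(38,10)), while float64 holds only about 15 to 17 significant decimal digits. Any read path that converts such a column to float64 therefore rounds values silently once they exceed that precision. Note that the pyarrow engine normally returns decimal columns as Python `decimal.Decimal` objects in an object-dtype column; the float64 conversion typically comes from the fastparquet engine or from a later cast to float.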
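The precision limit can be seen without any Parquet file at all. The following is a minimal sketch using only the standard library; the sample value is an illustrative assumption chosen to fill decimal(38,10) exactly (28 integer digits and 10 fractional digits).

```python
from decimal import Decimal

# Illustrative value that fits decimal(38,10): 28 integer digits, 10 fractional digits.
exact = Decimal("1234567890123456789012345678.0123456789")

# Converting to a Python float (an IEEE 754 float64) keeps only
# about 15 to 17 significant decimal digits.
approx = float(exact)

# Decimal(float) is an exact conversion, so it reveals the value
# that the float actually stores.
round_tripped = Decimal(approx)

print("original:     ", exact)
print("after float64:", round_tripped)
print("identical:    ", exact == round_tripped)
```

The fractional digits and the low-order integer digits do not survive the round trip, which is the loss this entry describes.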

generic

Official documentation

https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

Solutions

  1. Read with pyarrow and keep the Arrow decimal type when converting to pandas: `import pandas as pd; import pyarrow.parquet as pq; table = pq.read_table('data.parquet'); df = table.to_pandas(types_mapper=pd.ArrowDtype)`
  2. Use pandas read_parquet with dtype_backend='pyarrow': `df = pd.read_parquet('data.parquet', dtype_backend='pyarrow')`

Failed attempts

Common approaches that do not work:

  1. 70% failure rate

    This only preserves pandas-specific metadata like index names; it does not change the decimal-to-float conversion behavior.

  2. 85% failure rate

    The precision loss already occurred during reading; the string representation will show the truncated or rounded value.

  3. 75% failure rate

    Fastparquet has the same limitation; it also converts decimals to float64 by default.