data type_error ai_generated true

Parquet decimal precision overflow when reading into pandas: values truncated or converted to NaN

ID: data/parquet-decimal-precision-overflow-pandas

Fix rate: 85%
Confidence: 88%
Evidence count: 1
First seen: 2024-02-20

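Version compatibility

| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| pandas 2.2.0 | active | | | |
| pyarrow 15.0.0 | active | | | |
| Apache Parquet 1.13.0 | active | | | |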

Root cause analysis

Parquet files can store decimals with high precision (e.g., DECIMAL(38,10)), but pandas has no native fixed-point decimal dtype: its NumPy-backed numeric columns are float64 or int64. float64 carries only about 15 to 17 significant decimal digits, and int64 cannot hold integers beyond roughly 9.2 × 10^18, so any read path or cast that lands a decimal column in one of these types is subject to overflow or silent truncation.
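The precision loss can be reproduced without Parquet at all. The following is a minimal sketch using only the Python standard library; the sample value is hypothetical and was chosen to use all 38 digits of a DECIMAL(38,10).

```python
from decimal import Decimal

# Hypothetical sample value: 28 integer digits + 10 fractional digits = 38 digits
original = Decimal("1234567890123456789012345678.0123456789")

# Casting to a Python float (an IEEE 754 float64) keeps only ~15-17 significant digits
as_float = float(original)

# Converting the float back to Decimal exposes the exact value that float64 stored
round_trip = Decimal(as_float)

print("original  :", original)
print("round trip:", round_trip)
print("equal     :", round_trip == original)  # False: digits were lost in the cast
```

Nothing raises here: the cast simply keeps the nearest float64, which is why the problem tends to go unnoticed until values are compared downstream.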

generic

Official documentation

https://arrow.apache.org/docs/python/parquet.html#decimal-types

Solutions

  1. Read the Parquet file with pyarrow directly and convert it with `pq.read_table(path).to_pandas()`. By default this turns decimal columns into object columns holding Python `decimal.Decimal` values, so no digits are lost. The `timestamp_as_object` and `date_as_object` options affect only timestamp and date columns and are not needed for decimals. Example: `import pyarrow.parquet as pq; table = pq.read_table('data.parquet'); df = table.to_pandas()`
  2. Use `pd.read_parquet(path, engine='pyarrow', dtype_backend='numpy_nullable')`, then convert the decimal columns to `pd.StringDtype()` so that every digit is kept as text: `df['dec_col'] = df['dec_col'].astype('string')`. Parse the strings back with `decimal.Decimal` before doing arithmetic on them.
  3. If the precision fits within 38 digits (the limit of Arrow's decimal128 type), use `pd.read_parquet(path, engine='pyarrow', dtype_backend='pyarrow')`, which returns decimal columns as `pd.ArrowDtype` and keeps their full precision. The older `use_nullable_dtypes=True` flag has been deprecated since pandas 2.0 in favor of `dtype_backend`.

Ineffective attempts

Common approaches that do not work:

  1. 70% failure rate

    This uses pyarrow for conversion but still may overflow if the decimal precision exceeds 38 digits or if pyarrow's default decimal type cannot map to pandas.

  2. 90% failure rate

    Float64 can only represent ~15-17 significant digits; high-precision decimals will be truncated or rounded, losing data.

  3. 75% failure rate

    Fastparquet has similar limitations and may silently convert decimals to float64, causing the same overflow.