data type_error ai_generated true

Parquet decimal precision overflow when reading into pandas — values truncated or converted to NaN

ID: data/parquet-decimal-precision-overflow-pandas

Fix Rate: 85%
Confidence: 88%
Evidence: 1
First Seen: 2024-02-20

Version Compatibility

| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| pandas 2.2.0 | active | | | |
| pyarrow 15.0.0 | active | | | |
| Apache Parquet 1.13.0 | active | | | |

Root Cause

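Parquet files can store decimals with high precision (e.g., DECIMAL(38,10)), but pandas' default numeric dtypes are NumPy's float64 and int64. float64 holds only about 15-17 significant digits and int64 tops out near 9.2 × 10^18, so any step that coerces a wide decimal column into one of these types overflows, silently rounds the value, or yields NaN.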
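The precision limit can be seen without any Parquet file. The following is a minimal sketch using only the standard library; the sample value is made up for illustration and relies on Python's `float` being an IEEE 754 float64, the same representation as NumPy's float64.

```python
from decimal import Decimal

# Hypothetical sample value with 29 significant digits, as a DECIMAL(38,10) column could hold.
exact = Decimal("1234567890123456789.0123456789")

# Python's float is a float64, so this mirrors what a float64 column would store.
as_float = float(exact)

# Decimal(float) exposes the exact binary value that was stored.
round_trip = Decimal(as_float)
lost = round_trip != exact  # True when digits were dropped

# The unscaled integer (all digits, ignoring the decimal point) is what an
# int64-backed representation would need to hold.
unscaled = int("".join(str(d) for d in exact.as_tuple().digits))
fits_int64 = unscaled <= 2**63 - 1

print("stored as float64:", round_trip)
print("precision lost:", lost)
print("fits in int64:", fits_int64)
```

Keeping the value as `Decimal` (or as text) avoids both problems, which is what the workarounds below rely on.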

generic


Official Documentation

https://arrow.apache.org/docs/python/parquet.html#decimal-types

Workarounds

  1. 90% success Read the Parquet file with pyarrow directly and convert it with `pq.read_table(path).to_pandas()`. By default pyarrow converts decimal columns to Python `decimal.Decimal` objects in an object-dtype column; the `timestamp_as_object` and `date_as_object` options affect only timestamp and date columns, not decimals. Example: `import pyarrow.parquet as pq; table = pq.read_table('data.parquet'); df = table.to_pandas()`
  2. 85% success Use `pd.read_parquet(path, engine='pyarrow', dtype_backend='numpy_nullable')`, then convert decimal columns to `pd.StringDtype()` so the exact digits survive as text: `df['dec_col'] = df['dec_col'].astype('string')`. Arithmetic on such a column requires parsing the strings back to `Decimal` first.
  3. 80% success If the values fit within 38 digits (the maximum precision of Arrow's decimal128 type), use `pd.read_parquet(path, engine='pyarrow', dtype_backend='pyarrow')`, which stores decimals as `pd.ArrowDtype` and returns individual elements as Python Decimal objects. The older `use_nullable_dtypes=True` flag has been deprecated since pandas 2.0 and does not by itself select Arrow-backed dtypes.

Dead Ends

Common approaches that don't work:

  1. 70% fail

    This uses pyarrow for conversion but may still overflow if the decimal precision exceeds 38 digits or if pyarrow's default decimal type cannot be mapped to a pandas dtype.

  2. 90% fail

    float64 can represent only about 15-17 significant digits; higher-precision decimals are rounded on conversion, so the trailing digits are lost.

  3. 75% fail

    fastparquet has similar limitations and may silently convert decimals to float64, causing the same precision loss.