data type_error ai_generated true

Parquet decimal precision overflow when reading into pandas — values truncated or converted to NaN

ID: data/parquet-decimal-precision-overflow-pandas


Fix Rate: 85%

Confidence: 88%

Evidence: 1

First Seen: 2024-02-20

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
pandas 2.2.0	active	—	—	—
pyarrow 15.0.0	active	—	—	—
Apache Parquet 1.13.0	active	—	—	—

Root Cause

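Parquet files can store decimals with high precision (e.g., DECIMAL(38,10)), but NumPy's float64 and int64, which back numeric columns in pandas, cannot represent that many digits. float64 holds only about 15-17 significant digits and int64 tops out near 9.2e18, so converting decimal columns to these types causes overflow or silent truncation.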
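The precision loss can be reproduced without any Parquet file. The following is a minimal sketch using only the standard library; the sample value is an illustrative assumption chosen to fit DECIMAL(38,10).

```python
from decimal import Decimal

# Illustrative value: 20 integer digits and 10 fractional digits,
# which fits DECIMAL(38,10).
exact = Decimal("12345678901234567890.1234567890")

# Casting to float64 keeps only about 15-17 significant digits.
as_float = float(exact)

# Converting back to Decimal exposes the value float64 actually stored.
round_tripped = Decimal(as_float)

print(exact)
print(round_tripped)
print(exact == round_tripped)  # False: the fractional part is gone
```

At this magnitude float64 can only store whole numbers, so the entire fractional part is lost along with the low-order integer digits.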

generic


Official Documentation

https://arrow.apache.org/docs/python/parquet.html#decimal-types

Workarounds

90% success Read the Parquet file with pyarrow directly and convert it with `Table.to_pandas()`. By default pyarrow converts decimal columns to Python `decimal.Decimal` objects stored in an object-dtype column, so no digits are lost. The `timestamp_as_object` and `date_as_object` options affect only timestamp and date columns and are not needed for decimals.
```
import pyarrow.parquet as pq

# Decimal columns arrive as object-dtype columns of decimal.Decimal values.
table = pq.read_table('data.parquet')
df = table.to_pandas()
```
85% success Use `pd.read_parquet(path, engine='pyarrow', dtype_backend='numpy_nullable')`, then convert each decimal column to `pd.StringDtype()` so that every digit is kept as text. String columns do not support arithmetic, so convert values back to `Decimal` before computing with them.
```
import pandas as pd

path = 'data.parquet'
df = pd.read_parquet(path, engine='pyarrow', dtype_backend='numpy_nullable')
# Store the decimal column as strings to preserve all digits.
df['dec_col'] = df['dec_col'].astype('string')
```
80% success If every value fits within 38 digits (the maximum precision of Arrow's `decimal128`), use `pd.read_parquet(path, engine='pyarrow', dtype_backend='pyarrow')`. This keeps decimal columns as `pd.ArrowDtype` and returns individual values as Python `Decimal` objects. The older `use_nullable_dtypes=True` flag is deprecated since pandas 2.0 and does not select `pd.ArrowDtype`.
```
import pandas as pd

# Decimal columns keep their Arrow decimal type via pd.ArrowDtype.
df = pd.read_parquet('data.parquet', engine='pyarrow', dtype_backend='pyarrow')
```


Dead Ends

Common approaches that don't work:

70% fail
This uses pyarrow for conversion but may still overflow if the decimal precision exceeds 38 digits or if pyarrow's default decimal type cannot be mapped to a pandas dtype.
90% fail
float64 can represent only about 15-17 significant digits; decimals with higher precision are truncated or rounded, losing data.
75% fail
Fastparquet has similar limitations and may silently convert decimals to float64, causing the same overflow.