Parquet decimal precision overflow when reading into pandas — values truncated or converted to NaN
ID: data/parquet-decimal-precision-overflow-pandas
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| pandas 2.2.0 | active | — | — | — |
| pyarrow 15.0.0 | active | — | — | — |
| Apache Parquet 1.13.0 | active | — | — | — |
Root Cause
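Parquet files can store decimals with high precision (e.g., DECIMAL(38,10)). When such a column ends up as a NumPy-backed float64 or int64 column in pandas, it cannot be represented exactly: float64 holds only about 15-17 significant digits and int64 has a fixed range, so values overflow or are silently rounded.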
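The precision loss can be reproduced without Parquet at all. The sketch below uses only the standard library; the helper name `survives_float64` and the sample value are illustrative and not taken from any library.

```python
from decimal import Decimal


def survives_float64(value: Decimal) -> bool:
    """Return True if value is unchanged after a round trip through float64."""
    return Decimal(float(value)) == value


# A value that fits DECIMAL(38,10): 28 integer digits and 10 fractional digits.
original = Decimal("1234567890123456789012345678.0123456789")

print(survives_float64(original))        # False: float64 cannot hold 38 digits
print(survives_float64(Decimal("0.5")))  # True: 0.5 is exactly representable
```

Running the same check against real column values is a quick way to judge whether a float64 conversion would be lossy for a given dataset.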
Official Documentation
https://arrow.apache.org/docs/python/parquet.html#decimal-types
Workarounds
- 90% success: Read the Parquet file with pyarrow directly and convert it with `pq.read_table(path).to_pandas()`. By default pyarrow converts decimal columns to object columns of Python `decimal.Decimal` values; the `timestamp_as_object` and `date_as_object` options affect only timestamp and date columns and are not needed for decimals. Example: `import pyarrow.parquet as pq; table = pq.read_table('data.parquet'); df = table.to_pandas()`
- 85% success: Use `pd.read_parquet(path, engine='pyarrow', dtype_backend='numpy_nullable')`, then convert the decimal columns to `pd.StringDtype()` so the digits are kept as text: `df['dec_col'] = df['dec_col'].astype('string')`
- 80% success: If the data fits within 38 digits, use `pd.read_parquet(path, engine='pyarrow', dtype_backend='pyarrow')`, which returns decimal columns as `pd.ArrowDtype` backed by Arrow `decimal128` and preserves precision. The older `use_nullable_dtypes=True` argument is deprecated since pandas 2.0 and selects NumPy-backed nullable dtypes rather than `pd.ArrowDtype`.
Dead Ends
Common approaches that don't work:
- 70% fail: This uses pyarrow for conversion but still may overflow if the decimal precision exceeds 38 digits or if pyarrow's default decimal type cannot map to pandas.
- 90% fail: Float64 can only represent ~15-17 significant digits; high-precision decimals will be truncated or rounded, losing data.
- 75% fail: Fastparquet has similar limitations and may silently convert decimals to float64, causing the same overflow.