ArrowNotImplementedError
data
type_error
ai_generated
true
Parquet decimal precision overflow when reading into pandas
ID: data/parquet-decimal-overflow
Fix rate: 85%
Confidence: 83%
Evidence count: 1
First seen: 2024-01-10
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| pyarrow 12.0.0 | active | — | — | — |
| pyarrow 14.0.1 | active | — | — | — |
| pandas 2.2.0 | active | — | — | — |
Root cause analysis
Parquet files store decimals at a declared precision and scale (e.g., decimal(38,10)), but pandas converts them to float64 by default. A float64 holds only about 15 to 17 significant decimal digits, so values that need more digits overflow or lose precision.
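The precision limit can be checked without any Parquet file. The following is a minimal sketch using only the standard library; the helper name `survives_float64` and the sample values are hypothetical, chosen to fit decimal(38,10).

```python
from decimal import Decimal


def survives_float64(value: Decimal) -> bool:
    """Return True if value is numerically unchanged after a float64 round trip."""
    # repr() of a float yields the shortest string that parses back to the same float,
    # so comparing it with the original Decimal reveals any digits that were dropped.
    return Decimal(repr(float(value))) == value


# Hypothetical sample values, both representable as decimal(38,10).
small = Decimal("1234.5000000000")
large = Decimal("1234567890123456789012345678.0123456789")

small_ok = survives_float64(small)
large_ok = survives_float64(large)
print("small survives:", small_ok)
print("large survives:", large_ok)
```

A check like this can help decide whether a column is safe to hold as float64 or needs a decimal-preserving dtype.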
Official documentation
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
Solutions
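- Read with pyarrow and map the decimal type to an Arrow-backed pandas dtype (`types_mapper` expects a callable, so pass the dict's `.get` method): `import pandas as pd; import pyarrow as pa; import pyarrow.parquet as pq; table = pq.read_table('data.parquet'); df = table.to_pandas(types_mapper={pa.decimal128(38, 10): pd.ArrowDtype(pa.decimal128(38, 10))}.get)`
- Use pandas read_parquet with dtype_backend='pyarrow' (requires pandas 2.0 or later): `import pandas as pd; df = pd.read_parquet('data.parquet', dtype_backend='pyarrow')`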
Ineffective attempts
Common approaches that do not work:
- 70% failure rate: This only preserves pandas-specific metadata like index names; it does not change the decimal-to-float conversion behavior.
- 85% failure rate: The overflow already occurred during reading; the string representation will show the truncated/rounded value.
- 75% failure rate: Fastparquet has the same limitation; it also converts decimals to float64 by default.
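The point about string conversion can be illustrated without pandas. In this minimal sketch (the sample value is hypothetical), the float conversion stands in for the lossy read, and formatting the float afterwards does not bring the original digits back.

```python
from decimal import Decimal

# Hypothetical value that fits decimal(38,10) but not float64.
original = Decimal("1234567890123456789012345678.0123456789")

# Stand-in for the lossy read: the value is already a float64 at this point.
as_float = float(original)

# Converting to text afterwards only formats the rounded float.
recovered_text = format(as_float, ".10f")
digits_recovered = recovered_text == str(original)

print("original: ", original)
print("recovered:", recovered_text)
print("digits recovered:", digits_recovered)
```

The fix therefore has to happen at read time, before any float64 value is produced.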