Parquet UINT64 column overflows when cast to signed INT64 in Spark or Arrow
ID: data/parquet-uint64-overflow-cast
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Parquet 2.8.0+ | active | — | — | — |
| Apache Spark 3.4.0 | active | — | — | — |
| Apache Arrow 12.0.0 | active | — | — | — |
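Root cause analysis
The Parquet format supports a UINT64 logical type (an unsigned annotation on the INT64 physical type), but consumers often end up with a signed INT64. Spark SQL has no unsigned 64-bit type, and Arrow pipelines frequently cast to int64 downstream even though Arrow itself defines a uint64 type. Values greater than 2^63-1 then overflow, and an unchecked cast wraps them to negative numbers without raising an error.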
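To make the wrap-around concrete, the sketch below uses only the Python standard library to reinterpret the 8 bytes of an unsigned value as a signed 64-bit integer, which is what an unchecked UINT64-to-INT64 cast amounts to. The helper name reinterpret_as_int64 is hypothetical and is not part of Spark or Arrow.

```python
import struct

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def reinterpret_as_int64(value: int) -> int:
    """Reinterpret the bits of an unsigned 64-bit value as a signed one."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError("value is outside the UINT64 range")
    # Pack as 8 unsigned little-endian bytes, then unpack the same bytes as signed.
    return struct.unpack("<q", struct.pack("<Q", value))[0]


print(reinterpret_as_int64(INT64_MAX))      # still fits, value is unchanged
print(reinterpret_as_int64(INT64_MAX + 1))  # wraps to the most negative INT64
print(reinterpret_as_int64(UINT64_MAX))     # wraps to -1
```

Any value above 2^63-1 comes back negative, which is why the overflow often goes unnoticed until IDs or counters are compared downstream.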
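Official documentation
https://spark.apache.org/docs/latest/sql-ref-datatypes.html
Solutions
- Read the UINT64 column as a string in Spark: spark.read.parquet(path).withColumn("col", col("col").cast("string")). The cast preserves the value only if the column has not already been narrowed to a signed 64-bit type when it was read.
- Read the file with PyArrow and keep the column as uint64 rather than casting it to int64. Note that pq.read_table(path) does not appear to accept a safe_cast parameter; the safe flag belongs to the cast step (for example, Array.cast(target_type, safe=True) raises on out-of-range values instead of wrapping). If only raw 8-byte values are available, convert them to Python int via struct.unpack.
- Pre-process the data so that every UINT64 value fits within the INT64 range (0 to 2^63-1) before writing Parquet.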
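The pre-processing check and the struct.unpack conversion mentioned above can be sketched with the standard library alone. This is a minimal illustration, not a Spark or PyArrow API: the helper names split_by_int64_range and decode_uint64_le are hypothetical, and the little-endian byte order is an assumption that matches Parquet's plain encoding of 64-bit integers.

```python
import struct

INT64_MAX = 2**63 - 1


def split_by_int64_range(values):
    """Separate values that fit in INT64 from those that would overflow."""
    in_range, out_of_range = [], []
    for v in values:
        (in_range if 0 <= v <= INT64_MAX else out_of_range).append(v)
    return in_range, out_of_range


def decode_uint64_le(raw: bytes) -> int:
    """Decode 8 little-endian bytes as an unsigned 64-bit Python int."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack("<Q", raw)[0]


ids = [1, INT64_MAX, INT64_MAX + 1, 2**64 - 1]
in_range, out_of_range = split_by_int64_range(ids)
print(in_range)       # safe to store as signed INT64
print(out_of_range)   # need a wider representation, e.g. string

# A string round trip is lossless because Python ints are arbitrary precision.
assert int(str(2**64 - 1)) == 2**64 - 1
print(decode_uint64_le(b"\xff" * 8))
```

Running the range check before the write makes the failure explicit: out-of-range values can be rejected or routed to a string column instead of wrapping silently.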
Ineffective attempts
Common but ineffective approaches:
- Casting to Decimal(38,0) to hold larger values (60% failure rate). Decimal(38,0) can hold values up to 10^38-1, which covers the full UINT64 range, but 38 digits is Spark's maximum decimal precision, so arithmetic on the column may still overflow, and converting the result back to a 64-bit integer reintroduces the original overflow.
- Using the Double type to avoid overflow (80% failure rate). A double cannot represent every integer exactly beyond 2^53, so large UINT64 values silently lose precision.
- Disabling Parquet type promotion entirely (50% failure rate). This can cause schema compatibility errors for other columns and does not address the root issue of UINT64 handling.
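The precision argument against Double can be checked directly in Python, whose float is an IEEE 754 double. The sketch below is only an illustration of the arithmetic and uses decimal.Decimal from the standard library as a stand-in for an exact decimal type; it does not model Spark's DecimalType behavior.

```python
from decimal import Decimal

UINT64_MAX = 2**64 - 1

# Converting to a double rounds to the nearest representable value.
as_double = float(UINT64_MAX)
print(int(as_double) == UINT64_MAX)   # False: the value was rounded

# 2**53 + 1 is the first positive integer a double cannot represent exactly.
first_inexact = 2**53 + 1
print(float(first_inexact) == float(2**53))   # True: two integers collapse

# An exact decimal keeps all 20 digits of the largest UINT64 value.
as_decimal = Decimal(UINT64_MAX)
print(int(as_decimal) == UINT64_MAX)  # True
print(len(str(UINT64_MAX)))           # 20
```

The largest UINT64 value has 20 decimal digits, so an exact decimal representation only helps as long as the value is never converted back to a signed 64-bit integer.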