org.apache.spark.sql.AnalysisException: Overflow in sum of UINT64 data type_error ai_generated partial

Parquet UINT64 column overflows when cast to signed INT64 in Spark or Arrow

ID: data/parquet-uint64-overflow-cast

Fix rate: 75%

Confidence: 85%

Evidence count: 1

First seen: 2024-01-15

Version compatibility

Version	Status	Introduced	Deprecated	Notes
Apache Parquet 2.8.0+	active	—	—	—
Apache Spark 3.4.0	active	—	—	—
Apache Arrow 12.0.0	active	—	—	—

Root cause analysis

Parquet supports a UINT64 logical type, which annotates the INT64 physical type as unsigned. A reader that has no native unsigned 64-bit type has to map the column to something else. When a reader or an explicit cast maps it to signed INT64, as can happen in Spark or Arrow pipelines, every value greater than 2^63-1 overflows, often silently.

generic
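
The overflow described in the root cause can be reproduced without Spark or Arrow. The following is a minimal sketch in plain Python, assuming the reader reinterprets the stored 64 bits as a signed integer; the helper name `reinterpret_as_int64` is hypothetical and is not part of any library.

```python
import struct

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def reinterpret_as_int64(value: int) -> int:
    """Return the signed 64-bit reading of an unsigned 64-bit value's bits."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError("value is outside the UINT64 range")
    # Pack as unsigned little-endian 64-bit, then unpack the same bytes as signed.
    (signed,) = struct.unpack("<q", struct.pack("<Q", value))
    return signed


print(reinterpret_as_int64(INT64_MAX))      # unchanged: still fits in INT64
print(reinterpret_as_int64(INT64_MAX + 1))  # wraps to the most negative INT64
print(reinterpret_as_int64(UINT64_MAX))     # wraps to -1
```

Values up to 2^63-1 pass through unchanged, which is why the problem tends to surface only once a column starts holding very large identifiers or counters.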

Official documentation

https://spark.apache.org/docs/latest/sql-ref-datatypes.html

Solutions

Read UINT64 as String type in Spark: spark.read.parquet(path).withColumn("col", col("col").cast("string"))

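Read the file with PyArrow, which has a native uint64 type and keeps the values intact: pq.read_table(path). If the values were already stored or read as signed INT64, reinterpret the bits as unsigned and convert them to Python int via struct.unpack.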

Pre-process data to ensure UINT64 values fit within INT64 range before writing Parquet.
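
The bit reinterpretation and the pre-write range check mentioned in the solutions above can be sketched in plain Python. This is an illustration under stated assumptions, not Spark or PyArrow code: the helper names `int64_bits_to_uint64` and `values_exceeding_int64` are hypothetical, and the sketch assumes the wrapped values are available as ordinary Python integers.

```python
import struct

INT64_MAX = 2**63 - 1


def int64_bits_to_uint64(signed: int) -> int:
    """Reinterpret a signed 64-bit integer's bits as an unsigned 64-bit integer."""
    # struct.pack raises struct.error if `signed` is outside the INT64 range.
    (unsigned,) = struct.unpack("<Q", struct.pack("<q", signed))
    return unsigned


def values_exceeding_int64(values):
    """Return the values that would overflow a signed INT64 column."""
    return [v for v in values if v > INT64_MAX]


wrapped = -1  # what a signed reader shows for the largest UINT64 value
restored = int64_bits_to_uint64(wrapped)
print(restored)       # 18446744073709551615
print(str(restored))  # the string form keeps every digit

batch = [0, 42, INT64_MAX, INT64_MAX + 1]
print(values_exceeding_int64(batch))  # [9223372036854775808]
```

Running the range check before writing makes the decision explicit: either the offending values are rejected, or the column is written as a string or decimal instead of UINT64.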

Ineffective attempts

Common but ineffective approaches:

Casting to Decimal(38,0) to hold larger values (60% failure rate)
Decimal(38,0) can hold values up to 10^38-1, but Spark caps decimal precision at 38 digits, so arithmetic may still overflow or lose precision, as may converting the result back to an integer type.
Using Double type to avoid overflow (80% failure rate)
Double cannot represent every integer exactly beyond 2^53, which causes silent precision loss for large UINT64 values.
Disabling Parquet type promotion entirely (50% failure rate)
This may cause schema compatibility errors for other columns and does not address the root issue of UINT64 handling.
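
The precision limits behind the Double and Decimal entries above can be checked with plain Python arithmetic. The sketch below does not model Spark's decimal arithmetic; it only shows which representations can hold a large UINT64 value exactly. The helper name `survives_double_round_trip` is hypothetical.

```python
from decimal import Decimal

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def survives_double_round_trip(n: int) -> bool:
    """Return True if n is unchanged after conversion to a double and back."""
    return int(float(n)) == n


# A double has a 53-bit significand, so 2**53 + 1 is the first positive
# integer that it cannot represent exactly.
print(survives_double_round_trip(2**53))       # True
print(survives_double_round_trip(2**53 + 1))   # False
print(survives_double_round_trip(UINT64_MAX))  # False

# The largest UINT64 value has 20 decimal digits, so an exact decimal needs
# a precision of at least 20. Holding the value is not the problem.
print(len(str(UINT64_MAX)))                    # 20
print(int(Decimal(UINT64_MAX)) == UINT64_MAX)  # True

# Converting back to a signed 64-bit type is where the overflow returns.
print(UINT64_MAX > INT64_MAX)                  # True
```

The Double failure is silent because the conversion succeeds and returns a nearby value, whereas a decimal keeps the digits but only postpones the overflow until something converts the column back to INT64.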