org.apache.spark.sql.AnalysisException: Overflow in sum of UINT64 data
Tags: type_error, ai_generated, partial

Parquet UINT64 column overflows when cast to signed INT64 in Spark or Arrow

ID: data/parquet-uint64-overflow-cast

Fix rate: 75%
Confidence: 85%
Evidence count: 1
First seen: 2024-01-15

Version compatibility

Version                  Status   Introduced   Deprecated   Notes
Apache Parquet 2.8.0+    active   -            -            -
Apache Spark 3.4.0       active   -            -            -
Apache Arrow 12.0.0      active   -            -            -

Root cause analysis

Parquet defines a UINT64 logical type, which is stored physically as a 64-bit INT64. Engines that do not expose an unsigned 64-bit type (Spark SQL has none), or pipelines that cast the column to signed INT64 (for example an unchecked Arrow cast), reinterpret those 64 bits as a signed value, so any value greater than 2^63-1 overflows, often silently.
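The wrap-around described above can be reproduced without Spark or Arrow. The following is a minimal sketch using only the Python standard library; the helper name reinterpret_as_int64 is hypothetical and only mimics what an unchecked UINT64 to INT64 conversion does to the underlying 8 bytes.

```python
import struct

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def reinterpret_as_int64(value: int) -> int:
    """Pack value as an unsigned 64-bit integer, then read the same bytes as signed."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


# Values up to 2^63-1 are unchanged; anything larger wraps to a negative number.
for v in (INT64_MAX, INT64_MAX + 1, UINT64_MAX):
    print(v, "->", reinterpret_as_int64(v))
```

Only values above 2^63-1 are affected, which is why the problem often goes unnoticed until large identifiers or hashes appear in the data.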

Official documentation

https://spark.apache.org/docs/latest/sql-ref-datatypes.html

Solutions

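  1. Read the UINT64 column as a string in Spark so that no later numeric conversion can wrap: spark.read.parquet(path).withColumn("col", col("col").cast("string")). Check the schema Spark inferred for the column first, because a cast applied after an overflow cannot restore the original value.
  2. Read the file with PyArrow, which has a native uint64 type: pq.read_table(path). Note that read_table does not accept a safe_cast argument; the relevant switch is the safe parameter of Table.cast / Array.cast, and casting uint64 to int64 with safe=False wraps silently. Keep the column as uint64, or convert it to Python int with to_pylist().
  3. Pre-process the data so that every UINT64 value fits within the INT64 range (0 to 2^63-1) before writing Parquet.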
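The pre-write check in solution 3 and the string fallback in solution 1 can be sketched in plain Python. This is an illustrative sketch only: find_out_of_range and to_lossless_strings are hypothetical helper names, and the input is assumed to be a list of Python integers collected before the Parquet file is written.

```python
INT64_MAX = 2**63 - 1


def find_out_of_range(values):
    """Return (index, value) pairs that would not survive a cast to signed INT64."""
    return [(i, v) for i, v in enumerate(values) if not 0 <= v <= INT64_MAX]


def to_lossless_strings(values):
    """Fallback: keep every value exact by storing it as a decimal string."""
    return [str(v) for v in values]


values = [0, 2**63 - 1, 2**63, 2**64 - 1]
offenders = find_out_of_range(values)
if offenders:
    # Either reject the batch or switch the column to a string representation.
    print("values outside INT64 range:", offenders)
    print("string fallback:", to_lossless_strings(values))
```

Rejecting the batch keeps the column numeric for downstream consumers, while the string fallback preserves every value at the cost of numeric operations.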

Ineffective attempts

Common approaches that do not work:

  1. Casting to Decimal(38,0) to hold larger values (60% failure rate)

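    Decimal(38,0) can hold values up to 10^38-1, which covers the UINT64 range, but 38 digits is Spark's maximum decimal precision, so arithmetic on the column may still overflow, and converting back to a 64-bit integer type can overflow or lose precision again.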

  2. Using Double type to avoid overflow (80% failure rate)

    Double cannot represent all integers exactly beyond 2^53, causing silent precision loss for large UINT64 values.

  3. Disabling Parquet type promotion entirely (50% failure rate)

    This may cause schema compatibility errors for other columns and does not address the root issue of UINT64 handling.
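The precision loss described for the Double approach in item 2 can be checked with Python's float, which is an IEEE 754 double. This is a minimal sketch; survives_double_round_trip is a hypothetical helper name.

```python
def survives_double_round_trip(value: int) -> bool:
    """True if converting value to a 64-bit float and back returns the same integer."""
    return int(float(value)) == value


# 2^53 is still exact; 2^53 + 1 and the largest UINT64 value are not.
for v in (2**53, 2**53 + 1, 2**64 - 1):
    print(v, survives_double_round_trip(v))
```

No error is raised for the inexact cases, which matches the silent nature of the failure.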