org.apache.spark.sql.AnalysisException: Overflow in sum of UINT64
Tags: data, type_error, ai_generated, partial

Parquet UINT64 column overflows when cast to signed INT64 in Spark or Arrow

ID: data/parquet-uint64-overflow-cast

Fix Rate: 75%
Confidence: 85%
Evidence: 1
First Seen: 2024-01-15

Version Compatibility

Version                 Status   Introduced   Deprecated   Notes
Apache Parquet 2.8.0+   active   -            -            -
Apache Spark 3.4.0      active   -            -            -
Apache Arrow 12.0.0     active   -            -            -

Root Cause

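Parquet supports a UINT64 logical type (an annotation on the physical INT64 type), but Spark SQL has no unsigned 64-bit integer type, and an Arrow uint64 column that is cast to signed INT64 hits the same limit. Any value greater than 2^63-1 overflows in that conversion: an unchecked cast wraps it to a negative number without raising an error.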
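
The wrap-around can be reproduced without Spark or Arrow. The following is a minimal standard-library sketch, not either engine's actual code path; the helper names `reinterpret_as_int64` and `recover_uint64` are hypothetical, and little-endian byte order is an assumption made for illustration.

```python
import struct

def reinterpret_as_int64(unsigned_value: int) -> int:
    """Read the 8 bytes of a UINT64 value as if they were a signed INT64."""
    raw = struct.pack("<Q", unsigned_value)     # encode as unsigned 64-bit
    (signed_value,) = struct.unpack("<q", raw)  # decode the same bytes as signed
    return signed_value

def recover_uint64(signed_value: int) -> int:
    """Undo the wrap-around by masking to the low 64 bits."""
    return signed_value & 0xFFFFFFFFFFFFFFFF

print(reinterpret_as_int64(2**63 - 1))  # 9223372036854775807 (still fits)
print(reinterpret_as_int64(2**63))      # -9223372036854775808 (wrapped)
print(reinterpret_as_int64(2**64 - 1))  # -1 (wrapped)
print(recover_uint64(-1))               # 18446744073709551615
```

Because the bit pattern is preserved, a value that has already wrapped to a negative INT64 can still be mapped back to its unsigned original, provided no arithmetic was applied to it in between.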

Official Documentation

https://spark.apache.org/docs/latest/sql-ref-datatypes.html

Workarounds

  1. (85% success) Read the UINT64 column as a string in Spark: spark.read.parquet(path).withColumn("col", col("col").cast("string"))
  2. (90% success) Read the file with PyArrow, which keeps the column as Arrow uint64: pq.read_table(path). Do not cast the column to int64 afterwards: a checked cast raises on out-of-range values and cast(..., safe=False) wraps them. Convert to Python int (for example with to_pylist()) when exact values are needed; if the values arrive as raw 8-byte buffers instead, decode them as unsigned with struct.unpack.
  3. (70% success) Pre-process the data so that every UINT64 value fits within the INT64 range before writing Parquet.
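
The range check behind the pre-processing workaround, and the decoding of raw 8-byte values as unsigned with struct.unpack, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the helper names `fits_in_int64`, `partition_by_int64_range`, and `decode_uint64_le` are hypothetical, the byte order is assumed to be little-endian, and no Parquet, Spark, or PyArrow API is involved.

```python
import struct

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1

def fits_in_int64(value: int) -> bool:
    """Return True if an unsigned 64-bit value survives a cast to signed INT64."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError(f"{value} is not a valid UINT64")
    return value <= INT64_MAX

def partition_by_int64_range(values):
    """Split values into (safe, unsafe) lists before they are written."""
    safe = [v for v in values if fits_in_int64(v)]
    unsafe = [v for v in values if not fits_in_int64(v)]
    return safe, unsafe

def decode_uint64_le(raw: bytes) -> int:
    """Decode exactly 8 little-endian bytes as an unsigned 64-bit integer."""
    (value,) = struct.unpack("<Q", raw)
    return value

safe, unsafe = partition_by_int64_range([0, INT64_MAX, 2**63, UINT64_MAX])
print(safe)    # values that can be stored as INT64 unchanged
print(unsafe)  # values that need a string or other wider representation
```

Rows that land in the unsafe list are the ones that need the string representation from the first workaround, or some other wider type, rather than a plain INT64 column.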

Dead Ends

Common approaches that don't work:

  1. Casting to Decimal(38,0) to hold larger values (60% fail)

    Decimal(38,0) can hold values up to 10^38-1, which covers the full UINT64 range, but Spark caps decimal precision at 38 digits. Arithmetic on the column may still overflow, and converting the result back to a 64-bit integer type reintroduces the original overflow.

  2. Using Double type to avoid overflow (80% fail)

    Double cannot represent every integer exactly beyond 2^53, so large UINT64 values silently lose precision.

  3. Disabling Parquet type promotion entirely (50% fail)

    This may cause schema compatibility errors for other columns and does not address the root cause, which is how UINT64 values are handled.
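
The limits behind the first two dead ends can be checked with plain integer arithmetic. The sketch below models only the numeric ranges involved, not Spark's own decimal or double handling; the helper names `survives_double_round_trip` and `fits_decimal38` are hypothetical.

```python
UINT64_MAX = 2**64 - 1      # 18446744073709551615, a 20-digit number
DECIMAL38_MAX = 10**38 - 1  # largest magnitude a Decimal(38,0) can hold

def survives_double_round_trip(value: int) -> bool:
    """True if converting to a 64-bit float and back returns the same integer."""
    return int(float(value)) == value

def fits_decimal38(value: int) -> bool:
    """True if the integer has at most 38 decimal digits."""
    return abs(value) <= DECIMAL38_MAX

print(survives_double_round_trip(2**53))        # True
print(survives_double_round_trip(2**53 + 1))    # False: rounds to 2**53
print(survives_double_round_trip(UINT64_MAX))   # False: rounds to 2**64
print(fits_decimal38(UINT64_MAX))               # True: a single value fits
print(fits_decimal38(UINT64_MAX * UINT64_MAX))  # False: a product needs 39 digits
```

A single UINT64 value fits comfortably in 38 digits, so the decimal route only breaks down once arithmetic pushes results past that precision or the data is converted back to a 64-bit integer.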