org.apache.spark.sql.AnalysisException: Overflow in sum of UINT64

Tags: data · type_error · ai_generated · partial

Parquet UINT64 column overflows when cast to signed INT64 in Spark or Arrow

ID: data/parquet-uint64-overflow-cast

Fix Rate: 75%

Confidence: 85%

Evidence: 1

First Seen: 2024-01-15

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
Apache Parquet 2.8.0+	active	—	—	—
Apache Spark 3.4.0	active	—	—	—
Apache Arrow 12.0.0	active	—	—	—

Root Cause

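Parquet supports a UINT64 logical type, stored as an unsigned annotation on the 64-bit INT64 physical type. Spark SQL has no unsigned 64-bit type, and although Arrow does have a native uint64, pipelines built on either engine often cast the column to signed INT64. Any value greater than 2^63-1 then overflows: depending on the engine and cast mode, it either wraps to a negative number or raises an error.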
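
The wraparound can be reproduced with plain Python, independent of Spark or Arrow. This is a minimal sketch of the arithmetic only: the helper name `reinterpret_uint64_as_int64` is hypothetical, and it reinterprets the same 8 bytes as signed, which is what a silent, unchecked cast amounts to.

```python
import struct

INT64_MAX = 2**63 - 1


def reinterpret_uint64_as_int64(value: int) -> int:
    """Pack value as an unsigned 64-bit integer, then read the same bytes as signed."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


# Values up to 2^63-1 survive unchanged; anything larger turns negative.
for v in (INT64_MAX, INT64_MAX + 1, 2**64 - 1):
    print(v, "->", reinterpret_uint64_as_int64(v))
```

Only values in the upper half of the UINT64 range are affected, which is why the problem often goes unnoticed until a large ID or hash shows up in the data.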

Official Documentation

https://spark.apache.org/docs/latest/sql-ref-datatypes.html

Workarounds

85% success Read the UINT64 column in Spark and cast it to string so the digits are carried as text rather than as a signed 64-bit number:
```
from pyspark.sql.functions import col
df = spark.read.parquet(path).withColumn("col", col("col").cast("string"))
```
90% success Read the file with PyArrow, which has a native uint64 type: pq.read_table(path) keeps the column unsigned, and to_pylist() returns exact Python ints. Avoid cast(pa.int64(), safe=False), which wraps out-of-range values silently.
```
pq.read_table(path).column("col").to_pylist()
```
70% success Pre-process the data so that every UINT64 value fits within the signed INT64 range (0 to 2^63-1) before writing Parquet.
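
One way to approach the pre-processing step is a range check before the write. The sketch below uses only the standard library; `split_by_int64_range` is a hypothetical helper, and what to do with the rejected values (drop, remap, or store as string) is left to the pipeline.

```python
INT64_MAX = 2**63 - 1


def split_by_int64_range(values):
    """Return (fits, too_large) for an iterable of non-negative integers."""
    fits, too_large = [], []
    for v in values:
        if v < 0:
            raise ValueError(f"not an unsigned value: {v}")
        (fits if v <= INT64_MAX else too_large).append(v)
    return fits, too_large


# Hypothetical batch: two values are safe, one needs separate handling.
fits, too_large = split_by_int64_range([42, INT64_MAX, 2**63 + 7])
```

Failing or routing on `too_large` at write time is cheaper than discovering negative IDs downstream.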

Dead Ends

Common approaches that don't work:

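Casting to Decimal(38,0) to hold larger values 60% fail
Decimal(38,0) can represent every UINT64 value (the maximum, 18446744073709551615, has 20 digits), but the cast does not repair values that were already narrowed to INT64, and converting the result back to a signed 64-bit type overflows again. Spark also caps decimal precision at 38 digits, so arithmetic on the column can still overflow.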
Using Double type to avoid overflow 80% fail
Double cannot represent all integers exactly beyond 2^53, causing silent precision loss for large UINT64 values.
Disabling Parquet type promotion entirely 50% fail
This may cause schema compatibility errors for other columns and does not address the root issue of UINT64 handling.
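
The Double dead end is easy to check with plain Python, whose `float` is an IEEE 754 double. This is a sketch of the arithmetic only, not of any Spark behavior; `decimal.Decimal` stands in for an exact decimal type.

```python
from decimal import Decimal

UINT64_MAX = 2**64 - 1

# A double has a 53-bit significand, so the largest UINT64 value is rounded.
as_double = float(UINT64_MAX)
round_tripped = int(as_double)
lost_precision = round_tripped != UINT64_MAX

# The first integer a double cannot represent exactly is 2^53 + 1.
first_inexact = 2**53 + 1
collapses = float(first_inexact) == float(2**53)

# An exact decimal keeps all 20 digits of the maximum value.
as_decimal = Decimal(UINT64_MAX)
digits = len(str(as_decimal))
```

The rounding is silent: no error is raised, and the result differs from the original only in the low-order digits.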