org.apache.spark.sql.AnalysisException: Overflow in sum of UINT64
data
type_error
ai_generated
partial
Parquet UINT64 column overflows when cast to signed INT64 in Spark or Arrow
ID: data/parquet-uint64-overflow-cast
Fix Rate: 75%
Confidence: 85%
Evidence: 1
First Seen: 2024-01-15
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Parquet 2.8.0+ | active | — | — | — |
| Apache Spark 3.4.0 | active | — | — | — |
| Apache Arrow 12.0.0 | active | — | — | — |
Root Cause
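The Parquet format defines a UINT64 logical type, but engines and read paths that lack native unsigned 64-bit support (reported here for Spark and Arrow) can silently map the column to signed INT64. Any value greater than 2^63-1 then overflows and appears as a negative number.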
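The overflow can be illustrated without any Parquet tooling. The sketch below uses only the Python standard library and assumes the engine reinterprets the same 8 bytes as a signed value; the helper name `reinterpret_uint64_as_int64` is hypothetical and exists only for this illustration.

```python
import struct

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def reinterpret_uint64_as_int64(value: int) -> int:
    """Pack value as unsigned 64-bit, then read the same 8 bytes as signed 64-bit."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError(f"{value} is outside the UINT64 range")
    return struct.unpack("<q", struct.pack("<Q", value))[0]


# Values up to INT64_MAX survive; anything larger wraps to a negative number.
samples = [0, INT64_MAX, INT64_MAX + 1, UINT64_MAX]
reinterpreted = {v: reinterpret_uint64_as_int64(v) for v in samples}

for original, signed in reinterpreted.items():
    print(f"{original:>20} -> {signed}")
```

Only the last two samples change, which is why the problem tends to go unnoticed until a column actually contains values above 2^63-1.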
Official Documentation
https://spark.apache.org/docs/latest/sql-ref-datatypes.html
Workarounds
-
85% success: Read the UINT64 column as a string in Spark: spark.read.parquet(path).withColumn("col", col("col").cast("string"))
-
90% success: Read the file with PyArrow, which has a native uint64 type, so pq.read_table(path) can keep the column unsigned. Then convert the values to Python int (for example with to_pylist()); Python integers are arbitrary precision and hold the full UINT64 range. Note that safe_cast is not a documented parameter of pq.read_table.
-
70% success: Pre-process the data so that every UINT64 value fits within the INT64 range (at most 2^63-1) before writing Parquet.
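Two of the workarounds above reduce to simple value-level checks. The following is a minimal sketch in plain Python, not Spark or PyArrow code: it assumes the column values are already available as Python integers, and the helper names `partition_by_int64_range`, `to_decimal_strings`, and `from_decimal_strings` are hypothetical.

```python
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def partition_by_int64_range(values):
    """Split UINT64 values into those that fit in INT64 and those that do not."""
    fits, too_large = [], []
    for v in values:
        if not 0 <= v <= UINT64_MAX:
            raise ValueError(f"{v} is outside the UINT64 range")
        (fits if v <= INT64_MAX else too_large).append(v)
    return fits, too_large


def to_decimal_strings(values):
    """Encode each value as a base-10 string, which has no width limit."""
    return [str(v) for v in values]


def from_decimal_strings(strings):
    """Decode base-10 strings back into arbitrary-precision Python ints."""
    return [int(s) for s in strings]


ids = [42, INT64_MAX, INT64_MAX + 1, UINT64_MAX]

# Pre-write check: anything in too_large would overflow a signed column.
fits, too_large = partition_by_int64_range(ids)

# String round trip: no value is altered, at the cost of losing numeric ops.
round_tripped = from_decimal_strings(to_decimal_strings(ids))
```

The pre-write check is useful as a guard even when another workaround is in place, since it identifies exactly which rows would be affected.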
Dead Ends
Common approaches that don't work:
-
Casting to Decimal(38,0) to hold larger values
60% fail
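Decimal(38,0) holds values up to 10^38-1, which covers the full UINT64 range, but 38 digits is Spark's maximum decimal precision, so arithmetic on the column may still overflow, and converting the result back to a 64-bit integer type reintroduces the original overflow.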
-
Using Double type to avoid overflow
80% fail
Double cannot represent all integers exactly beyond 2^53, causing silent precision loss for large UINT64 values.
-
Disabling Parquet type promotion entirely
50% fail
This may cause schema compatibility errors for other columns and does not address the root issue of UINT64 handling.
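The precision loss described for the Double dead end can be checked directly. This sketch relies on Python's `float` being an IEEE 754 double, which is the case on standard CPython builds; `survives_double` is a hypothetical helper name.

```python
UINT64_MAX = 2**64 - 1
EXACT_LIMIT = 2**53  # Doubles carry a 53-bit significand.


def survives_double(value: int) -> bool:
    """Return True if value is unchanged after a round trip through a double."""
    return int(float(value)) == value


# The largest UINT64 value rounds up to 2**64 when stored as a double.
as_double = float(UINT64_MAX)

print(survives_double(EXACT_LIMIT))      # exactly representable
print(survives_double(EXACT_LIMIT + 1))  # rounds to a neighboring value
print(int(as_double) - UINT64_MAX)       # size of the silent error
```

No exception is raised at any point, which is what makes this approach risky for identifiers and counters stored as UINT64.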