# Parquet UINT64 column overflows when converted to signed INT64 in Spark or Arrow

- **ID:** `data/parquet-uint64-overflow-cast`
- **Domain:** data
- **Category:** type_error
- **Error code:** `org.apache.spark.sql.AnalysisException: Overflow in sum of UINT64`
- **Verification level:** ai_generated
- **Fix rate:** 75%

## Root cause

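The Parquet format supports a UINT64 logical type, but many readers and conversion paths in engines such as Spark and Arrow handle the column as signed INT64 and convert it silently. Any value greater than 2^63-1 (9223372036854775807) then overflows and shows up as a negative number.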
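The wraparound can be reproduced without any engine. The sketch below is illustrative only: it uses the standard-library `struct` module to reinterpret the same 8 bytes as unsigned and then as signed, and the helper name `reinterpret_uint64_as_int64` is hypothetical rather than part of Spark or Arrow.

```python
import struct

INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def reinterpret_uint64_as_int64(value: int) -> int:
    """Pack value as unsigned 64-bit, then read the same bytes back as signed."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


print(reinterpret_uint64_as_int64(INT64_MAX))      # 9223372036854775807
print(reinterpret_uint64_as_int64(INT64_MAX + 1))  # -9223372036854775808
print(reinterpret_uint64_as_int64(2**64 - 1))      # -1
```

Values up to 2^63-1 survive unchanged; everything above that boundary comes back negative, which is the symptom to look for in query results.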

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| Apache Parquet 2.8.0+ | active | — | — |
| Apache Spark 3.4.0 | active | — | — |
| Apache Arrow 12.0.0 | active | — | — |

## Solutions

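1. Read the UINT64 column as a string in Spark: `spark.read.parquet(path).withColumn("col", col("col").cast("string"))`. The cast cannot recover a value that has already wrapped to a negative number during the read, so check a known large value afterwards.
2. Use PyArrow to preserve the UINT64 values as binary, then convert each 8-byte value to a Python `int` with `struct.unpack`. The recipe as recorded is `pq.read_table(path, safe_cast=False)`; `safe_cast` is not a documented parameter of `pq.read_table`, so verify the call against your PyArrow version before relying on it.
3. Pre-process the data so that every UINT64 value fits within the INT64 range (0 to 2^63-1) before writing the Parquet file.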
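Solutions 2 and 3 can be sketched with the standard library alone. This is a minimal illustration, not a PyArrow or Spark API: the helpers `decode_uint64_le` and `split_by_int64_range` are hypothetical names, and the raw cells are assumed to be 8-byte little-endian values.

```python
import struct

INT64_MAX = 2**63 - 1  # upper bound of the signed 64-bit range


def decode_uint64_le(raw: bytes) -> int:
    """Decode 8 little-endian bytes as an unsigned 64-bit Python int."""
    if len(raw) != 8:
        raise ValueError(f"expected 8 bytes, got {len(raw)}")
    return struct.unpack("<Q", raw)[0]


def split_by_int64_range(values):
    """Return (safe, too_large): values that fit in INT64 and those that do not."""
    safe, too_large = [], []
    for v in values:
        (safe if 0 <= v <= INT64_MAX else too_large).append(v)
    return safe, too_large


# Sample cells standing in for a UINT64 column read as raw bytes (assumption).
raw_cells = [struct.pack("<Q", v) for v in (42, INT64_MAX, INT64_MAX + 1, 2**64 - 1)]
decoded = [decode_uint64_le(cell) for cell in raw_cells]

safe, too_large = split_by_int64_range(decoded)
print(safe)       # [42, 9223372036854775807]
print(too_large)  # [9223372036854775808, 18446744073709551615]
```

Python integers are arbitrary precision, so the decoded values stay exact. Anything that lands in `too_large` needs a string or decimal representation before it is written to a column that downstream engines will read as INT64.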

## Failed attempts

- **Casting to Decimal(38,0) to hold larger values:** Decimal(38,0) can hold values up to 10^38-1, but Spark caps decimal precision at 38 digits, so arithmetic may still overflow or lose precision, and converting back to a 64-bit integer reintroduces the problem. (60% failure rate)
- **Using Double type to avoid overflow:** Double cannot represent every integer exactly beyond 2^53, so large UINT64 values silently lose precision. (80% failure rate)
- **Disabling Parquet type promotion entirely:** This may cause schema compatibility errors for other columns and does not address the root issue of UINT64 handling. (50% failure rate)
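The precision loss behind the Double attempt is easy to confirm. The short check below relies only on Python's built-in `float`, which is an IEEE 754 double on common platforms; the variable names are illustrative.

```python
# Beyond 2**53 a double can no longer distinguish neighbouring integers.
exact = 2**53 + 1
print(float(exact) == float(2**53))  # True: two different integers collapse

# The largest UINT64 value rounds up to 2**64 when stored as a double.
largest_uint64 = 2**64 - 1
round_tripped = int(float(largest_uint64))
print(round_tripped - largest_uint64)  # 1
```

No error is raised in either case, which is why this failure tends to go unnoticed until identifiers or counters stop matching.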
