# Parquet INT96 timestamps written by Hive are read back with the wrong timezone offset

- **ID:** `data/parquet-int96-timestamp-timezone`
- **Domain:** data
- **Category:** data_error
- **Error code:** `No explicit error; timestamp values off by timezone offset`
- **Verification level:** ai_generated
- **Fix rate:** 80%

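## Root Cause

Hive writes INT96 timestamps normalized to UTC, but many readers (for example older Spark versions and Impala) assume the stored value is in the local time zone. No error is raised; the values read back are simply shifted by the reader's UTC offset, which is typically several hours.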
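
To make the mismatch concrete, here is a minimal sketch of how a 12-byte INT96 timestamp is commonly laid out (8 bytes of nanoseconds within the day, then 4 bytes of Julian day number, both little-endian) and what happens when a reader treats the stored wall clock as local time. The function names and the UTC+8 reader are illustrative assumptions, not part of any Hive, Spark, or Impala API.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_int96(instant: datetime) -> bytes:
    """Pack an aware datetime as INT96: nanos-of-day (int64) + Julian day (int32)."""
    delta = instant.astimezone(timezone.utc) - EPOCH
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1000
    return struct.pack("<qi", nanos_of_day, delta.days + JULIAN_DAY_OF_UNIX_EPOCH)


def decode_int96(raw: bytes) -> datetime:
    """Decode INT96 bytes, interpreting the stored value as UTC (the writer's intent)."""
    if len(raw) != 12:
        raise ValueError("an INT96 value must be exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1000)


def misread_as_local(raw: bytes, local_offset: timedelta) -> datetime:
    """Simulate a reader that takes the stored wall clock to be local time."""
    wall_clock = decode_int96(raw).replace(tzinfo=None)
    return wall_clock.replace(tzinfo=timezone(local_offset))


written = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
raw = encode_int96(written)
skew = decode_int96(raw) - misread_as_local(raw, timedelta(hours=8))
print(skew)  # the two interpretations differ by the reader's UTC offset
```

Comparing a handful of known timestamps this way is a quick check of whether a given reader is applying the local-time assumption.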

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| Apache Hive 3.1.3 | active | — | — |
| Apache Spark 3.2.0 | active | — | — |
| Apache Impala 4.0.0 | active | — | — |

## Solutions

1. Read the files in Spark with `spark.sql.parquet.int96TimestampConversion=true` and `spark.sql.session.timeZone=UTC` to force the correct conversion.
2. Rewrite the Parquet files with a tool such as Parquet-MR, using `Int96WriteSupport`, so that the timestamps are explicitly stored in UTC.
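
The two settings from the first solution could be applied in PySpark roughly as follows. This is a sketch under assumptions: the application name is hypothetical, and PySpark must be installed for `build_session` to run. Spark's documentation describes `spark.sql.parquet.int96TimestampConversion` as targeting INT96 data written by Impala, so it is worth verifying its effect on a sample of the affected files before relying on it.

```python
# Settings named in the first solution, kept in one place so they can be
# reused for spark-submit --conf arguments as well.
INT96_READ_CONF = {
    "spark.sql.parquet.int96TimestampConversion": "true",
    "spark.sql.session.timeZone": "UTC",
}


def build_session(app_name: str = "int96-timestamp-check"):
    """Create a SparkSession with the INT96 read settings applied.

    The app name is a hypothetical placeholder. PySpark is imported lazily
    so the configuration dict above is usable without Spark installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in INT96_READ_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

After creating the session, reading a file with a few known timestamps and comparing the results against the source system confirms whether the conversion took effect.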

## Failed Attempts

- **Setting only the Spark session timezone to UTC:** This aligns the reader's session time zone, but it does not change the underlying assumption that INT96 values are in local time, so the offset is still applied incorrectly. (60% failure rate)
- **Converting timestamps using date_add/date_sub with a fixed offset:** The offset varies by time zone and with daylight saving time, so a single fixed offset is wrong for many rows. (70% failure rate)
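
The problem with a fixed offset can be seen in a short sketch using Python's standard `zoneinfo` module. The zone `America/New_York` and the helper name are illustrative assumptions, and the example assumes the IANA time zone database is available on the system.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def utc_offset_on(year: int, month: int, day: int, zone_name: str) -> timedelta:
    """Return the UTC offset in effect at local noon on the given date."""
    local_noon = datetime(year, month, day, 12, tzinfo=ZoneInfo(zone_name))
    return local_noon.utcoffset()


zone = "America/New_York"  # illustrative zone that observes daylight saving time
winter_offset = utc_offset_on(2024, 1, 15, zone)
summer_offset = utc_offset_on(2024, 7, 15, zone)

# A correction hard-coded from the winter offset is off by the DST shift
# for every row that falls in summer.
fixed_correction = winter_offset
summer_error = summer_offset - fixed_correction

print(winter_offset, summer_offset, summer_error)
```

A correction that looks up the zone rules per row avoids this error, whereas a constant shift leaves part of the data off by the daylight saving difference.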
