No explicit error; timestamp values off by timezone offset
data
data_error
ai_generated
partial
Parquet INT96 timestamps written by Hive are read with an incorrect timezone offset
ID: data/parquet-int96-timestamp-timezone
Fix rate: 80%
Confidence: 86%
Evidence count: 1
First seen: 2023-09-05
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Hive 3.1.3 | active | — | — | — |
| Apache Spark 3.2.0 | active | — | — | — |
| Apache Impala 4.0.0 | active | — | — | — |
Root cause analysis
Hive normalizes INT96 timestamps to UTC when writing Parquet, but some readers (e.g., older Spark, Impala) interpret the stored value as local time. The values they return are therefore shifted by the reader's timezone offset, typically by a whole number of hours, and no error is raised.
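The mismatch can be reproduced without a cluster. The sketch below assumes the commonly used INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian). The function names and the UTC-7 reader zone are hypothetical and chosen only for illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01


def encode_int96(ts: datetime) -> bytes:
    """Pack a timezone-aware datetime into the 12-byte INT96 layout (as UTC)."""
    delta = ts - datetime(1970, 1, 1, tzinfo=timezone.utc)
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    return struct.pack("<qi", nanos_of_day, JULIAN_DAY_OF_UNIX_EPOCH + delta.days)


def decode_int96(raw: bytes) -> datetime:
    """Return the naive wall-clock value stored in an INT96 cell."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return datetime(1970, 1, 1) + timedelta(days=days, microseconds=nanos_of_day // 1_000)


# The writer stores the instant normalized to UTC.
written = datetime(2023, 9, 5, 12, 0, tzinfo=timezone.utc)
stored = decode_int96(encode_int96(written))

# Hypothetical reader zone (UTC-7), an assumption for this example.
reader_zone = timezone(timedelta(hours=-7))

as_utc = stored.replace(tzinfo=timezone.utc)    # correct interpretation
as_local = stored.replace(tzinfo=reader_zone)   # reader assumes local time
skew = as_local - as_utc                        # size of the silent shift

print("stored wall clock:", stored)
print("skew introduced by the reader:", skew)
```

The stored bytes are identical in both cases. Only the reader's interpretation differs, which is why no error is reported.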
Official documentation
https://issues.apache.org/jira/browse/SPARK-31476
Solutions
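- Read the files in Spark with spark.sql.parquet.int96TimestampConversion=true and spark.sql.session.timeZone=UTC so that the conversion is applied consistently. The Spark documentation describes int96TimestampConversion as an adjustment for INT96 data written by Impala, so verify the result on a sample of the Hive-written files before relying on it.
- Rewrite the Parquet files with a tool such as Parquet-MR with Int96WriteSupport so that the timestamps are explicitly stored in UTC.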
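As a minimal sketch, the two options from the first solution can be kept in one mapping and turned into spark-submit arguments. The option names come from the solution above. The helper name spark_submit_conf_args is hypothetical, and the sketch relies only on spark-submit accepting repeated --conf key=value pairs.

```python
# Options named in the solution above; Spark accepts the values as strings.
INT96_READ_CONF = {
    "spark.sql.parquet.int96TimestampConversion": "true",
    "spark.sql.session.timeZone": "UTC",
}


def spark_submit_conf_args(conf):
    """Flatten a config mapping into `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Prints the arguments to append to a spark-submit command line.
print(" ".join(spark_submit_conf_args(INT96_READ_CONF)))
```

In a PySpark job the same mapping could instead be applied one key at a time through SparkSession.builder.config(key, value) before the session is created.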
Failed attempts
Common approaches that do not work:
- Setting only the Spark session timezone to UTC
  60% failure rate
  This changes how the reader renders timestamps, but not its underlying assumption that the stored INT96 value is in local time, so the offset is still applied incorrectly.
- Shifting timestamps by a fixed offset with date_add/date_sub
  70% failure rate
  The required offset depends on the timezone and on daylight saving time, so a single fixed offset is wrong for many rows.
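The daylight saving problem with a fixed offset can be checked with the Python standard library alone. This is an illustrative sketch: the zone America/New_York and the two dates are assumptions picked for the example, not values taken from the affected data.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def utc_offset_hours(zone_name, year, month, day):
    """UTC offset, in hours, that applies at local noon on the given date."""
    local_noon = datetime(year, month, day, 12, tzinfo=ZoneInfo(zone_name))
    return local_noon.utcoffset().total_seconds() / 3600


# Same zone, two dates on either side of the daylight saving transition.
winter = utc_offset_hours("America/New_York", 2023, 1, 15)
summer = utc_offset_hours("America/New_York", 2023, 7, 15)

print("winter offset (hours):", winter)
print("summer offset (hours):", summer)
print("fixed correction valid all year:", winter == summer)
```

Because the two offsets differ, a correction hard-coded for one season shifts rows from the other season by an extra hour.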