No explicit error; timestamp values off by timezone offset
Tags: data, data_error, ai_generated, partial

Parquet INT96 timestamp reads with incorrect timezone offset when written by Hive

ID: data/parquet-int96-timestamp-timezone

Fix rate: 80%
Confidence: 86%
Evidence count: 1
First seen: 2023-09-05

Version compatibility

Version               Status   Introduced   Deprecated   Notes
Apache Hive 3.1.3     active
Apache Spark 3.2.0    active
Apache Impala 4.0.0   active

Root cause analysis

Hive writes INT96 timestamps to Parquet in UTC, but many readers (e.g., older Spark versions, Impala) assume the stored value is in the local timezone. The values they return are therefore shifted by the local UTC offset, typically by several hours.
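To make the mismatch concrete, here is a minimal sketch using only the Python standard library. It assumes the usual INT96 layout (8 little-endian bytes of nanoseconds within the day followed by a 4-byte Julian day number) and a hypothetical reader located at UTC-8; the helper names are illustrative and do not come from Hive, Spark, or Impala.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
EPOCH = datetime(1970, 1, 1)


def encode_int96(naive: datetime) -> bytes:
    """Pack a naive datetime into 12 INT96 bytes (nanos of day, Julian day)."""
    delta = naive - EPOCH
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    return struct.pack("<qi", nanos_of_day, JULIAN_DAY_OF_UNIX_EPOCH + delta.days)


def decode_int96(raw: bytes) -> datetime:
    """Unpack 12 INT96 bytes into a naive datetime; no timezone is applied."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1_000)


# The writer stores the instant 2023-09-05 12:00:00 normalized to UTC.
stored = encode_int96(datetime(2023, 9, 5, 12, 0, 0))
naive = decode_int96(stored)

# Intended interpretation: the stored wall-clock value is UTC.
as_utc = naive.replace(tzinfo=timezone.utc)

# Hypothetical reader at UTC-8 that treats the same value as local time.
reader_zone = timezone(timedelta(hours=-8))
as_local = naive.replace(tzinfo=reader_zone)

# The two interpretations name instants that differ by the reader's offset.
skew = as_local - as_utc
print(as_utc.isoformat(), as_local.isoformat(), skew)
```

The bytes on disk are identical in both cases; only the timezone the reader attaches to them differs, which is why no explicit error is raised.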

Official documentation

https://issues.apache.org/jira/browse/SPARK-31476

Solutions

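  1. In Spark, set spark.sql.parquet.int96TimestampConversion=true and spark.sql.session.timeZone=UTC when reading the files so that INT96 values are converted consistently. Spark documents int96TimestampConversion as an adjustment for INT96 data written by Impala, so check the result against a few known timestamps.
  2. Rewrite the Parquet files with a tool such as Parquet-MR (using Int96WriteSupport) so that the timestamps are stored explicitly in UTC.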
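The first solution could be applied from PySpark roughly as follows. This is a sketch, not a verified fix: the two configuration keys are the ones named above, while the application name, the function name, and the lazy import are assumptions made so the settings can be inspected without Spark installed.

```python
# Settings taken from solution 1; values are passed to Spark as strings.
INT96_READ_CONF = {
    "spark.sql.parquet.int96TimestampConversion": "true",
    "spark.sql.session.timeZone": "UTC",
}


def build_session(app_name: str = "int96-read"):
    """Create a SparkSession with the INT96 read settings applied.

    app_name is a hypothetical default; pyspark is imported lazily so the
    module can be loaded on machines without Spark.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in INT96_READ_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A typical use would be `spark = build_session()` followed by `spark.read.parquet(path)`, then comparing a few rows against timestamps whose correct values are already known.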

Failed attempts

Common but ineffective approaches:

  1. Setting only the Spark session timezone to UTC (60% failure rate)

    This aligns the reader's session timezone, but it does not change the underlying assumption that INT96 values are in local time, so the offset is still applied incorrectly.

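  2. Converting timestamps using date_add/date_sub with a fixed offset (70% failure rate)

    The correct offset depends on the timezone and on daylight saving time, so a single fixed offset is wrong for many rows. In Spark SQL and Hive, date_add and date_sub also shift by whole days, so they cannot express an hour-level correction.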
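The daylight saving point can be checked with the standard library alone. The sketch below assumes Python 3.9+ with the system timezone database available, and uses America/Los_Angeles purely as an example zone; the function name is illustrative.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def utc_offset_hours(zone: str, local_time: datetime) -> float:
    """Return the UTC offset, in hours, that applies at local_time in zone."""
    aware = local_time.replace(tzinfo=ZoneInfo(zone))
    return aware.utcoffset() / timedelta(hours=1)


# Same zone, two dates: one in standard time, one in daylight saving time.
winter = utc_offset_hours("America/Los_Angeles", datetime(2023, 1, 15, 12, 0))
summer = utc_offset_hours("America/Los_Angeles", datetime(2023, 7, 15, 12, 0))
print(winter, summer)
```

Because the two offsets differ by an hour, a correction hard-coded for one half of the year shifts the other half incorrectly.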