Symptom: No explicit error; timestamp values are off by the timezone offset
Tags: data, data_error, ai_generated, partial

Parquet INT96 timestamp reads with incorrect timezone offset when written by Hive

ID: data/parquet-int96-timestamp-timezone

Fix Rate: 80%
Confidence: 86%
Evidence: 1
First Seen: 2023-09-05

Version Compatibility

Version                Status    Introduced    Deprecated    Notes
Apache Hive 3.1.3      active
Apache Spark 3.2.0     active
Apache Impala 4.0.0    active

Root Cause

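Hive writes INT96 timestamps normalized to UTC, but many readers (e.g., older Spark, Impala) assume the stored value is in the local timezone. No error is raised; the values read back are simply shifted by the reader's UTC offset, usually a whole number of hours.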
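
The mismatch can be reproduced without any Parquet tooling. The sketch below assumes the usual INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian); the helper names and the UTC-4 reader zone are hypothetical and chosen only for illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
EPOCH = datetime(1970, 1, 1)

def encode_int96(naive: datetime) -> bytes:
    """Pack a naive datetime into the 12-byte INT96 layout (hypothetical helper)."""
    delta = naive - EPOCH
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1000
    return struct.pack("<qi", nanos_of_day, delta.days + JULIAN_DAY_OF_UNIX_EPOCH)

def decode_int96(raw: bytes) -> datetime:
    """Unpack 12 INT96 bytes into a naive datetime; no timezone is stored."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1000)

# The writer stores 12:00 and means 12:00 UTC.
raw = encode_int96(datetime(2023, 9, 5, 12, 0, 0))
stored = decode_int96(raw)

# A reader that treats the stored value as UTC recovers the intended instant.
as_utc = stored.replace(tzinfo=timezone.utc)

# A reader that treats it as local time (assumed here: UTC-4) lands on another instant.
as_local = stored.replace(tzinfo=timezone(timedelta(hours=-4)))

skew = as_local - as_utc  # the silent shift between the two interpretations
```

The bytes are identical in both cases; only the reader's assumption about the zone changes the resulting instant, which is why no error is reported.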



Official Documentation

https://issues.apache.org/jira/browse/SPARK-31476

Workarounds

  1. (90% success) Read the files in Spark with spark.sql.parquet.int96TimestampConversion=true and spark.sql.session.timeZone=UTC to force correct conversion.
  2. (85% success) Rewrite the Parquet files using a tool like Parquet-MR with Int96WriteSupport to explicitly store timestamps in UTC.
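
The first workaround could be applied from PySpark roughly as follows. This is a minimal sketch, assuming PySpark is installed; `INT96_READ_CONF` and `build_session` are hypothetical names, and only the two configuration keys come from the workaround itself.

```python
# Settings named in workaround 1, collected in one place.
INT96_READ_CONF = {
    "spark.sql.parquet.int96TimestampConversion": "true",
    "spark.sql.session.timeZone": "UTC",
}

def build_session(app_name: str = "int96-read"):
    """Create a SparkSession with the INT96 read settings applied (hypothetical helper)."""
    # Imported lazily so this module can be loaded where PySpark is absent.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in INT96_READ_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A session created with build_session() can then read the affected table with spark.read.parquet(...) on the table's location. Comparing a few known rows against the source system before trusting the result is advisable.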


Dead Ends

Common approaches that don't work:

  1. Setting the Spark session timezone to UTC on its own (60% fail)

    This aligns the reader's display timezone, but it does not change the underlying assumption that INT96 values are in local time, so the offset is still applied incorrectly.

  2. Converting timestamps using date_add/date_sub with a fixed offset (70% fail)

    The offset varies by timezone and with daylight saving time, so a fixed offset is incorrect for many rows.
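
To see why a fixed offset breaks, compare the UTC offset of a single zone at two dates. The sketch uses Python's standard zoneinfo module; the zone America/New_York and the helper name `utc_offset_hours` are chosen only for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def utc_offset_hours(naive_local: datetime, zone: str) -> float:
    """Return the UTC offset, in hours, that applies to a local wall-clock time."""
    aware = naive_local.replace(tzinfo=ZoneInfo(zone))
    return aware.utcoffset().total_seconds() / 3600

# Same zone, two dates: standard time in January, daylight saving time in July.
winter = utc_offset_hours(datetime(2023, 1, 15, 12), "America/New_York")
summer = utc_offset_hours(datetime(2023, 7, 15, 12), "America/New_York")

print(winter, summer)  # -5.0 -4.0
```

A correction of 5 hours would repair the January rows and leave the July rows an hour off, so any fix has to be timezone-aware rather than a constant shift.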