No explicit error; timestamp values off by timezone offset
data
data_error
ai_generated
partial
Parquet INT96 timestamp reads with incorrect timezone offset when written by Hive
ID: data/parquet-int96-timestamp-timezone
80%Fix Rate
86%Confidence
1Evidence
2023-09-05First Seen
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Hive 3.1.3 | active | — | — | — |
| Apache Spark 3.2.0 | active | — | — | — |
| Apache Impala 4.0.0 | active | — | — | — |
Root Cause
Hive writes INT96 timestamps in UTC but many readers (e.g., older Spark, Impala) assume local timezone, causing off-by-hour errors.
generic中文
Hive 以 UTC 写入 INT96 时间戳,但许多读取器(如旧版 Spark、Impala)假设为本地时区,导致数小时的偏差。
Official Documentation
https://issues.apache.org/jira/browse/SPARK-31476Workarounds
-
90% success Use Spark with config: spark.sql.parquet.int96TimestampConversion=true and spark.sql.session.timeZone=UTC to force correct conversion.
Use Spark with config: spark.sql.parquet.int96TimestampConversion=true and spark.sql.session.timeZone=UTC to force correct conversion.
-
85% success Rewrite Parquet files using a tool like Parquet-MR with Int96WriteSupport to explicitly store timestamps in UTC.
Rewrite Parquet files using a tool like Parquet-MR with Int96WriteSupport to explicitly store timestamps in UTC.
中文步骤
Use Spark with config: spark.sql.parquet.int96TimestampConversion=true and spark.sql.session.timeZone=UTC to force correct conversion.
Rewrite Parquet files using a tool like Parquet-MR with Int96WriteSupport to explicitly store timestamps in UTC.
Dead Ends
Common approaches that don't work:
-
Setting Spark session timezone to UTC
60% fail
While this aligns the reader, it does not fix the underlying assumption that INT96 is in local time; the offset is still applied incorrectly.
-
Converting timestamps using date_add/date_sub with fixed offset
70% fail
The offset may vary by timezone and daylight saving, making a fixed offset incorrect for many cases.