data data_error ai_generated partial

Parquet INT96 timestamp reads as year 5000+ due to Julian date conversion error

ID: data/parquet-int96-timestamp-millennium-bug

Fix Rate: 80%
Confidence: 87%
Evidence: 1
First Seen: 2023-06-10

Version Compatibility

Version | Status | Introduced | Deprecated | Notes
Apache Parquet 1.12.0 | active | - | - | -
Apache Hive 3.1.3 | active | - | - | -
Apache Impala 4.0.0 | active | - | - | -
pyarrow 13.0.0 | active | - | - | -

Root Cause

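Parquet INT96 timestamps pack two fields into 12 bytes: the time of day in nanoseconds (8 bytes) and a Julian day number (4 bytes), counted in days from the Julian epoch in 4713 BC. On that scale the Unix epoch, 1970-01-01, is Julian day 2440588. Some readers (e.g., older Hive, Impala) incorrectly interpret the Julian day as a Unix epoch offset, so dates land far in the future: an offset of 2440588 days is roughly 6,680 years.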
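The conversion described above can be sketched with the standard library alone. This is a minimal illustration, assuming the usual INT96 layout (8 little-endian bytes of nanoseconds within the day, then 4 little-endian bytes of Julian day number). The function names are hypothetical, and `misread_as_epoch_days` only mimics the kind of mistake described here; it is not taken from any specific reader's source.

```python
import datetime as dt
import struct

UNIX_EPOCH = dt.datetime(1970, 1, 1)
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01


def decode_int96(raw: bytes) -> dt.datetime:
    """Decode a 12-byte INT96 timestamp into a naive datetime (microsecond precision)."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # "<qi": little-endian int64 (nanoseconds of day) followed by int32 (Julian day)
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return UNIX_EPOCH + dt.timedelta(
        days=days_since_epoch, microseconds=nanos_of_day // 1000
    )


def misread_as_epoch_days(raw: bytes) -> dt.datetime:
    """Illustrative bug: treat the Julian day as days since 1970-01-01."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return UNIX_EPOCH + dt.timedelta(
        days=julian_day, microseconds=nanos_of_day // 1000
    )


# Julian day 2460106 at 12:00:00, packed the way an INT96 writer would.
sample = struct.pack("<qi", 12 * 3600 * 10**9, 2_460_106)
print(decode_int96(sample))           # correct reading
print(misread_as_epoch_days(sample))  # reading shifted by the missing offset
```

Subtracting `JULIAN_DAY_OF_UNIX_EPOCH` is the whole fix; a reader that skips it shifts every value by the same fixed number of days.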

generic

Official Documentation

https://parquet.apache.org/docs/file-format/types/

Workarounds

  1. 90% success In pyarrow, pass `coerce_int96_timestamp_unit` to `pq.read_table` so INT96 values are decoded at an explicit resolution rather than the default nanoseconds, which overflow for dates outside roughly 1677 to 2262. Example: `import pyarrow.parquet as pq; table = pq.read_table('data.parquet', coerce_int96_timestamp_unit='ms')`
  2. 85% success In Spark, set `spark.sql.parquet.int96TimestampConversion` to `false` and `spark.sql.parquet.int96RebaseModeInRead` to `CORRECTED`, so that Spark applies neither the Impala-specific timestamp adjustment nor the legacy calendar rebase when reading INT96 values
  3. 95% success Rewrite the Parquet file with a writer configured to store timestamps as INT64 instead of INT96, then read the new file. In Spark 3.x this means setting `spark.sql.parquet.outputTimestampType` to `TIMESTAMP_MILLIS` (or `TIMESTAMP_MICROS`), because the default output type is still `INT96`

Dead Ends

Common approaches that don't work:

  1. 95% fail: The CAST operation uses the same broken conversion logic; it will produce the same erroneous future dates.

  2. 80% fail: Converting INT96 to STRING often results in a binary representation (e.g., '\x00...') that is not human-readable and cannot be parsed into a date.

  3. 60% fail: The bug is in the INT96 conversion logic, which may still be present in newer versions if the file was written by a different tool (e.g., Spark) that uses a non-standard INT96 encoding.