# Parquet INT96 timestamps read as dates beyond year 5000 due to a Julian date conversion error

- **ID:** `data/parquet-int96-timestamp-millennium-bug`
- **Domain:** data
- **Category:** data_error
- **Verification level:** ai_generated
- **Fix rate:** 80%

## Root cause

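A Parquet INT96 timestamp stores a Julian day number (days counted from the Julian epoch in 4713 BC) together with the time of day in nanoseconds. Some readers (for example, older versions of Hive and Impala) misinterpret the Julian day as an offset from the Unix epoch. Because the Julian day number of 1970-01-01 is 2,440,588, this shifts dates thousands of years into the future rather than by a few hundred years.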
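The misinterpretation can be reproduced with the standard library alone. The sketch below assumes the usual INT96 layout (8 little-endian bytes of nanoseconds within the day, followed by a 4-byte little-endian Julian day). The function names are illustrative, and `decode_buggy` only models the error described here; it is not the actual code of any reader.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_int96(raw: bytes) -> tuple:
    """Split a 12-byte INT96 value into (nanos_of_day, julian_day)."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # "<qi": little-endian 8-byte integer, then little-endian 4-byte integer.
    return struct.unpack("<qi", raw)


def decode_correct(raw: bytes) -> datetime:
    """Subtract the Julian day of the Unix epoch before building the date."""
    nanos, julian_day = split_int96(raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return UNIX_EPOCH + timedelta(days=days, microseconds=nanos // 1_000)


def decode_buggy(raw: bytes) -> datetime:
    """Model of the bug: the Julian day is treated as days since 1970-01-01."""
    nanos, julian_day = split_int96(raw)
    return UNIX_EPOCH + timedelta(days=julian_day, microseconds=nanos // 1_000)


if __name__ == "__main__":
    # 2020-01-01 12:00:00 UTC: Julian day 2458850, 12 hours of nanoseconds.
    sample = struct.pack("<qi", 12 * 3600 * 10**9, 2_458_850)
    print("correct:", decode_correct(sample).isoformat())
    print("buggy:  ", decode_buggy(sample).isoformat())
```

If a file shows timestamps several thousand years in the future while the correct decoding gives plausible dates, the reader is most likely skipping the epoch subtraction.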

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| Apache Parquet 1.12.0 | active | — | — |
| Apache Hive 3.1.3 | active | — | — |
| Apache Impala 4.0.0 | active | — | — |
| pyarrow 13.0.0 | active | — | — |

## Solutions

1. In pyarrow, read the file with `pq.read_table(path, coerce_int96_timestamp_unit='ms')`. pyarrow applies the Julian day offset when decoding INT96, and the coarser unit keeps far-out values from overflowing the default nanosecond resolution. Example: `import pyarrow.parquet as pq; table = pq.read_table('data.parquet', coerce_int96_timestamp_unit='ms')`
2. In Spark, set `spark.sql.parquet.int96TimestampConversion` to `false` and `spark.sql.parquet.int96RebaseModeInRead` to `CORRECTED` to fix the conversion.
3. Rewrite the Parquet file using a modern writer (e.g., Spark 3.x) configured to store timestamps as INT64 milliseconds instead of INT96 (in Spark, set `spark.sql.parquet.outputTimestampType` to `TIMESTAMP_MILLIS`), then read the new file.
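To make the INT96-to-INT64 rewrite concrete, here is a minimal sketch of the per-value conversion such a rewrite performs. It assumes the standard INT96 layout (nanoseconds of day, then Julian day, both little-endian), and the helper names are hypothetical. In practice the writer does this for you; the sketch is mainly useful for spot-checking a few raw values against the rewritten file.

```python
import struct
from datetime import datetime, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
MILLIS_PER_DAY = 86_400_000
NANOS_PER_MILLI = 1_000_000


def int96_to_epoch_millis(raw: bytes) -> int:
    """Convert one 12-byte INT96 timestamp to INT64 milliseconds since the Unix epoch."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # Sub-millisecond precision is truncated, as it would be with millisecond storage.
    return days_since_epoch * MILLIS_PER_DAY + nanos_of_day // NANOS_PER_MILLI


def is_plausible(epoch_millis: int, min_year: int = 1900, max_year: int = 2100) -> bool:
    """Sanity check for rewritten values; the year bounds are arbitrary assumptions."""
    try:
        year = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc).year
    except (OverflowError, OSError, ValueError):
        return False
    return min_year <= year <= max_year


if __name__ == "__main__":
    raw = struct.pack("<qi", 0, 2_458_850)  # 2020-01-01 00:00:00 UTC
    millis = int96_to_epoch_millis(raw)
    print(millis, is_plausible(millis))
```

Running a check like `is_plausible` over a sample of the rewritten column is a cheap way to confirm that the new file no longer contains far-future dates.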

## Failed attempts

- **Casting the INT96 column with `CAST`:** The CAST operation uses the same broken conversion logic, so it produces the same erroneous future dates. (95% failure rate)
- **Converting INT96 to STRING:** The conversion often yields a binary representation (e.g., '\x00...') that is not human-readable and cannot be parsed into a date. (80% failure rate)
- **Upgrading the reader to a newer version:** The bug is in the INT96 conversion logic, which may still be present in newer versions if the file was written by a different tool (e.g., Spark) that uses a non-standard INT96 encoding. (60% failure rate)
