# ParquetReader: Footer CRC Corrupted (File May Be Truncated or Overwritten)

- **ID:** `data/parquet-corrupted-footer-crc`
- **Domain:** data
- **Category:** data_error
- **Verification level:** ai_generated
- **Fix rate:** 85%

## Root Cause

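The CRC check on the Parquet file footer fails because the file was not completely written (for example, a Spark job failed or the disk filled up) or because another process partially overwrote it.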
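A quick way to tell truncation from other damage is to inspect the file's framing bytes. The sketch below relies on the Parquet layout in which an unencrypted file starts and ends with the magic bytes `PAR1`, with a 4-byte little-endian footer length immediately before the trailing magic. The function name `check_parquet_tail` is hypothetical, and the check looks only at framing; it does not parse or validate the footer metadata itself.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def check_parquet_tail(path):
    """Return (ok, reason) based only on the file's framing bytes."""
    size = os.path.getsize(path)
    # Minimum layout: 4-byte leading magic + 4-byte footer length + 4-byte trailing magic.
    if size < 12:
        return False, "file too small to be a Parquet file"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC:
        return False, "missing leading magic bytes"
    if tail[4:] != MAGIC:
        return False, "missing trailing magic bytes (likely truncated or overwritten)"
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len == 0 or footer_len > size - 12:
        return False, "footer length does not fit in the file"
    return True, "framing looks intact"
```

A `False` result with a missing trailing magic suggests the writer never finished, so the file should be regenerated rather than repaired. A `True` result says only that the framing is plausible; the footer contents may still be damaged.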

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| Apache Parquet 1.12.3 | active | — | — |
| Spark 3.4.1 | active | — | — |
| pyarrow 14.0.0 | active | — | — |
| Hive 4.0.0 | active | — | — |

## Solutions

1. ```
   Recreate the Parquet file from the original data using a reliable write path: call df.write.mode('overwrite').parquet('/path') on the source DataFrame (here df), with checkpointing enabled
   ```
2. ```
   Use parquet-cli or pyarrow to test whether the footer is still readable, e.g. `pq.read_metadata(path)`; if the footer parses, read row groups one at a time with `pq.ParquetFile(path).read_row_group(i)` to salvage those whose pages are intact. Reader options such as `buffer_size` or `use_pandas_metadata` do not skip footer parsing
   ```
3. ```
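   Check file size with `ls -l` and compare to the expected size from source logs; if truncated, padding a copy to the expected size with `truncate -s <expected_size> repaired.parquet` only appends zero bytes and cannot restore the missing footer or data (last resort). Note that `dd if=truncated.parquet of=repaired.parquet bs=1 count=<expected_size>` copies only the bytes that exist and does not pad the file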
   ```
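Outside Spark, one common reading of "reliable write path" in step 1 is to write to a temporary file and rename it into place only after the write succeeds, so readers never see a half-written file. The following is a minimal sketch under that assumption; `write_then_publish` and `write_fn` are hypothetical names, and the atomicity of `os.replace` holds on local POSIX filesystems but not necessarily on object stores.

```python
import os
import tempfile


def write_then_publish(final_path, write_fn):
    """Call write_fn(tmp_path) to produce the file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        # Flush the written bytes to disk before publishing.
        with open(tmp_path, "rb+") as f:
            os.fsync(f.fileno())
        os.replace(tmp_path, final_path)
    except BaseException:
        # On any failure, drop the partial file and leave final_path untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With pyarrow, `write_fn` could be something like `lambda p: pq.write_table(table, p)`, where `table` is the data being rewritten. If the writer fails midway (for example, on a full disk), any previous file at `final_path` stays as it was.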

## What Doesn't Work

- **Re-downloading the same file:** The source file itself is corrupted; re-downloading the same truncated file does not fix the underlying write failure. (90% failure rate)
- **Running `parquet-tools meta`:** `parquet-tools meta` also reads the footer and fails with the same CRC error, so it provides no workaround. (85% failure rate)
- **Disabling footer validation:** Parquet readers (e.g., pyarrow, Spark) must read and validate the footer to locate row groups; there is no standard option to bypass it. (100% failure rate)
