data
data_error
ai_generated
partial
ParquetReader: Corrupt footer CRC — file may be truncated or overwritten
ID: data/parquet-corrupted-footer-crc
Fix rate: 85%
Confidence: 88%
Evidence count: 1
First seen: 2024-03-15
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Parquet 1.12.3 | active | — | — | — |
| Spark 3.4.1 | active | — | — | — |
| pyarrow 14.0.0 | active | — | — | — |
| Hive 4.0.0 | active | — | — | — |
Root cause analysis
The Parquet footer CRC check fails because the file was not fully written (for example, a Spark task failed or the disk filled up) or was partially overwritten by another process.
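A quick structural check can help tell truncation apart from other damage before any recovery is attempted. The sketch below is not the reader's CRC check; it only assumes the standard unencrypted Parquet layout, in which a file starts with the magic bytes `PAR1` and ends with a 4-byte little-endian footer length followed by `PAR1`. The function name `inspect_parquet_trailer` is hypothetical.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def inspect_parquet_trailer(path):
    """Return (ok, reason) after structural checks on the file's head and tail."""
    size = os.path.getsize(path)
    # Smallest possible file: head magic (4) + footer length (4) + tail magic (4).
    if size < 12:
        return False, "file too small to hold the magic bytes and footer length"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC:
        return False, "missing leading magic bytes"
    if tail[4:] != MAGIC:
        return False, "missing trailing magic bytes (likely truncated or overwritten)"
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must fit between the leading magic and the 8-byte trailer.
    if footer_len == 0 or footer_len > size - 12:
        return False, "footer length does not fit inside the file"
    return True, "trailer looks structurally plausible"
```

A `False` result points to a file that was cut short or overwritten at either end. A `True` result only means the layout is plausible; the footer contents can still be damaged.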
Official documentation
https://parquet.apache.org/docs/file-format/
Solutions
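- Recreate the Parquet file from the original data using a reliable write path, for example `df.write.mode('overwrite').parquet('/path')` with checkpointing enabled, where `df` is the DataFrame rebuilt from the original data.
- Use parquet-cli or pyarrow to test whether the footer is still readable, for example `pq.read_metadata(path)` or `pq.read_table(path)`. Read options such as `buffer_size=0` or `use_pandas_metadata=False` do not skip footer parsing, so this succeeds only when the footer itself is intact.
- Check the file size with `ls -l` and compare it to the expected size from the source logs. If the file is truncated, padding a copy to the expected size (for example `cp truncated.parquet repaired.parquet && truncate -s <expected_size> repaired.parquet`) is a last resort and does not restore the missing bytes. Note that `dd if=truncated.parquet of=repaired.parquet bs=1 count=<expected_size>` copies at most that many bytes and does not pad a shorter file.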
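For writers outside Spark, one way to approximate a reliable write path is to write to a temporary file, confirm its size, and only then rename it into place, so readers never see a half-written file. The sketch below is a minimal illustration under that assumption; `publish_atomically` is a hypothetical helper, and `data` stands in for Parquet bytes that were already serialized in memory.

```python
import os
import tempfile


def publish_atomically(data: bytes, final_path: str) -> None:
    """Write data to a temp file in the target directory, then rename it into place."""
    target_dir = os.path.dirname(os.path.abspath(final_path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        # Compare the size on disk with the expected size before publishing.
        if os.path.getsize(tmp_path) != len(data):
            raise IOError("short write detected; refusing to publish")
        os.replace(tmp_path, final_path)  # atomic when source and target share a filesystem
    except BaseException:
        # Clean up the temp file so a failed write leaves no partial output behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the write fails partway (for example, the disk fills up), the temporary file is removed and the previous file at `final_path`, if any, is left untouched.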
Ineffective attempts
Common but ineffective approaches:
- 90% failure rate: The source file itself is corrupted; re-downloading the same truncated file does not fix the underlying write failure.
- 85% failure rate: `parquet-tools meta` also reads the footer and will fail with the same CRC error, providing no workaround.
- 100% failure rate: Parquet readers (e.g., pyarrow, Spark) always validate the footer CRC; there is no standard option to bypass it.