data data_error ai_generated partial

ParquetReader: Corrupt footer CRC — file may be truncated or overwritten

ID: data/parquet-corrupted-footer-crc

Fix rate: 85%

Confidence: 88%

Evidence count: 1

First seen: 2024-03-15

Version compatibility

Version	Status	Introduced	Deprecated	Notes
Apache Parquet 1.12.3	active	—	—	—
Spark 3.4.1	active	—	—	—
pyarrow 14.0.0	active	—	—	—
Hive 4.0.0	active	—	—	—

Root cause analysis

The Parquet footer CRC check fails because the file was not fully written (e.g., a Spark task failed or the disk filled up mid-write) or was partially overwritten by another process. Because the footer is written last, at the end of the file, an interrupted write typically leaves it missing or incomplete.
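
As a rough illustration of why truncation breaks the footer: per the Parquet file format, a file ends with the serialized footer metadata, a 4-byte little-endian footer length, and the 4-byte magic `PAR1`. The sketch below uses only the Python standard library to check that tail structure. The function name `check_parquet_tail` and the returned messages are hypothetical, and passing this check does not prove that the footer metadata itself is valid.

```python
import os
import struct

# Magic bytes at the start and end of a Parquet file with a plaintext footer.
# Files with an encrypted footer use a different magic and are not handled here.
MAGIC = b"PAR1"


def check_parquet_tail(path):
    """Return (ok, message) describing whether the file tail looks structurally intact."""
    size = os.path.getsize(path)
    # Smallest possible layout: leading magic (4) + footer length (4) + trailing magic (4).
    if size < 12:
        return False, f"file is only {size} bytes; too small to be a Parquet file"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC:
        return False, "leading magic is missing; the file start may have been overwritten"
    if tail[4:] != MAGIC:
        return False, "trailing magic is missing; the file is probably truncated"
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must fit between the leading magic and the 8-byte tail.
    if footer_len > size - 12:
        return False, f"footer length {footer_len} does not fit in a {size}-byte file"
    return True, f"tail looks intact (footer length {footer_len} bytes)"
```

A call such as `check_parquet_tail("/path/part-00000.parquet")` (path is a placeholder) gives a quick first signal before trying a full reader; a `False` result with a missing trailing magic points toward truncation rather than a reader bug.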

generic

Official documentation

https://parquet.apache.org/docs/file-format/

Solutions

Recreate the Parquet file from the original data using a reliable write path, for example `df.write.mode('overwrite').parquet('/path')` on the source DataFrame (the writer is a DataFrame attribute, not `spark.write`), with checkpointing enabled.

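Use parquet-cli or pyarrow to attempt a read, e.g. `pq.read_table(path)`, and confirm whether the footer is still parseable. The read options do not help here: `use_legacy_int96_timestamps` is not a `pq.read_table` parameter, `buffer_size=0` is already the default, and `use_pandas_metadata=False` (also the default) only controls whether pandas index columns are loaded. None of them skips footer parsing, so the read succeeds only if the footer itself is intact.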

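Check the file size with `ls -l` and compare it to the expected size from the source logs. Note that `dd if=truncated.parquet of=repaired.parquet bs=1 count=<expected_size>` stops at the end of the input and therefore does not pad anything; to zero-pad a copy, use `cp truncated.parquet repaired.parquet && truncate -s <expected_size> repaired.parquet` (last resort; zero padding does not restore the footer or any lost data).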
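
The size comparison and zero-padding step can also be sketched in Python with the standard library only. The function name `pad_copy_to_size` and its arguments are hypothetical; as with the command above, padding only restores the byte count, not the footer or any lost row groups.

```python
import os
import shutil


def pad_copy_to_size(src, dst, expected_size):
    """Copy src to dst and pad dst up to expected_size bytes.

    Returns the number of bytes that were missing (0 if src is not shorter
    than expected, in which case dst is a plain copy).
    """
    actual = os.path.getsize(src)
    shutil.copyfile(src, dst)  # work on a copy; never modify the original file
    missing = max(expected_size - actual, 0)
    if missing:
        with open(dst, "r+b") as f:
            # Extends the file; on most platforms the new bytes are zero-filled.
            f.truncate(expected_size)
    return missing
```

A nonzero return value confirms the truncation and quantifies it, which is usually more useful than the padded file itself when deciding whether to regenerate the data.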

Ineffective attempts

Common but ineffective approaches:

90% failure
The source file itself is corrupted; re-downloading the same truncated file does not fix the underlying write failure.
85% failure
`parquet-tools meta` also reads the footer and will fail with the same CRC error, so it provides no workaround.
100% failure
Parquet readers (e.g., pyarrow, Spark) always validate the footer; there is no standard option to bypass that check.