data data_error ai_generated partial

ParquetReader: Corrupt footer CRC — file may be truncated or overwritten

ID: data/parquet-corrupted-footer-crc

Fix rate: 85%
Confidence: 88%
Evidence count: 1
First seen: 2024-03-15

Version compatibility

Version                Status  Introduced  Deprecated  Notes
Apache Parquet 1.12.3  active
Spark 3.4.1            active
pyarrow 14.0.0         active
Hive 4.0.0             active

Root cause analysis

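The Parquet footer CRC check fails when the file was not fully written (for example, a Spark task failed or the disk filled up mid-write) or was partially overwritten by another process. The footer is the last part of a Parquet file to be written, so a write that stops early leaves it missing or incomplete.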
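
A quick structural check can help confirm truncation before any repair is attempted. The sketch below relies only on the documented Parquet layout, in which a file starts with the 4-byte magic `PAR1` and ends with a 4-byte little-endian footer length followed by the same magic (`PARE` for files with an encrypted footer). The function name `inspect_parquet_trailer` is a hypothetical helper, not part of any Parquet library, and passing this check does not prove that the footer contents are valid.

```python
import os
import struct

# Magic bytes defined by the Parquet file format.
MAGICS = (b"PAR1", b"PARE")  # PARE marks a file with an encrypted footer


def inspect_parquet_trailer(path):
    """Hypothetical helper: return (ok, reason) from a structural tail check."""
    size = os.path.getsize(path)
    # Smallest conceivable file: head magic + footer length + tail magic.
    if size < 12:
        return False, f"file is only {size} bytes"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    if head not in MAGICS:
        return False, "leading magic bytes missing"
    if tail[4:] not in MAGICS:
        return False, "trailing magic bytes missing (file is likely truncated)"
    # The footer must fit between the head magic and the 8-byte trailer.
    if footer_len == 0 or footer_len > size - 12:
        return False, f"footer length {footer_len} does not fit in a {size}-byte file"
    return True, "trailer looks structurally valid"
```

If the trailing magic is missing, the end of the file was never written or has been overwritten, and regenerating the file from source data is usually the only reliable fix.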

generic

Official documentation

https://parquet.apache.org/docs/file-format/

Solutions

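  1. Recreate the Parquet file from the original data using a reliable write path: `df.write.mode('overwrite').parquet('/path')` with checkpointing enabled, where `df` is the DataFrame rebuilt from the source (the writer is reached through a DataFrame's `write` attribute, not through the `SparkSession`).
  2. Use parquet-cli or pyarrow to check how much of the file is still readable, for example with `pq.read_table(path)`. `buffer_size=0` and `use_pandas_metadata=False` are already the defaults of `pq.read_table`, and `use_pandas_metadata` only controls how pandas index columns are loaded; it does not skip footer parsing. `use_legacy_int96_timestamps` does not appear to be a `pq.read_table` parameter, so passing it is likely to raise a `TypeError`.
  3. Check the file size with `ls -l` and compare it to the expected size from the source logs. If the file is truncated, note that `dd if=truncated.parquet of=repaired.parquet bs=1 count=<expected_size>` copies at most the bytes that exist and does not pad the output. Extending a copy with `truncate -s <expected_size> repaired.parquet` adds zero bytes but does not recreate the footer, so this is a last resort that is unlikely to restore data.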
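
One way to make the write path in step 1 harder to truncate on a local filesystem is to write to a temporary file and rename it into place only after it passes a basic completeness check. This is a minimal sketch of that general pattern, not something described in the original entry: `write_then_publish` and `write_fn` are hypothetical names, the check only looks at file size and trailing magic bytes, and distributed stores such as HDFS or object storage rely on their own commit protocols instead.

```python
import os
import tempfile


def write_then_publish(write_fn, final_path):
    """Hypothetical helper: call write_fn(tmp_path), then rename into place.

    Readers never see a half-written file at final_path, because the
    rename happens only after the temporary file looks complete.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        with open(tmp_path, "rb") as f:
            f.seek(0, os.SEEK_END)
            if f.tell() < 12:
                raise OSError("output is too small to be a Parquet file")
            f.seek(-4, os.SEEK_END)
            # A fully written Parquet file ends with its magic bytes.
            if f.read(4) not in (b"PAR1", b"PARE"):
                raise OSError("output does not end with Parquet magic bytes")
        os.replace(tmp_path, final_path)  # atomic on POSIX within one filesystem
    finally:
        # Clean up the temp file if the write or the check failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With pyarrow, a call might look like `write_then_publish(lambda p: pq.write_table(table, p), '/path/out.parquet')`, where `table` is an existing `pyarrow.Table` and the path is a placeholder.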

Ineffective attempts

Common approaches that do not work:

  1. 90% failure rate

    The source file itself is corrupted; re-downloading the same truncated file does not fix the underlying write failure.

  2. 85% failure rate

    parquet-tools meta also reads the footer and will fail with the same CRC error, providing no workaround.

  3. 100% failure rate

    Parquet readers (e.g., pyarrow, Spark) always validate the footer CRC; there is no standard option to bypass it.