data data_error ai_generated partial

ParquetReader: Corrupt footer CRC — file may be truncated or overwritten

ID: data/parquet-corrupted-footer-crc

Fix Rate: 85%
Confidence: 88%
Evidence: 1
First Seen: 2024-03-15

Version Compatibility

Version                  Status   Introduced   Deprecated   Notes
Apache Parquet 1.12.3    active   -            -            -
Spark 3.4.1              active   -            -            -
pyarrow 14.0.0           active   -            -            -
Hive 4.0.0               active   -            -            -

Root Cause

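The Parquet footer CRC check fails because the file was not fully written (e.g., a Spark task failed or the disk filled up) or was partially overwritten by another process. Parquet stores its file metadata in the footer at the end of the file, so a write that stops early leaves the footer missing or incomplete.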
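
One quick way to tell truncation from overwriting is to look at the bytes the format guarantees: a Parquet file starts with the 4-byte magic `PAR1` and ends with a 4-byte little-endian footer length followed by `PAR1` again. The sketch below uses only the Python standard library. `inspect_parquet_tail` is a hypothetical helper name, and the check is purely structural: it does not parse the footer or verify any checksum.

```python
import os
import struct

# Magic for files with a plaintext footer. Files written with an encrypted
# footer end in b"PARE" instead and would be reported as missing the magic.
MAGIC = b"PAR1"


def inspect_parquet_tail(path):
    """Return a short diagnosis string based on the file's first and last bytes."""
    size = os.path.getsize(path)
    # Smallest possible layout: head magic + footer length + tail magic.
    if size < 12:
        return "too small to be a Parquet file"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        footer_len_bytes = f.read(4)
        tail = f.read(4)
    if head != MAGIC:
        return "missing leading magic (overwritten or not Parquet)"
    if tail != MAGIC:
        return "missing trailing magic (likely truncated)"
    (footer_len,) = struct.unpack("<I", footer_len_bytes)
    # The footer must fit between the head magic and the 8 trailing bytes.
    if footer_len == 0 or footer_len > size - 12:
        return "footer length out of range (likely overwritten)"
    return "tail looks structurally plausible"
```

A "structurally plausible" result only means the trailing bytes are in place; the footer contents can still be damaged.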

generic

Official Documentation

https://parquet.apache.org/docs/file-format/

Workarounds

  1. 95% success Recreate the Parquet file from the original data using a reliable write path, e.g. `df.write.mode('overwrite').parquet('/path')` on a Spark DataFrame `df`, with checkpointing enabled.
  2. 70% success Attempt a read with parquet-cli or pyarrow, e.g. `pq.read_table(path, buffer_size=0, use_pandas_metadata=False)`. These options do not skip footer parsing (`use_pandas_metadata` only controls whether pandas index metadata is applied), so this helps only when the footer is still readable. The `use_legacy_int96_timestamps=False` setting sometimes suggested alongside them does not appear to be a `pq.read_table` parameter.
  3. 50% success Check the file size with `ls -l` and compare it to the expected size from the source logs. If the file is truncated, note that `dd if=truncated.parquet of=repaired.parquet bs=1 count=<expected_size>` only copies the bytes that exist and does not pad. Extending a copy with `truncate -s <expected_size> repaired.parquet` restores the length but fills the gap with zeros rather than the lost footer (last resort, may not restore data).
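
Workaround 1 depends on a write path that never leaves a half-written file at the final location. Outside Spark, one common way to get that property is to write to a temporary file and rename it into place. The following is a minimal sketch of that general technique, not a description of what Spark or any Parquet library does internally; `write_atomically` and `write_fn` are hypothetical names.

```python
import os
import tempfile


def write_atomically(final_path, write_fn):
    """Call write_fn(tmp_path) to produce the file, then publish it with one rename."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # write_fn opens the path itself
    try:
        write_fn(tmp_path)
        # Flush the written bytes to disk before making the file visible.
        with open(tmp_path, "r+b") as f:
            os.fsync(f.fileno())
        os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
    except BaseException:
        # On any failure, discard the partial file and leave final_path untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With pyarrow installed, a call might look like `write_atomically("out.parquet", lambda p: pq.write_table(table, p))`, assuming `table` is an existing `pyarrow.Table`. If the writer fails midway, readers still see the previous complete file or no file at all.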

Dead Ends

Common approaches that don't work:

  1. 90% fail The source file itself is corrupted; re-downloading the same truncated file does not fix the underlying write failure.
  2. 85% fail `parquet-tools meta` also reads the footer and will fail with the same CRC error, providing no workaround.
  3. 100% fail Parquet readers (e.g., pyarrow, Spark) always validate the footer CRC; there is no standard option to bypass it.
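
Before re-downloading (dead end 1), it can be worth confirming whether the local copy differs from the source at all. The sketch below compares SHA-256 digests and assumes both copies are reachable as local paths (for example through a mounted share); `sha256_of` and `redownload_could_help` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large Parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def redownload_could_help(local_path, source_path):
    """True only when the local copy differs from the source copy."""
    return sha256_of(local_path) != sha256_of(source_path)
```

If this returns `False`, the two copies are byte-identical, so the damage happened when the source was written and the file has to be regenerated instead.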