data
data_error
ai_generated
partial
ParquetReader: Corrupt footer CRC — file may be truncated or overwritten
ID: data/parquet-corrupted-footer-crc
Fix Rate: 85%
Confidence: 88%
Evidence: 1
First Seen: 2024-03-15
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Parquet 1.12.3 | active | — | — | — |
| Spark 3.4.1 | active | — | — | — |
| pyarrow 14.0.0 | active | — | — | — |
| Hive 4.0.0 | active | — | — | — |
Root Cause
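The Parquet footer CRC check fails because the file was not fully written (for example, a Spark task failed or the disk filled up mid-write) or because another process partially overwrote it. The footer is the last part of the file to be written, so any truncation removes or damages it.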
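A quick structural check can help tell truncation apart from other damage before any reader is involved. The sketch below relies only on the documented file layout (a leading `PAR1` magic, and a tail made of a 4-byte little-endian footer length followed by `PAR1`). The function name `inspect_parquet_tail` is hypothetical, and the sketch assumes a plaintext footer; it does not verify any CRC or decode the metadata.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes for files with a plaintext footer


def inspect_parquet_tail(path):
    """Return (ok, reason) describing whether the file looks structurally complete.

    Hypothetical helper: checks only the magic bytes and the declared footer
    length, not the footer contents.
    """
    size = os.path.getsize(path)
    # Smallest possible layout: 4-byte leading magic + 4-byte footer length
    # + 4-byte trailing magic.
    if size < 12:
        return False, f"file is only {size} bytes"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    if head != PARQUET_MAGIC:
        return False, "leading magic missing; file may have been overwritten"
    if tail[4:] != PARQUET_MAGIC:
        return False, "trailing magic missing; file is probably truncated"
    if footer_len + 12 > size:
        return False, "declared footer length exceeds file size"
    return True, "ok"
```

A result of `(False, "trailing magic missing; ...")` points to an incomplete write, in which case the footer is gone and the file has to be regenerated from its source data.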
Official Documentation
https://parquet.apache.org/docs/file-format/
Workarounds
- 95% success: Recreate the Parquet file from the original data using a reliable write path, for example `spark.write.mode('overwrite').parquet('/path')` with checkpointing enabled.
- 70% success: Attempt to read the file with parquet-cli or pyarrow, for example `pq.read_table(path, buffer_size=0, use_pandas_metadata=False)`. This only helps if the footer is still partially readable: `use_pandas_metadata=False` does not consult the pandas-specific schema metadata, but the reader still has to parse the footer itself.
- 50% success: Check the file size with `ls -l` and compare it to the expected size from the source logs. If the file is truncated, pad a copy to the expected size with `cp truncated.parquet repaired.parquet && truncate -s <expected_size> repaired.parquet` (GNU `truncate`). This is a last resort and may not restore any data.
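The size comparison and padding step can be sketched in plain Python as follows. The function name `pad_to_expected_size` and its arguments are hypothetical, and the expected size is assumed to come from the source logs as described above. Zero padding only restores the file length; it does not recreate the missing footer bytes, so this remains a last resort.

```python
import os
import shutil


def pad_to_expected_size(src, dst, expected_size):
    """Copy src to dst and zero-pad dst up to expected_size bytes.

    Hypothetical helper. Returns the number of bytes that were missing
    from src (0 if src is not shorter than expected_size).
    """
    actual = os.path.getsize(src)
    missing = max(expected_size - actual, 0)
    shutil.copyfile(src, dst)  # work on a copy; leave the original untouched
    with open(dst, "ab") as f:
        remaining = missing
        while remaining:
            chunk = min(remaining, 1 << 20)  # write zeros in 1 MiB pieces
            f.write(bytes(chunk))
            remaining -= chunk
    return missing
```

A non-zero return value confirms that the file is shorter than the writer reported, which is itself useful evidence that the original write did not complete.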
Dead Ends
Common approaches that don't work:
- 90% fail: Re-downloading the file. The source file itself is corrupted, so fetching the same truncated file again does not fix the underlying write failure.
- 85% fail: Running `parquet-tools meta`. It also reads the footer and fails with the same CRC error, so it provides no workaround.
- 100% fail: Bypassing the CRC check. Parquet readers (e.g., pyarrow, Spark) always validate the footer CRC; there is no standard option to bypass it.
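One way to confirm that re-downloading is a dead end is to compare a digest of the local copy with a digest computed at the source. The sketch below is illustrative only: `file_sha256` and `redownload_could_help` are hypothetical names, and it assumes a SHA-256 digest of the source file is available from the storage system or from whoever produced the file.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def redownload_could_help(local_path, source_digest):
    """True only if the local copy differs from the source copy.

    If the digests match, the transfer was faithful and the corruption
    already exists at the source, so downloading again cannot fix it.
    """
    return file_sha256(local_path) != source_digest
```

If the digests match, the remaining option is to regenerate the file from the original data.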