data data_error ai_generated partial

ParquetReader: Corrupt footer CRC — file may be truncated or overwritten

ID: data/parquet-corrupted-footer-crc


Fix Rate: 85%

Confidence: 88%

Evidence: 1

First Seen: 2024-03-15

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
Apache Parquet 1.12.3	active	—	—	—
Spark 3.4.1	active	—	—	—
pyarrow 14.0.0	active	—	—	—
Hive 4.0.0	active	—	—	—

Root Cause

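The Parquet footer check fails because the file was not fully written (for example, a Spark task failed or the disk filled up mid-write) or because another process partially overwrote it. Parquet stores its file metadata at the end of the file, so a truncated or overwritten tail leaves the reader without a valid footer.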
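
A quick way to tell truncation apart from other damage is to look at the file's last bytes. The sketch below assumes the layout described in the Parquet file-format documentation: an unencrypted file starts with the magic bytes `PAR1` and ends with the footer metadata, a 4-byte little-endian footer length, and `PAR1` again. The function name `inspect_parquet_trailer` is hypothetical, and the check only looks at the head and tail; it does not parse the footer or validate any checksum.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes that open and close an unencrypted Parquet file


def inspect_parquet_trailer(path):
    """Return (ok, reason) based only on the file's head and tail bytes."""
    size = os.path.getsize(path)
    # Smallest plausible file: head magic + footer length + tail magic = 12 bytes.
    if size < 12:
        return False, f"file is only {size} bytes"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC:
        return False, "leading magic bytes missing"
    if tail[4:] != MAGIC:
        return False, "trailing magic bytes missing (likely truncated)"
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must fit between the head magic and the 8-byte trailer.
    if footer_len == 0 or footer_len > size - 12:
        return False, f"footer length {footer_len} does not fit in a {size}-byte file"
    return True, f"trailer looks intact, footer is {footer_len} bytes"
```

A file that fails this check cannot be opened by any reader, whereas a file that passes may still have a damaged footer body or damaged data pages.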



Official Documentation

https://parquet.apache.org/docs/file-format/

Workarounds

95% success Recreate the Parquet file from the original data using a reliable write path, for example `df.write.mode('overwrite').parquet('/path')` on the source DataFrame, with checkpointing enabled.
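For writers that produce a single local file, one way to make the write path more reliable is to write to a temporary file and rename it into place only after the write succeeds. This is a minimal sketch, not something the entry prescribes: `write_atomically` and its `write_fn` callback are hypothetical names, and the rename is only atomic when the temporary file and the target are on the same filesystem (it does not apply to object stores).

```python
import os
import tempfile


def write_atomically(final_path, write_fn):
    """Call write_fn(tmp_path) and rename onto final_path only on success."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # write_fn reopens the path itself
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Keep any existing final_path untouched and drop the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With pyarrow, the callback could be `lambda p: pq.write_table(table, p)`, so readers never see a half-written file at the final path.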
70% success Attempt to read the file with parquet-cli or pyarrow, for example `pq.read_table(path)`. This only helps when the footer itself is still parseable: `buffer_size=0` and `use_pandas_metadata=False` are already the pyarrow defaults and do not skip footer parsing, and `use_legacy_int96_timestamps` does not appear to be a `read_table` parameter.
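When a directory holds many Parquet files, it can help to find out which ones still have a parseable footer before deciding what to regenerate. The sketch below assumes pyarrow's `pq.read_metadata`, which reads only the footer; the function name `triage_parquet_files` and the injectable `read_metadata` argument are hypothetical, added so the logic can be tried without pyarrow installed.

```python
def triage_parquet_files(paths, read_metadata=None):
    """Split paths into (readable, unreadable) by trying to parse each footer."""
    if read_metadata is None:
        # Imported lazily so the helper can be exercised without pyarrow.
        import pyarrow.parquet as pq
        read_metadata = pq.read_metadata
    readable, unreadable = [], []
    for path in paths:
        try:
            meta = read_metadata(path)
            readable.append((path, meta.num_rows))
        except Exception as exc:  # pyarrow typically raises ArrowInvalid or OSError
            unreadable.append((path, str(exc)))
    return readable, unreadable
```

Files in the second list are candidates for regeneration from the source data; files in the first list have a readable footer, though their data pages may still be damaged.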
50% success Check the file size with `ls -l` and compare it to the expected size from the source logs. If the file is truncated, it can be zero-padded to the expected size with `cp truncated.parquet repaired.parquet && truncate -s <expected_size> repaired.parquet`. Note that `dd if=truncated.parquet of=repaired.parquet bs=1 count=<expected_size>` copies only the bytes that exist and does not pad. This is a last resort: padding cannot recreate a missing footer, so it may not restore data.
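The size comparison and padding step can also be scripted. This is a sketch under the assumption that the expected size is known from the source logs; `pad_copy` is a hypothetical helper name, and it works on a copy so the truncated original stays available.

```python
import os
import shutil


def pad_copy(src, dst, expected_size):
    """Copy src to dst and zero-pad dst up to expected_size; return bytes added."""
    actual = os.path.getsize(src)
    missing = expected_size - actual
    if missing <= 0:
        raise ValueError(f"{src} is {actual} bytes, not shorter than {expected_size}")
    shutil.copyfile(src, dst)
    with open(dst, "r+b") as f:
        # Extends the file; on most platforms the new bytes read as zeros.
        f.truncate(expected_size)
    return missing
```

The returned byte count shows how much of the file was lost. Because the padded tail is zeros rather than real footer bytes, a reader is still expected to reject the result unless the footer survived elsewhere.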


Dead Ends

Common approaches that don't work:

90% fail Re-downloading the file: the source file itself is corrupted, so fetching the same truncated file again does not fix the underlying write failure.
85% fail Inspecting with `parquet-tools meta`: it also reads the footer and fails with the same CRC error, so it offers no workaround.
100% fail Disabling footer validation: Parquet readers (e.g., pyarrow, Spark) always validate the footer CRC; there is no standard option to bypass it.