ParquetDecodingException data data_error ai_generated true

Parquet dictionary page truncated — unexpected end of stream

ID: data/parquet-dictionary-page-truncated

Fix Rate: 80%
Confidence: 85%
Evidence: 1
First Seen: 2023-11-15

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
parquet-mr 1.12.0	active	—	—	—
pyarrow 14.0.0	active	—	—	—
spark 3.4.0	active	—	—	—

Root Cause

A dictionary page in the Parquet file was not fully written, typically because of an incomplete write or a partial upload, so the reader reaches end-of-file (EOF) before the page ends.
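
A quick way to tell a cut-off file from other kinds of corruption is to look at its last bytes: the Parquet format ends every unencrypted file with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The sketch below checks only that trailer. The function name `has_parquet_trailer` is illustrative, and passing this check does not prove that every page is intact.

```python
import os
import struct

MAGIC = b"PAR1"  # unencrypted Parquet files begin and end with these four bytes


def has_parquet_trailer(path):
    """Return True if the file starts with PAR1 and ends with a plausible footer."""
    size = os.path.getsize(path)
    # Smallest possible layout: 4-byte magic + 4-byte footer length + 4-byte magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            return False
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    # The footer must fit between the leading magic and the 8-byte trailer.
    return tail[4:] == MAGIC and 0 < footer_len <= size - 12
```

A `False` result points to a file that was cut off before the writer finished, which matches the partial-upload case described above.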

generic

Official Documentation

https://issues.apache.org/jira/browse/PARQUET-2300

Workarounds

70% success: Verify file integrity with parquet-tools. If the command fails, re-upload the file from a known-good source.
```
parquet-tools meta corrupted.parquet
```
80% success: Salvage the readable row groups with pyarrow. `pq.read_table` reads every row group and raises the same error when it reaches the truncated dictionary page, so read the row groups one at a time and write out only the ones that decode. This requires an intact footer, and the rows in any damaged row group are lost.
```
import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile('corrupted.parquet')  # raises if the footer is missing
with pq.ParquetWriter('repaired.parquet', pf.schema_arrow) as writer:
    for i in range(pf.num_row_groups):
        try:
            table = pf.read_row_group(i)
        except (pa.ArrowException, OSError):
            continue  # skip the row group with the truncated page
        writer.write_table(table)
```
50% success: If using Spark, disable the vectorized Parquet reader to fall back to non-vectorized reading, which may handle partial files.
```
spark.sql.parquet.enableVectorizedReader=false
```
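
In PySpark the setting can be applied to a running session and the data rewritten to a fresh location while it is in effect. The sketch below takes an existing `SparkSession` as `spark`; the function name and the paths are placeholders, and whether the non-vectorized reader gets past the damaged page depends on where the truncation falls.

```python
VECTORIZED_KEY = "spark.sql.parquet.enableVectorizedReader"


def rewrite_with_row_based_reader(spark, src, dst):
    """Copy src to dst with the vectorized Parquet reader switched off."""
    previous = spark.conf.get(VECTORIZED_KEY)
    spark.conf.set(VECTORIZED_KEY, "false")
    try:
        # The write is the action that triggers the read, so it must run
        # while the setting is still "false".
        spark.read.parquet(src).write.parquet(dst)
    finally:
        spark.conf.set(VECTORIZED_KEY, previous)  # restore the earlier value
```

Restoring the previous value keeps the slower non-vectorized path from affecting other queries in the same session.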

Dead Ends

Common approaches that don't work:

Re-download the file from the same source without verifying its checksum (60% fail)
If the file is corrupted at the origin, re-downloading does not fix the underlying issue.
Increase memory allocation for the reader, e.g. `spark.executor.memory` (90% fail)
The error is about truncated data, not memory limits; more memory does not reconstruct missing bytes.
Use a different Parquet reader library, e.g. fastparquet instead of pyarrow (95% fail)
All readers will fail on the same truncated dictionary page because the file is structurally incomplete.
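
The first dead end fails because nothing confirms that the downloaded bytes match a good copy. A minimal sketch of that check, assuming the source of the file publishes a SHA-256 digest (the function names and the expected digest are placeholders):

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large Parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare the local file against a digest published by the source."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If the digests match and the file still fails to read, the copy at the origin is itself truncated and has to be regenerated rather than downloaded again.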