ParquetDecodingException
data
data_error
ai_generated
true
Parquet dictionary page truncated — unexpected end of stream
ID: data/parquet-dictionary-page-truncated
Fix Rate: 80%
Confidence: 85%
Evidence: 1
First Seen: 2023-11-15
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| parquet-mr 1.12.0 | active | — | — | — |
| pyarrow 14.0.0 | active | — | — | — |
| spark 3.4.0 | active | — | — | — |
Root Cause
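The dictionary page of a Parquet column chunk was not fully written, typically because a write was interrupted or an upload completed only partially. The reader therefore reaches the end of the file before it has read all the bytes the page is supposed to contain.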
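A quick way to spot a file that was cut short is to check its framing. The sketch below relies on the Parquet layout, in which an unencrypted file starts and ends with the 4-byte magic `PAR1` and stores the footer length in the 4 bytes before the trailing magic. The function name `check_parquet_framing` is a hypothetical helper, not part of any library. A passing result only shows that the file ends where a footer should be; it does not prove that every page inside is complete.

```python
import os
import struct

# Unencrypted Parquet files begin and end with these 4 bytes.
# (Files with an encrypted footer use b"PARE" and are not handled here.)
MAGIC = b"PAR1"


def check_parquet_framing(path):
    """Return (ok, reason) based only on the leading and trailing bytes."""
    size = os.path.getsize(path)
    # Smallest possible layout: magic + 4-byte footer length + magic.
    if size < 12:
        return False, "file is too small to be a Parquet file"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC:
        return False, "leading magic bytes are missing"
    if tail[4:] != MAGIC:
        return False, "trailing magic bytes are missing; the file was probably cut short"
    # Footer length is a little-endian unsigned 32-bit integer.
    footer_len = struct.unpack("<I", tail[:4])[0]
    if footer_len + 12 > size:
        return False, "declared footer length exceeds the file size"
    return True, "framing looks intact"
```

For example, `check_parquet_framing("corrupted.parquet")` returns a pair such as `(False, "trailing magic bytes are missing; ...")` for a partially uploaded file.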
Official Documentation
https://issues.apache.org/jira/browse/PARQUET-2300
Workarounds
-
70% success Verify file integrity with parquet-tools: `parquet-tools meta corrupted.parquet`. If the command fails, re-upload the file from a known good source. Because `meta` reads the file footer, a successful run does not prove that every data page is intact.
-
80% success Salvage the readable row groups with pyarrow. A single `pq.read_table('corrupted.parquet')` call reads every row group and fails on the truncated page, so it cannot skip the broken dictionary on its own. Instead, open the file with `pq.ParquetFile('corrupted.parquet')`, read each row group with `read_row_group(i)`, and write the ones that decode to `repaired.parquet`. This requires an intact footer, and the rows in any damaged row group are lost.
-
50% success If using Spark, set `spark.sql.parquet.enableVectorizedReader=false` to fall back to the non-vectorized reader, which may handle partial files. This changes how pages are decoded; it cannot restore bytes that are missing from the file.
Dead Ends
Common approaches that don't work:
-
Re-download the file from the same source without verifying its checksum
60% fail
If the source file is corrupted at the origin, re-downloading doesn't fix the underlying issue.
-
Increase memory allocation for the reader (e.g., spark.executor.memory)
90% fail
The error is about truncated data, not memory limits; more memory doesn't reconstruct missing bytes.
-
Use a different Parquet reader library (e.g., fastparquet instead of pyarrow)
95% fail
All readers will fail on the same truncated dictionary page because the file is structurally incomplete.
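The first dead end turns on checksum verification, which can be sketched with the standard library. This example assumes the source publishes a SHA-256 digest for the file; the helper names `sha256_of` and `matches_checksum` are hypothetical, and a different hash algorithm would need a different `hashlib` constructor.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare the file's digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If `matches_checksum` returns `True` for the downloaded copy and the file still fails to read, the corruption is at the origin and another download from the same source will not help.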