ParquetDecodingException
data
data_error
ai_generated
true
Parquet dictionary page truncated — unexpected end of stream
ID: data/parquet-dictionary-page-truncated
Fix rate: 80%
Confidence: 85%
Evidence count: 1
First seen: 2023-11-15
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| parquet-mr 1.12.0 | active | — | — | — |
| pyarrow 14.0.0 | active | — | — | — |
| spark 3.4.0 | active | — | — | — |
Root cause analysis
The dictionary page of the Parquet file was not fully written, typically because of an incomplete write or a partial upload, so the reader hits EOF before it has consumed the whole page.
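A quick way to spot the most common form of truncation is to inspect the file's outer envelope. The sketch below relies on the Parquet layout for unencrypted files: a leading `PAR1` magic, and a tail made of a 4-byte little-endian footer length followed by `PAR1`. The function name `check_parquet_envelope` is hypothetical. It does not parse any page, so a file whose footer is intact but whose dictionary page is damaged mid-file would still pass.

```python
import os
import struct

MAGIC = b"PAR1"  # 4-byte marker at both ends of an unencrypted Parquet file


def check_parquet_envelope(path):
    """Return (ok, reason) based only on the file's first and last bytes."""
    size = os.path.getsize(path)
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if size < 12:
        return False, "file too small"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC:
        return False, "missing leading magic"
    if tail[4:] != MAGIC:
        return False, "missing trailing magic (file likely truncated)"
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len + 12 > size:
        return False, "footer length exceeds file size"
    return True, "envelope looks intact"
```

A partial upload usually cuts off the end of the file, so a failed trailing-magic check is a strong hint that the file has to be fetched again from an intact source.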
Official documentation
https://issues.apache.org/jira/browse/PARQUET-2300
Solutions
- Verify file integrity using parquet-tools: `parquet-tools meta corrupted.parquet`. If the command fails, re-upload the file from a known good source.
- Try to salvage the file with pyarrow by reading it and writing a fresh copy: `import pyarrow.parquet as pq; table = pq.read_table('corrupted.parquet', use_pandas_metadata=False); pq.write_table(table, 'repaired.parquet')`. Note that `read_table` raises on a truncated dictionary page rather than skipping it, so if this fails, read the row groups one at a time (for example with `pq.ParquetFile.read_row_group`) and keep only those that decode.
- If using Spark, set `spark.sql.parquet.enableVectorizedReader=false` to fall back to the non-vectorized reader, which may handle some partial files.
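The row-group salvage idea can be sketched as follows, assuming the file footer is still readable (otherwise `pq.ParquetFile` itself fails and nothing can be recovered this way). The helper names `salvage_row_groups` and `repair_file` are hypothetical. The sketch assumes pyarrow reports damaged pages through `OSError` or `ValueError` subclasses, which is how its I/O and `ArrowInvalid` errors are usually surfaced; widen the `except` clause if your version raises something else.

```python
def salvage_row_groups(reader):
    """Split row groups into readable tables and failed indices.

    `reader` is expected to behave like pyarrow.parquet.ParquetFile,
    i.e. expose `num_row_groups` and `read_row_group(i)`.
    """
    good, bad = [], []
    for i in range(reader.num_row_groups):
        try:
            good.append(reader.read_row_group(i))
        except (OSError, ValueError) as exc:
            # Record which row group failed and why, then keep going.
            bad.append((i, str(exc)))
    return good, bad


def repair_file(src, dst):
    """Hypothetical helper: copy every readable row group of src into dst."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    good, bad = salvage_row_groups(pq.ParquetFile(src))
    if good:
        pq.write_table(pa.concat_tables(good), dst)
    return len(good), bad
```

Calling `repair_file('corrupted.parquet', 'repaired.parquet')` would return the number of row groups kept and the list of those dropped. Rows in the dropped row groups are lost, so this is a fallback for when no intact copy of the file exists.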
Ineffective attempts
Common but ineffective approaches:
- Re-download the file from the same source without verifying the checksum (60% failure rate). If the file is corrupted at the source, re-downloading it does not fix the underlying issue.
- Increase memory allocation for the reader, e.g. `spark.executor.memory` (90% failure rate). The error is about truncated data, not memory limits; more memory does not reconstruct missing bytes.
- Use a different Parquet reader library, e.g. fastparquet instead of pyarrow (95% failure rate). All readers will fail on the same truncated dictionary page because the file is structurally incomplete.
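Since re-downloading only helps when the source copy is intact, it is worth comparing checksums before retrying. A minimal sketch, assuming the producer publishes a SHA-256 hex digest for the file (the entry does not say which checksum, if any, is available); `sha256_of` and `matches_expected` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large Parquet files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the local file's digest with the published one."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If the local digest differs from the published one, the transfer damaged the file and a new download is worthwhile. If both match and the file still fails to decode, the source copy itself is truncated and has to be regenerated.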