ParquetDecodingException data data_error ai_generated true

Parquet dictionary page truncated — unexpected end of stream

ID: data/parquet-dictionary-page-truncated

Fix rate: 80%

Confidence: 85%

Evidence count: 1

First seen: 2023-11-15

Version compatibility

Version	Status	Introduced	Deprecated	Notes
parquet-mr 1.12.0	active	—	—	—
pyarrow 14.0.0	active	—	—	—
spark 3.4.0	active	—	—	—

Root cause analysis

The dictionary page of the Parquet file was not fully written, because of an incomplete write or a partial upload, so the reader hits EOF before it has read the number of bytes the page header declares.

generic

Official documentation

https://issues.apache.org/jira/browse/PARQUET-2300

Solutions

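Verify file integrity with parquet-tools: `parquet-tools meta corrupted.parquet`. If the command fails, re-upload the file from a known good source. Note that `meta` reads only the footer, so a passing check does not prove that the data and dictionary pages are intact.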
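When parquet-tools is not at hand, a rough first check is possible with the standard library alone. The sketch below relies on the Parquet file layout: an unencrypted file begins and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The function name `check_parquet_envelope` is hypothetical. The check only catches files cut off at the tail; a page damaged in the middle of an otherwise complete file will still pass.

```python
import os
import struct

MAGIC = b"PAR1"  # 4-byte marker at both ends of an unencrypted Parquet file


def check_parquet_envelope(path):
    """Return (ok, reason) after checking the magic bytes and footer length."""
    size = os.path.getsize(path)
    if size < 12:  # 4 (leading magic) + 4 (footer length) + 4 (trailing magic)
        return False, "file too small"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC:
        return False, "missing leading magic"
    if tail[4:] != MAGIC:
        return False, "missing trailing magic (likely truncated)"
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len + 12 > size:
        return False, "footer length exceeds file size"
    return True, "envelope looks intact"
```

A typical call would be `ok, reason = check_parquet_envelope("corrupted.parquet")`, run before deciding whether a re-upload is needed.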

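Salvage the readable data with pyarrow. A plain `pq.read_table('corrupted.parquet')` decodes every column chunk and fails on the truncated dictionary page, so it cannot skip the damage by itself. If the footer is still readable, open the file with `pq.ParquetFile('corrupted.parquet')`, read each row group with `read_row_group(i)`, skip the row groups that raise, and write the remainder with `pq.write_table(table, 'repaired.parquet')`. Rows in the damaged row groups are lost.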

If using Spark, set `spark.sql.parquet.enableVectorizedReader=false` to fall back to the non-vectorized reader, which may handle some partial files. It cannot recover bytes that were never written.
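As a small illustration of applying the Spark setting, the sketch below renders settings as `--conf key=value` arguments for `spark-submit`. The helper name `spark_submit_conf_args` is hypothetical. The second setting, `spark.sql.files.ignoreCorruptFiles`, is not part of the original advice; it is a standard Spark SQL option that tells Spark to skip files it cannot read, and it is included here only as an assumption about what a job might want while the broken file is being replaced.

```python
def spark_submit_conf_args(settings):
    """Render a dict of Spark settings as spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # Spark expects lowercase booleans
        args += ["--conf", f"{key}={value}"]
    return args


fallback_settings = {
    # From the advice above: use the non-vectorized Parquet reader.
    "spark.sql.parquet.enableVectorizedReader": False,
    # Assumption, not from the original entry: skip unreadable files entirely.
    "spark.sql.files.ignoreCorruptFiles": True,
}
```

The resulting list can be spliced into a `spark-submit` command line. Skipping corrupt files silently drops their rows, so it is best treated as a stopgap.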

Ineffective attempts

Common but ineffective approaches:

Re-download the file from the same source without verifying the checksum (fails 60% of the time)
If the file is already corrupted at the origin, re-downloading does not fix the underlying issue.
Increase memory allocation for the reader (e.g., spark.executor.memory) (fails 90% of the time)
The error is about truncated data, not memory limits; more memory does not reconstruct missing bytes.
Use a different Parquet reader library (e.g., fastparquet instead of pyarrow) (fails 95% of the time)
All readers will fail on the same truncated dictionary page because the file is structurally incomplete.
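The first item above hinges on verifying a checksum before trusting a re-downloaded copy. A minimal sketch using only the standard library is shown below. It assumes the producer publishes a SHA-256 hex digest alongside the file; the function names and the choice of SHA-256 are illustrative, not something the original entry specifies.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large Parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with the digest published by the source."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If `matches_expected` returns `False` for a fresh download, the transfer is at fault; if it returns `True` and the file still fails to read, the copy at the origin is the damaged one.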