ParquetDecodingException data data_error ai_generated true

Parquet dictionary page truncated — unexpected end of stream

ID: data/parquet-dictionary-page-truncated

Fix rate: 80%
Confidence: 85%
Evidence count: 1
First seen: 2023-11-15

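Version compatibility

| Version | Status | Introduced | Deprecated | Notes |
| --- | --- | --- | --- | --- |
| parquet-mr 1.12.0 | active | | | |
| pyarrow 14.0.0 | active | | | |
| spark 3.4.0 | active | | | |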

Root cause analysis

The Parquet file's dictionary page was not fully written, typically because of an interrupted write or a partial upload, so the reader hits EOF before the page ends.

generic
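
A truncated upload usually also cuts off the end of the file, and that can be checked without a Parquet library. The sketch below relies on the Parquet file layout, where an unencrypted file starts and ends with the 4-byte magic `PAR1` and stores a 4-byte little-endian footer length just before the trailing magic. The function name `check_parquet_trailer` is hypothetical. A passing result only shows that the trailer is present; it does not prove that every page inside the file is intact.

```python
import os
import struct

MAGIC = b"PAR1"  # marker at both ends of an unencrypted Parquet file


def check_parquet_trailer(path):
    """Return (ok, reason) based on the magic bytes and the footer length."""
    size = os.path.getsize(path)
    # 4-byte leading magic + 4-byte footer length + 4-byte trailing magic
    if size < 12:
        return False, "file too small to be a Parquet file"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        footer_len_bytes = f.read(4)
        tail = f.read(4)
    if head != MAGIC:
        return False, "missing leading magic bytes"
    if tail != MAGIC:
        return False, "missing trailing magic bytes (file likely truncated)"
    (footer_len,) = struct.unpack("<I", footer_len_bytes)
    if footer_len + 12 > size:
        return False, "footer length exceeds file size"
    return True, "trailer looks intact"
```

If the trailing magic is missing, the footer is gone and the file generally has to be fetched or written again.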

Official documentation

https://issues.apache.org/jira/browse/PARQUET-2300

Solutions

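  1. Verify file integrity using parquet-tools: `parquet-tools meta corrupted.parquet`. If the command fails, re-upload the file from a known-good source.
  2. Salvage the readable row groups with pyarrow. Calling `pq.read_table('corrupted.parquet')` reads every row group and raises the same error instead of skipping the broken dictionary page, so open the file with `pq.ParquetFile('corrupted.parquet')`, read row groups one at a time with `read_row_group(i)`, skip the ones that fail, and write the rest with `pq.write_table(table, 'repaired.parquet')`. This only works when the file footer is still readable; rows in the skipped row groups are lost.
  3. If using Spark, set `spark.sql.parquet.enableVectorizedReader=false` to fall back to the non-vectorized reader, which may handle some partial files. It cannot reconstruct bytes that are missing.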

Ineffective attempts

Common but ineffective approaches:

  1. Re-download the file from the same source without verifying its checksum (fails 60% of the time)

    If the source file is corrupted at the origin, re-downloading doesn't fix the underlying issue.

  2. Increase memory allocation for the reader (e.g., `spark.executor.memory`) (fails 90% of the time)

    The error is about truncated data, not memory limits; more memory doesn't reconstruct missing bytes.

  3. Use a different Parquet reader library (e.g., fastparquet instead of pyarrow) (fails 95% of the time)

    All readers will fail on the same truncated dictionary page because the file is structurally incomplete.
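
For the first item, a re-download is only meaningful when the copy can be compared against a checksum published by the source. A minimal sketch, assuming the source provides a SHA-256 hex digest; the function names `sha256_of_file` and `matches_checksum` are hypothetical:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare the file's digest with the digest published by the source."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If the digests match and the file still fails to decode, the corruption is at the origin and another download from that source will not help.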