Tags: ParquetDecodingException, data, data_error, ai_generated: true

Parquet dictionary page truncated — unexpected end of stream

ID: data/parquet-dictionary-page-truncated

Fix Rate: 80%
Confidence: 85%
Evidence: 1
First Seen: 2023-11-15

Version Compatibility

Version             Status   Introduced   Deprecated   Notes
parquet-mr 1.12.0   active
pyarrow 14.0.0      active
spark 3.4.0         active

Root Cause

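The dictionary page of a Parquet column chunk was not fully written, typically because a write was interrupted or an upload completed only partially. The reader therefore reaches end of file (EOF) before it has read the number of bytes the page header declares.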
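
A quick way to tell a file that was cut off at the end from one that is damaged in the middle is to inspect its framing. The sketch below relies on the Parquet layout for unencrypted files: a leading 4-byte magic `PAR1`, and a trailing 4-byte little-endian footer length followed by `PAR1` again. The function name `check_parquet_envelope` is hypothetical and is not part of any library.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes of an unencrypted Parquet file


def check_parquet_envelope(path):
    """Return a list of problems found in the file's header/footer framing."""
    problems = []
    size = os.path.getsize(path)
    # Smallest possible file: 4-byte magic + 4-byte footer length + 4-byte magic.
    if size < 12:
        return [f"file is only {size} bytes, too small to be Parquet"]
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            problems.append("leading magic bytes missing")
        f.seek(-8, os.SEEK_END)
        footer_len_bytes, tail_magic = f.read(4), f.read(4)
    if tail_magic != MAGIC:
        problems.append("trailing magic bytes missing (file likely truncated)")
    else:
        (footer_len,) = struct.unpack("<I", footer_len_bytes)
        if footer_len + 12 > size:
            problems.append("footer length exceeds file size")
    return problems
```

An empty result only means the framing looks plausible. A dictionary page inside the file can still be short or corrupted, so a full read of the data is needed to confirm the file is usable.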

generic

Official Documentation

https://issues.apache.org/jira/browse/PARQUET-2300

Workarounds

  1. 70% success: Check the file with parquet-tools: `parquet-tools meta corrupted.parquet`. This command reads the footer metadata, so a failure indicates that the end of the file is missing or damaged. If it fails, re-upload the file from a known good source.
  2. 80% success: Salvage the readable row groups with pyarrow. A plain `pq.read_table('corrupted.parquet')` reads every row group and fails on the truncated page, so it does not skip the broken dictionary by itself. Instead, open the file with `pq.ParquetFile('corrupted.parquet')`, read row groups one at a time with `read_row_group(i)`, and write the ones that succeed to `repaired.parquet` with `pq.write_table`. This requires an intact footer, and the rows in the damaged row group are lost.
  3. 50% success: If using Spark, set `spark.sql.parquet.enableVectorizedReader=false` to fall back to the non-vectorized reader. It uses a different decoding path and may get past some partially written files, but it cannot recover bytes that are missing.

Dead Ends

Common approaches that don't work:

  1. Re-download the file from the same source without verifying its checksum (60% fail)

    If the source file is corrupted at the origin, re-downloading doesn't fix the underlying issue.

  2. Increase memory allocation for the reader, e.g. `spark.executor.memory` (90% fail)

    The error is about truncated data, not memory limits; more memory doesn't reconstruct missing bytes.

  3. Use a different Parquet reader library, e.g. fastparquet instead of pyarrow (95% fail)

    All readers will fail on the same truncated dictionary page because the file is structurally incomplete.
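
Regarding the first dead end, a re-download is only meaningful if the result is compared against a digest published by the origin. The sketch below assumes the source publishes a SHA-256 hex digest, which may not be the case for every storage system. The function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex):
    """Return True if the file's digest equals the digest published by the source."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If the digests match and the file still fails to decode, the copy at the origin is itself corrupted and the file has to be regenerated from upstream data.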