# Parquet Dictionary Page Truncated: Unexpected End of Stream

- **ID:** `data/parquet-dictionary-page-truncated`
- **Domain:** data
- **Category:** data_error
- **Error code:** `ParquetDecodingException`
- **Verification level:** ai_generated
- **Fix rate:** 80%

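## Root Cause

An incomplete write or a partial upload left the Parquet file's dictionary page only partly written, so the reader reaches EOF while it is still decoding that page.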
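A quick way to spot tail truncation is to check the file's framing. The sketch below assumes the standard Parquet layout (leading `PAR1`, then data, footer metadata, a 4-byte little-endian footer length, and a trailing `PAR1`). The function name `check_parquet_framing` is illustrative, not part of any library. It does not detect damage in the middle of the file, so a passing result is only a first filter.

```python
import os
import struct

# Magic bytes for Parquet files with a plaintext footer.
# Files with an encrypted footer use b"PARE" and are not handled here.
MAGIC = b"PAR1"


def check_parquet_framing(path):
    """Return (ok, reason) based only on the magic bytes and footer length."""
    size = os.path.getsize(path)
    if size < 12:  # 4 (magic) + 4 (footer length) + 4 (magic)
        return False, "file is shorter than the minimum Parquet framing"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC:
        return False, "missing leading magic bytes"
    if tail[4:] != MAGIC:
        return False, "missing trailing magic bytes (tail is likely truncated)"
    footer_len = struct.unpack("<I", tail[:4])[0]
    if footer_len + 12 > size:
        return False, "declared footer length exceeds the file size"
    return True, "framing looks intact; individual pages may still be damaged"
```

A file cut off during upload usually fails the trailing-magic check, because the footer is written last.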

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| parquet-mr 1.12.0 | active | — | — |
| pyarrow 14.0.0 | active | — | — |
| spark 3.4.0 | active | — | — |

## Solutions

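1. Verify file integrity with parquet-tools: `parquet-tools meta corrupted.parquet`. If it fails, re-upload the file from a known good source. Note that `meta` reads only the footer, so a clean result does not prove that the data and dictionary pages are intact.
2. Salvage the readable data with pyarrow. A plain `pq.read_table('corrupted.parquet')` decodes every column chunk and fails on the same truncated dictionary page, so it cannot skip the damage. Instead, open the file with `pq.ParquetFile`, read each row group with `read_row_group(i)`, keep the ones that decode, and write them to `repaired.parquet` with `pq.write_table`. This requires an intact footer, and the rows in any damaged row group are lost.
3. If using Spark, set `spark.sql.parquet.enableVectorizedReader=false` to fall back to the non-vectorized reader, which may tolerate some partially written files. It cannot recover bytes that are missing from the file.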

## Failed Attempts

- **Re-download the file from the same source without verifying its checksum**: If the file is already corrupted at the origin, downloading it again returns the same damaged bytes. (60% failure rate)
- **Increase memory allocation for the reader (e.g., `spark.executor.memory`)**: The error is about truncated data, not memory limits; more memory does not reconstruct missing bytes. (90% failure rate)
- **Use a different Parquet reader library (e.g., fastparquet instead of pyarrow)**: All readers fail on the same truncated dictionary page because the file is structurally incomplete. (95% failure rate)
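
Since re-downloading only helps when the source copy is intact, it can be worth comparing checksums before and after a transfer. The following is a minimal sketch using SHA-256 from the standard library. It assumes the source publishes or can compute a digest for the original file, and the helper names `sha256_of_file` and `matches_expected` are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a local file against a digest obtained from the source."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A mismatch points to a transfer problem. A match on a file that still fails to read suggests the file was already damaged when the digest was computed at the source.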
