# Inaccurate Parquet row group statistics cause predicate pushdown to prune incorrectly

- **ID:** `data/parquet-statistics-accuracy`
- **Domain:** data
- **Category:** data_error
- **Verification level:** ai_generated
- **Fix rate:** 80%

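## Root cause

The min/max statistics stored in the Parquet file metadata are approximate or stale, so the query engine wrongly skips row groups that actually contain matching rows.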
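To make the failure mode concrete, here is a toy model of min/max pruning in plain Python. It is a sketch only and does not reproduce any particular engine's implementation; `RowGroup`, `can_skip`, `scan_with_pushdown`, and `scan_without_pushdown` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    values: list   # the data actually stored in the row group
    stat_min: int  # minimum recorded in the file metadata
    stat_max: int  # maximum recorded in the file metadata


def can_skip(rg, threshold):
    """Pruning rule for `col > threshold`: skip when the recorded max cannot match."""
    return rg.stat_max <= threshold


def scan_with_pushdown(row_groups, threshold):
    """Trust the recorded statistics and only read row groups that may match."""
    out = []
    for rg in row_groups:
        if can_skip(rg, threshold):
            continue  # row group is never read
        out.extend(v for v in rg.values if v > threshold)
    return out


def scan_without_pushdown(row_groups, threshold):
    """Ignore statistics and filter every value."""
    return [v for rg in row_groups for v in rg.values if v > threshold]


row_groups = [
    RowGroup(values=[5, 50, 90], stat_min=5, stat_max=90),       # accurate statistics
    RowGroup(values=[10, 150, 300], stat_min=10, stat_max=100),  # recorded max is too low
]

print(scan_with_pushdown(row_groups, 100))     # [] because both row groups are skipped
print(scan_without_pushdown(row_groups, 100))  # [150, 300]
```

The second row group is pruned because its recorded maximum (100) does not exceed the threshold, even though it holds 150 and 300. The query returns fewer rows than it should and raises no error.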

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| Apache Parquet 1.12.0 | active | — | — |
| Apache Spark 3.3.0 | active | — | — |
| Apache Arrow 12.0.0 | active | — | — |
| DuckDB 0.8.0 | active | — | — |

## Solutions

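1. Disable predicate pushdown at session scope for the affected query only, then restore it. The property name is engine-specific; this example uses Spark SQL's `spark.sql.parquet.filterPushdown`, and `my_table` is a placeholder.
   ```sql
   SET spark.sql.parquet.filterPushdown=false;
   SELECT * FROM my_table WHERE col > 100;
   SET spark.sql.parquet.filterPushdown=true;
   ```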
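2. Rewrite the Parquet file with PyArrow so that the statistics are recomputed from the data.
   ```python
   import pyarrow.parquet as pq

   # Read without a filter so no row group is pruned, then rewrite with fresh statistics.
   table = pq.read_table('bad.parquet')
   pq.write_table(table, 'fixed.parquet', write_statistics=True)
   ```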

## Failed attempts

- **Setting `spark.sql.parquet.filterPushdown=false` globally**: Disabling predicate pushdown for every query removes the performance benefit of row group pruning, including for files whose statistics are correct. (90% failure rate)
- **Running `VACUUM` or `OPTIMIZE` on the table on the assumption that it recalculates statistics**: Rebuilding statistics with `OPTIMIZE` does not fix the root cause if the writer originally wrote approximate statistics. (70% failure rate)
