data data_error ai_generated true

Inaccurate Parquet row group statistics cause predicate pushdown to prune matching data

ID: data/parquet-statistics-accuracy

Fix rate: 80%
Confidence: 85%
Evidence count: 1
First seen: 2023-06-15

Version compatibility

Version                 Status   Introduced   Deprecated   Notes
Apache Parquet 1.12.0   active
Apache Spark 3.3.0      active
Apache Arrow 12.0.0     active
DuckDB 0.8.0            active

Root cause analysis

The min/max statistics stored in Parquet file metadata can be approximate or stale. A query engine that trusts them during predicate pushdown then skips row groups that actually contain matching rows, so the query returns incomplete results without raising an error.

generic

Official documentation

https://parquet.apache.org/docs/file-format/metadata/statistics/

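Solutions

  1. Disable predicate pushdown for the affected query only, through a session-level setting. The property name is engine-specific; in Spark SQL it is spark.sql.parquet.filterPushdown: SET spark.sql.parquet.filterPushdown=false; SELECT * FROM table WHERE col > 100;
  2. Rewrite the Parquet file with PyArrow so that the statistics are recomputed from the data. Reading without filters loads every row group, so nothing is pruned during the rewrite: import pyarrow.parquet as pq; table = pq.read_table('bad.parquet'); pq.write_table(table, 'fixed.parquet', write_statistics=True)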
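
For the first solution, one way to keep the override scoped to a single query from Python is a small context manager that restores the previous value afterwards. The sketch below assumes the Spark property spark.sql.parquet.filterPushdown and a configuration object exposing `get(key)` and `set(key, value)`, as PySpark's `spark.conf` does. `pushdown_disabled` and `DictConf` are hypothetical names; `DictConf` is only a stand-in so the sketch runs without a Spark session.

```python
from contextlib import contextmanager

PUSHDOWN_KEY = "spark.sql.parquet.filterPushdown"


@contextmanager
def pushdown_disabled(conf, key=PUSHDOWN_KEY):
    """Temporarily set `key` to "false" on `conf`, restoring the old value on exit."""
    previous = conf.get(key)
    conf.set(key, "false")
    try:
        yield
    finally:
        conf.set(key, previous)  # restored even if the query raises


class DictConf:
    """Stand-in for spark.conf (assumption: same get/set interface)."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value


conf = DictConf({PUSHDOWN_KEY: "true"})
with pushdown_disabled(conf):
    inside = conf.get(PUSHDOWN_KEY)  # the affected query would run here
after = conf.get(PUSHDOWN_KEY)
```

With a real session, the query should be executed inside the `with` block (for example by calling an action such as `collect()`), not merely defined there, because Spark evaluates queries lazily.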

Failed attempts

Common approaches that do not work:

  1. Setting spark.sql.parquet.filterPushdown=false globally (fails 90% of the time)

    Disabling predicate pushdown entirely removes the performance benefit of row group pruning.

  2. Running VACUUM or OPTIMIZE on the table, assuming it recalculates statistics (fails 70% of the time)

    Even when OPTIMIZE rewrites the files, the root cause remains if the writer itself produces approximate statistics.