data data_error ai_generated true

Inaccurate Parquet row group statistics cause incorrect pruning under predicate pushdown

ID: data/parquet-statistics-accuracy


Fix rate: 80%

Confidence: 85%

Evidence count: 1

First seen: 2023-06-15

Version compatibility

Version	Status	Introduced	Deprecated	Notes
Apache Parquet 1.12.0	active	—	—	—
Apache Spark 3.3.0	active	—	—	—
Apache Arrow 12.0.0	active	—	—	—
DuckDB 0.8.0	active	—	—	—

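Root cause analysis

The min/max statistics in the Parquet file metadata are approximate or stale, so query engines incorrectly skip row groups that actually contain matching data. Pruning is only safe when the recorded range covers every value in the row group; if the recorded max is lower than the true max, a filter such as col > 100 can skip rows it should return.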
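The failure mode can be illustrated with a small pure-Python model of row group pruning. This is only a sketch of the idea, not any engine's actual implementation; the RowGroup class and both scan functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    values: List[int]  # data actually stored in the row group
    stat_min: int      # min recorded in the file metadata
    stat_max: int      # max recorded in the file metadata


def scan_with_pushdown(row_groups: List[RowGroup], threshold: int) -> List[int]:
    """Evaluate `col > threshold`, skipping groups whose recorded max rules them out."""
    result = []
    for rg in row_groups:
        if rg.stat_max <= threshold:
            continue  # pruned from metadata alone; the values are never read
        result.extend(v for v in rg.values if v > threshold)
    return result


def scan_without_pushdown(row_groups: List[RowGroup], threshold: int) -> List[int]:
    """Evaluate `col > threshold` by reading every row group."""
    return [v for rg in row_groups for v in rg.values if v > threshold]


groups = [
    RowGroup(values=[5, 40, 90], stat_min=5, stat_max=90),      # accurate statistics
    RowGroup(values=[20, 150, 300], stat_min=20, stat_max=80),  # stale max, true max is 300
]

pushed = scan_with_pushdown(groups, 100)
full = scan_without_pushdown(groups, 100)
print(pushed, full)
```

With the stale max of 80 on the second group, the pushdown scan skips both groups and returns nothing, while the full scan finds the two matching values.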

generic

Official documentation

https://parquet.apache.org/docs/file-format/metadata/statistics/

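Solutions

Disable predicate pushdown at session scope for the affected query only, rather than globally: SET parquet.pushdown=false; SELECT * FROM table WHERE col > 100; The property name depends on the engine (in Spark SQL it is spark.sql.parquet.filterPushdown), so check the engine's documentation and restore the setting once the query has run.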

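Rewrite the Parquet file so that the statistics are recomputed from the data, for example with PyArrow. Reading without a filter loads every row group, so the stale statistics are not consulted: import pyarrow.parquet as pq; table = pq.read_table('bad.parquet'); pq.write_table(table, 'fixed.parquet', write_statistics=True)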

Ineffective attempts

Common but ineffective approaches:

Setting spark.sql.parquet.filterPushdown=false globally (90% failure rate)
Disabling predicate pushdown entirely removes the performance benefit of row group pruning.
Running VACUUM or OPTIMIZE on the table on the assumption that it recalculates statistics (70% failure rate)
Rebuilding statistics with OPTIMIZE does not fix the root cause if the writer originally wrote approximate statistics.