data
data_error
ai_generated
true
Inaccurate Parquet row group statistics cause incorrect predicate pushdown pruning
ID: data/parquet-statistics-accuracy
Fix rate: 80%
Confidence: 85%
Evidence count: 1
First seen: 2023-06-15
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Parquet 1.12.0 | active | — | — | — |
| Apache Spark 3.3.0 | active | — | — | — |
| Apache Arrow 12.0.0 | active | — | — | — |
| DuckDB 0.8.0 | active | — | — | — |
Root cause analysis
The min/max statistics stored in Parquet file metadata are approximate or stale, so query engines incorrectly skip row groups that actually contain matching data.
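
The failure mode can be illustrated without any Parquet library. The sketch below is a simplified model, not any engine's actual implementation: `RowGroup`, `can_skip`, and `scan_greater_than` are hypothetical names, and the pruning rule assumes a single `col > threshold` predicate on an integer column.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    stat_min: int   # min recorded in the metadata
    stat_max: int   # max recorded in the metadata
    values: list    # the data actually stored in the row group


def can_skip(rg, threshold):
    """Pruning rule for `col > threshold`: skip if the recorded max cannot match."""
    return rg.stat_max <= threshold


def scan_greater_than(row_groups, threshold, use_stats=True):
    """Return all values > threshold, optionally pruning row groups by statistics."""
    out = []
    for rg in row_groups:
        if use_stats and can_skip(rg, threshold):
            continue  # row group is never read
        out.extend(v for v in rg.values if v > threshold)
    return out


row_groups = [
    RowGroup(stat_min=1, stat_max=90, values=[1, 40, 90]),     # accurate statistics
    RowGroup(stat_min=10, stat_max=95, values=[10, 95, 150]),  # stale: real max is 150
]

with_pruning = scan_greater_than(row_groups, 100)                      # misses 150
without_pruning = scan_greater_than(row_groups, 100, use_stats=False)  # finds 150
```

The second row group is skipped because its recorded max (95) is below the threshold, even though it holds the value 150, so the pruned scan silently returns fewer rows than the full scan.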
Official documentation
https://parquet.apache.org/docs/file-format/metadata/statistics/
Solutions
- Disable predicate pushdown with a session property for the affected query only: SET parquet.pushdown=false; SELECT * FROM table WHERE col > 100; (the property name is engine-specific; Spark SQL, for example, uses spark.sql.parquet.filterPushdown)
- Rewrite the Parquet file with PyArrow so that statistics are recomputed from the data: import pyarrow.parquet as pq; table = pq.read_table('bad.parquet'); pq.write_table(table, 'fixed.parquet', write_statistics=True)
Ineffective attempts
Common but ineffective approaches:
- Setting spark.sql.parquet.filterPushdown=false globally
  Failure rate: 90%
  Disabling predicate pushdown globally gives up the performance benefit of row group pruning for every query, not only the affected one.
- Running VACUUM or OPTIMIZE on the table on the assumption that it recalculates statistics
  Failure rate: 70%
  Rebuilding statistics with OPTIMIZE does not fix the root cause if the writer produced approximate statistics in the first place.
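
One way to avoid the global setting while still working around a bad file is to scope the change to a single query and restore the previous value afterwards. The sketch below assumes a config object exposing `get(key)` and `set(key, value)`, which is the shape of PySpark's `spark.conf`; `scoped_conf` is a hypothetical helper and `FakeConf` is a stand-in used only so the example runs without a Spark session.

```python
from contextlib import contextmanager


@contextmanager
def scoped_conf(conf, key, value):
    """Temporarily set a config key, restoring the previous value on exit."""
    previous = conf.get(key)
    conf.set(key, value)
    try:
        yield
    finally:
        conf.set(key, previous)


class FakeConf:
    """Hypothetical stand-in for a session config object (demo only)."""

    def __init__(self):
        self._values = {"spark.sql.parquet.filterPushdown": "true"}

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value


KEY = "spark.sql.parquet.filterPushdown"
conf = FakeConf()

with scoped_conf(conf, KEY, "false"):
    inside = conf.get(KEY)   # pushdown disabled only inside this block
after = conf.get(KEY)        # previous value restored
```

With a real PySpark session, `spark.conf` would be passed in place of `FakeConf()`. Because Spark evaluates lazily, the action that runs the query (for example `collect()`) would most likely need to execute inside the `with` block for the setting to apply.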