data
data_error
ai_generated
true
Parquet row group statistics inaccuracy leads to false predicate pushdown pruning
ID: data/parquet-statistics-accuracy
Fix Rate: 80%
Confidence: 85%
Evidence: 1
First Seen: 2023-06-15
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Parquet 1.12.0 | active | — | — | — |
| Apache Spark 3.3.0 | active | — | — | — |
| Apache Arrow 12.0.0 | active | — | — | — |
| DuckDB 0.8.0 | active | — | — | — |
Root Cause
Parquet file metadata contains min/max statistics that are approximate or stale, causing query engines to incorrectly skip row groups that actually contain matching data.
Official Documentation
https://parquet.apache.org/docs/file-format/metadata/statistics/
Workarounds
-
95% success
Disable predicate pushdown for the affected query only by setting a session property (the exact property name depends on the engine): SET parquet.pushdown=false; SELECT * FROM table WHERE col > 100;
-
85% success
Rewrite the Parquet file with PyArrow so that the statistics are recomputed from the data: import pyarrow.parquet as pq; table = pq.read_table('bad.parquet'); pq.write_table(table, 'fixed.parquet', write_statistics=True)
Dead Ends
Common approaches that don't work:
-
Setting spark.sql.parquet.filterPushdown=false globally
90% fail
Disabling predicate pushdown globally removes the performance benefit of row group pruning for every query, including those that read files with correct statistics.
-
Running VACUUM or OPTIMIZE on the table assuming it recalculates statistics
70% fail
Rebuilding statistics with OPTIMIZE does not fix the root cause if the writer originally wrote approximate statistics.