data data_error ai_generated true

Parquet row group statistics inaccuracy leads to false predicate pushdown pruning

ID: data/parquet-statistics-accuracy

Fix Rate: 80%
Confidence: 85%
Evidence: 1
First Seen: 2023-06-15

Version Compatibility

Version                 Status   Introduced   Deprecated   Notes
Apache Parquet 1.12.0   active
Apache Spark 3.3.0      active
Apache Arrow 12.0.0     active
DuckDB 0.8.0            active

Root Cause

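The min/max statistics stored in a Parquet file's row group metadata can be approximate or stale. A query engine that trusts them during predicate pushdown may conclude that a row group cannot match the filter and skip it, even though the row group actually contains matching rows. The query then silently returns too few results.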
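The pruning decision described above can be illustrated with a small pure-Python simulation. This is only a sketch: `RowGroup` and `scan_greater_than` are hypothetical names invented for the illustration, not part of any Parquet library, and real engines apply more elaborate rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group: recorded stats plus real values."""
    stat_min: int
    stat_max: int
    values: List[int]


def scan_greater_than(groups: List[RowGroup], threshold: int, use_stats: bool = True) -> List[int]:
    """Return all values > threshold, optionally pruning row groups by their recorded max."""
    matches = []
    for group in groups:
        # Pruning rule: if the recorded max is not above the threshold,
        # the engine assumes no value in the group can match and skips it.
        if use_stats and group.stat_max <= threshold:
            continue
        matches.extend(v for v in group.values if v > threshold)
    return matches


groups = [
    RowGroup(stat_min=1, stat_max=50, values=[1, 20, 50]),     # accurate statistics
    RowGroup(stat_min=60, stat_max=90, values=[60, 90, 150]),  # stale max: real max is 150
]

with_pruning = scan_greater_than(groups, 100, use_stats=True)
without_pruning = scan_greater_than(groups, 100, use_stats=False)
print(with_pruning, without_pruning)
```

With pruning enabled, the second row group is skipped because its recorded max (90) is below the threshold, so the matching value 150 is lost. Scanning without statistics finds it.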

generic

Official Documentation

https://parquet.apache.org/docs/file-format/metadata/statistics/

Workarounds

  1. (95% success) Disable predicate pushdown for the affected query only, using a session-level property: SET parquet.pushdown=false; SELECT * FROM table WHERE col > 100;
    The property name depends on the engine (in Spark it is spark.sql.parquet.filterPushdown). Restore the setting after the query so other queries keep the benefit of row group pruning.
  2. (85% success) Rewrite the Parquet file so its statistics are recomputed from the data, using PyArrow: import pyarrow.parquet as pq; table = pq.read_table('bad.parquet'); pq.write_table(table, 'fixed.parquet', write_statistics=True)
    Reading without a filter scans every row group, so no rows are lost to the bad statistics during the rewrite.

Dead Ends

Common approaches that don't work:

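  1. Setting spark.sql.parquet.filterPushdown=false globally (90% fail)

    Disabling predicate pushdown for every query removes the performance benefit of row group pruning everywhere, including for files whose statistics are correct, and leaves the affected files unrepaired.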

  2. Running VACUUM or OPTIMIZE on the table on the assumption that it recalculates statistics (70% fail)

    Running OPTIMIZE does not fix the root cause: if the writer itself produces approximate statistics, the files it writes will carry the same kind of statistics.