# Parquet row group statistics inaccuracy leads to false predicate pushdown pruning

- **ID:** `data/parquet-statistics-accuracy`
- **Domain:** data
- **Category:** data_error
- **Verification:** ai_generated
- **Fix Rate:** 80%

## Root Cause

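The min/max statistics stored in Parquet row group metadata can be approximate or stale. A query engine that trusts them for predicate pushdown may then skip row groups that actually contain matching rows, so the query returns incomplete results without raising an error.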
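
The failure mode can be illustrated with a small pure-Python simulation. This is a sketch of the general pruning idea, not the implementation of any particular engine; the `RowGroup` class and both helper functions are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group (hypothetical, for illustration)."""
    stat_min: int   # minimum recorded in the metadata
    stat_max: int   # maximum recorded in the metadata
    values: list    # the data actually stored in the row group


def may_contain_greater_than(rg, threshold):
    """Pruning check for `col > threshold`: only the recorded max is consulted."""
    return rg.stat_max > threshold


def scan_greater_than(row_groups, threshold):
    """Return matching values, skipping row groups the statistics rule out."""
    hits = []
    for rg in row_groups:
        if not may_contain_greater_than(rg, threshold):
            continue  # pruned: the values are never read
        hits.extend(v for v in rg.values if v > threshold)
    return hits


groups = [
    RowGroup(stat_min=1, stat_max=90, values=[1, 40, 90]),   # accurate statistics
    RowGroup(stat_min=5, stat_max=80, values=[5, 80, 150]),  # stale: real max is 150
]

# With pruning, the second row group is skipped and the value 150 is lost.
pruned_result = scan_greater_than(groups, 100)

# Without pruning, every value is checked against the predicate.
full_result = [v for rg in groups for v in rg.values if v > 100]
```

Here `pruned_result` is empty while `full_result` contains `150`, which is the same silent discrepancy a query engine produces when it trusts a recorded maximum that is lower than the real one.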

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| Apache Parquet 1.12.0 | active | — | — |
| Apache Spark 3.3.0 | active | — | — |
| Apache Arrow 12.0.0 | active | — | — |
| DuckDB 0.8.0 | active | — | — |

## Workarounds

1. **Disable predicate pushdown for the affected query with a session property** (95% success)
   ```sql
   SET parquet.pushdown=false;
   SELECT * FROM table WHERE col > 100;
   ```
   The exact property name depends on the query engine; in Spark the corresponding setting is `spark.sql.parquet.filterPushdown`.
2. **Rewrite the Parquet file with PyArrow so that statistics are recomputed from the data** (85% success)
   ```python
   import pyarrow.parquet as pq

   # Read the affected file in full, then write it back with statistics enabled
   table = pq.read_table('bad.parquet')
   pq.write_table(table, 'fixed.parquet', write_statistics=True)
   ```
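
To decide whether a file is affected at all, the recorded statistics can be compared with the values actually stored in each row group. The following is a minimal sketch, assuming a flat (non-nested) column of a simple numeric or string type; `bounds_exclude_data` and `audit_row_groups` are hypothetical helper names, and PyArrow is imported inside the function so the comparison helper stays usable on its own.

```python
def bounds_exclude_data(stat_min, stat_max, actual_min, actual_max):
    """Return True when the recorded bounds fail to enclose the actual values.

    Bounds that are wider than the data are safe for pruning; only bounds
    that are too narrow can cause a row group to be skipped incorrectly.
    """
    return stat_min > actual_min or stat_max < actual_max


def audit_row_groups(path, column):
    """Return the indices of row groups whose min/max statistics look wrong."""
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(path)
    suspect = []
    for i in range(pf.metadata.num_row_groups):
        rg = pf.metadata.row_group(i)
        # Locate the column chunk by its schema path.
        chunk = next(
            (rg.column(j) for j in range(rg.num_columns)
             if rg.column(j).path_in_schema == column),
            None,
        )
        if chunk is None:
            continue
        stats = chunk.statistics
        if stats is None or not stats.has_min_max:
            continue  # nothing recorded, so nothing to prune on
        # Read the real values for this row group and compute their range.
        data = pf.read_row_group(i, columns=[column]).column(column)
        actual = pc.min_max(data).as_py()
        if actual["min"] is None:
            continue  # row group holds only nulls
        if bounds_exclude_data(stats.min, stats.max, actual["min"], actual["max"]):
            suspect.append(i)
    return suspect
```

Calling `audit_row_groups('bad.parquet', 'col')` returns a list of row group indices; an empty list suggests the statistics for that column enclose the data, in which case missing rows probably have another cause.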

## Dead Ends

- **Setting `spark.sql.parquet.filterPushdown=false` globally**: Disabling predicate pushdown for every query gives up the performance benefit of row group pruning across all workloads, not only for the affected query. (90% fail)
- **Running `VACUUM` or `OPTIMIZE` on the table on the assumption that it recalculates statistics**: Rebuilding statistics with `OPTIMIZE` does not fix the root cause if the writer originally wrote approximate statistics. (70% fail)
