# Parquet row group size mismatch causes memory error during read

- **ID:** `data/parquet-row-group-size-mismatch`
- **Domain:** data
- **Category:** resource_error
- **Error Code:** `OutOfMemoryError`
- **Verification:** ai_generated
- **Fix Rate:** 86%

## Root Cause

Parquet files written with very large row groups (e.g., >1 GB) can exhaust memory when read by systems that load an entire row group at once. The risk is highest in memory-constrained settings such as Spark executors or single-process pandas jobs.
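
To confirm that oversized row groups are the cause, the file footer can be inspected without loading any data. The following is a minimal sketch, assuming pyarrow is installed; the helper names `row_group_byte_sizes` and `oversized_row_groups` and the 1 GiB threshold in the usage note are illustrative, not part of any library.

```python
def row_group_byte_sizes(path):
    """Return the uncompressed byte size of each row group in a Parquet file."""
    # Imported lazily so the pure helper below can be used without pyarrow.
    import pyarrow.parquet as pq

    # Opening a ParquetFile reads only the footer metadata, not the row data.
    metadata = pq.ParquetFile(path).metadata
    return [
        metadata.row_group(i).total_byte_size
        for i in range(metadata.num_row_groups)
    ]


def oversized_row_groups(sizes, limit_bytes):
    """Return (index, size) pairs for row groups larger than limit_bytes."""
    return [(i, size) for i, size in enumerate(sizes) if size > limit_bytes]
```

For example, `oversized_row_groups(row_group_byte_sizes("large.parquet"), 1024**3)` would list the row groups whose uncompressed size exceeds 1 GiB.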

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| Apache Spark 3.4 | active | — | — |
| Apache Spark 3.5 | active | — | — |
| pyarrow 13.0.0 | active | — | — |
| pandas 2.1.4 | active | — | — |

## Workarounds

1. **Rewrite the Parquet file with a smaller row group size** (90% success). Note that `pq.read_table` loads the whole file, so run the rewrite on a machine with enough memory.
   ```python
   import pyarrow.parquet as pq

   # Load the file, then rewrite it with at most 100,000 rows per row group.
   table = pq.read_table("large.parquet")
   pq.write_table(table, "small.parquet", row_group_size=100_000)
   ```
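2. **Set the Parquet row group size in Spark and repartition the data before writing** (85% success). `spark.sql.parquet.rowGroupSize` is not a documented Spark setting; the row group size is controlled by the Parquet-Hadoop property `parquet.block.size` (in bytes), which can be passed as a write option.
   ```python
   # Assumes an existing SparkSession `spark`; paths and partition count are placeholders.
   df = spark.read.parquet("large.parquet")
   (df.repartition(64)
      .write
      .option("parquet.block.size", 256 * 1024 * 1024)  # target 256 MB row groups
      .parquet("resized.parquet"))
   ```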
3. **Read the file one row group at a time with pyarrow's `read_row_group`** (88% success). Each row group must still fit in memory on its own.
   ```python
   import pyarrow.parquet as pq

   parquet_file = pq.ParquetFile("large.parquet")
   for i in range(parquet_file.metadata.num_row_groups):
       table = parquet_file.read_row_group(i)  # holds only row group i
       # process `table` here before moving on to the next row group
   ```
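
pyarrow's `row_group_size` argument is a row count, not a byte size, so a byte target has to be converted using the file's average row width. The sketch below shows that arithmetic; the function name `rows_per_row_group` is hypothetical, and the byte and row totals are assumed to come from the file's metadata (uncompressed sizes).

```python
def rows_per_row_group(target_bytes: int, total_bytes: int, total_rows: int) -> int:
    """Estimate how many rows fit in a row group of roughly target_bytes.

    total_bytes and total_rows describe the existing file; the estimate
    assumes rows are of similar width throughout.
    """
    if target_bytes <= 0 or total_bytes <= 0 or total_rows <= 0:
        raise ValueError("all arguments must be positive")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    rows = target_bytes * total_rows // total_bytes
    return max(1, rows)  # always write at least one row per group


# A 2 GiB file with 1,000,000 rows, targeting 128 MiB row groups.
row_group_size = rows_per_row_group(128 * 1024**2, 2 * 1024**3, 1_000_000)
```

The result can be passed as `row_group_size` to `pq.write_table`. Files with highly variable row widths may need a more conservative target.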

## Dead Ends

- **Increasing executor or JVM heap memory**: This only postpones the problem; row group sizes are unbounded, so a larger row group can still exceed the increased heap and cause OOM again. (65% fail)
- **Changing the shuffle partition count (e.g., `spark.sql.shuffle.partitions`)**: This setting adjusts shuffle partitions, not Parquet row group sizes, so it does not affect how row groups are read. (80% fail)
- **Applying or changing compression**: Compression reduces file size on disk, but row groups are decompressed in memory, so the memory footprint remains the same. (75% fail)
