OutOfMemoryError data resource_error ai_generated true

Parquet row group size mismatch causes memory error during read

ID: data/parquet-row-group-size-mismatch


Fix Rate: 86%

Confidence: 84%

Evidence: 1

First Seen: 2024-02-28

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
Apache Spark 3.4	active	—	—	—
Apache Spark 3.5	active	—	—	—
pyarrow 13.0.0	active	—	—	—
pandas 2.1.4	active	—	—	—

Root Cause

Parquet files written with very large row groups (e.g., >1GB each) cause memory exhaustion when read by systems that load an entire row group into memory at once. Memory-constrained readers such as Spark executors or a single pandas process are hit hardest, because the whole decoded row group has to fit in memory.
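To check whether oversized row groups are actually the cause, the file footer can be inspected without reading any data. The following is a minimal sketch, assuming pyarrow is installed; the function names and the 1 GiB default threshold are illustrative choices, not part of any library.

```python
def row_group_sizes(path):
    """Return the uncompressed byte size of each row group, read from the footer only."""
    # pyarrow is imported lazily so the pure helper below works without it.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    return [meta.row_group(i).total_byte_size for i in range(meta.num_row_groups)]


def oversized_row_groups(sizes, limit_bytes=1024**3):
    """Return (index, size) pairs for row groups larger than limit_bytes."""
    return [(i, size) for i, size in enumerate(sizes) if size > limit_bytes]
```

For example, `oversized_row_groups(row_group_sizes('large.parquet'))` lists the row groups a reader would struggle with; an empty list suggests the memory error has a different cause.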


Official Documentation

https://parquet.apache.org/docs/file-format/

Workarounds

90% success Rewrite the Parquet file with a smaller row group size. In pyarrow, `row_group_size` is the maximum number of rows per row group, not a byte count. `pq.read_table` loads the whole file, so run the rewrite on a machine with enough memory.
```
import pyarrow.parquet as pq

# Loads the entire file into memory
table = pq.read_table('large.parquet')
# Cap each row group at 100,000 rows
pq.write_table(table, 'small.parquet', row_group_size=100000)
```
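The one-shot rewrite above materializes the full table first, which can itself run out of memory. A streaming variant is sketched below, assuming pyarrow is installed; `rows_per_group` and `rewrite_streaming` are hypothetical helper names, and the average row size has to be estimated by the caller.

```python
def rows_per_group(target_bytes, avg_row_bytes):
    """Convert a byte budget into a row count, since pyarrow sizes row groups in rows."""
    return max(1, int(target_bytes // avg_row_bytes))


def rewrite_streaming(src, dst, max_rows):
    """Copy src to dst, writing row groups of at most max_rows rows each."""
    # Imported lazily so rows_per_group stays usable without pyarrow installed.
    import pyarrow.parquet as pq

    source = pq.ParquetFile(src)
    with pq.ParquetWriter(dst, source.schema_arrow) as writer:
        # iter_batches yields record batches of up to max_rows rows instead of
        # building one table for the whole file.
        for batch in source.iter_batches(batch_size=max_rows):
            writer.write_batch(batch)
```

A call such as `rewrite_streaming('large.parquet', 'small.parquet', rows_per_group(128 * 1024**2, 1024))` targets roughly 128 MiB per row group for rows averaging 1 KiB. The reader still has to decode the source file's row groups, so peak memory should track the largest source row group rather than the whole file.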
85% success Rewrite the data with Spark using a bounded row group size. Spark's Parquet writer takes the row group size, in bytes, from the Hadoop property `parquet.block.size`. Repartition the data before writing. Here `df`, `num_partitions`, and the output path are placeholders.
```
# parquet.block.size is the Parquet row group size in bytes
df.repartition(num_partitions).write.option('parquet.block.size', 256 * 1024 * 1024).parquet('small_row_groups.parquet')
```
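The repartition step needs a partition count. One way to pick it is to divide the estimated data size by a per-partition byte target, as in this sketch. Both helper names and the byte targets are hypothetical, `df` is assumed to be a PySpark DataFrame, and the code assumes that writer options are passed through to the Hadoop configuration the Parquet writer reads.

```python
import math


def target_partitions(total_bytes, partition_bytes):
    """Number of partitions so that each holds roughly partition_bytes of data."""
    return max(1, math.ceil(total_bytes / partition_bytes))


def write_with_row_group_size(df, path, row_group_bytes, num_partitions):
    """Repartition a Spark DataFrame and write Parquet with a row group size cap."""
    (
        df.repartition(num_partitions)
        .write
        # parquet.block.size is the Parquet row group size in bytes.
        .option("parquet.block.size", row_group_bytes)
        .parquet(path)
    )
```

For roughly 10 GiB of data and 512 MiB per partition, `target_partitions(10 * 1024**3, 512 * 1024**2)` gives 20 partitions, which can then be passed as `num_partitions`.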
88% success Read the Parquet file one row group at a time with pyarrow's `read_row_group` method. This helps only if each individual row group fits in memory.
```
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('large.parquet')
for i in range(parquet_file.metadata.num_row_groups):
    # Only one row group is held in memory at a time
    table = parquet_file.read_row_group(i)
```
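When even a single row group is too large to handle comfortably, reading fixed-size record batches and only the needed columns can reduce the working set further. The sketch below assumes pyarrow is installed; `iter_row_batches` and `total_rows` are hypothetical names, and `total_rows` is just a stand-in for whatever per-batch processing is required.

```python
def iter_row_batches(path, batch_size=65536, columns=None):
    """Yield record batches of at most batch_size rows, optionally for a subset of columns."""
    # Imported lazily so the consumer below can be used without pyarrow installed.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    yield from parquet_file.iter_batches(batch_size=batch_size, columns=columns)


def total_rows(batches):
    """Example consumer: count rows while holding only one batch at a time."""
    return sum(batch.num_rows for batch in batches)
```

A call such as `total_rows(iter_row_batches('large.parquet', columns=['id']))` processes the file batch by batch, where `'id'` is a placeholder column name.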


Dead Ends

Common approaches that don't work:

65% fail
Increasing heap memory: this only postpones the problem. Nothing caps the row group size, so a larger row group can still exceed the increased heap and cause the OOM again.
80% fail
Changing the shuffle partition setting: this adjusts shuffle partitions, not Parquet row group sizes, and does not affect how row groups are read.
75% fail
Applying compression: compression reduces file size on disk, but row groups are decompressed in memory, so the memory footprint remains the same.
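The compression dead end can be illustrated with the standard library alone. This is an analogy rather than a Parquet reader: `zlib` stands in for a Parquet codec, and the repeated byte string stands in for a highly compressible column.

```python
import zlib

# Highly repetitive data, standing in for a column with many repeated values.
raw = b"status=OK;" * 1_000_000  # 10,000,000 bytes

compressed = zlib.compress(raw)          # roughly what would sit on disk
restored = zlib.decompress(compressed)   # what a reader must hold to use the data

on_disk_bytes = len(compressed)
in_memory_bytes = len(restored)
print(f"on disk: {on_disk_bytes} bytes, in memory: {in_memory_bytes} bytes")
```

The compressed form is a small fraction of the original, yet the decompressed buffer is back to its full size, which is why a smaller file on disk does not imply a smaller read-time footprint.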