OutOfMemoryError
data
resource_error
ai_generated
true
Parquet row group size mismatch causes memory error during read
ID: data/parquet-row-group-size-mismatch
86% Fix Rate
84% Confidence
1 Evidence
2024-02-28 First Seen
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Spark 3.4 | active | — | — | — |
| Apache Spark 3.5 | active | — | — | — |
| pyarrow 13.0.0 | active | — | — | — |
| pandas 2.1.4 | active | — | — | — |
Root Cause
Parquet files written with very large row groups (for example, over 1 GB each) can exhaust memory when read by systems that load an entire row group at once, especially in memory-constrained settings such as Spark executors or pandas processes.
Official Documentation
https://parquet.apache.org/docs/file-format/
Workarounds
- 90% success: Rewrite the Parquet file with a smaller row group size (in pyarrow, `row_group_size` is a row count, not a byte count): `import pyarrow.parquet as pq; table = pq.read_table('large.parquet'); pq.write_table(table, 'small.parquet', row_group_size=100000)`. Note that `read_table` loads the whole file, so run this on a machine with enough memory.
- 85% success: Cap the row group size when writing from Spark and repartition the data before writing. The Parquet writer takes its row group size in bytes from the Hadoop property `parquet.block.size`, which can likely be passed as a writer option: `df.write.option('parquet.block.size', 256 * 1024 * 1024).parquet(path)`. The key `spark.sql.parquet.rowGroupSize` originally given here does not appear in Spark's documented configuration.
- 88% success: Read the Parquet file one row group at a time using pyarrow's `read_row_group` method: `parquet_file = pq.ParquetFile('large.parquet')`, then `for i in range(parquet_file.metadata.num_row_groups): table = parquet_file.read_row_group(i)`. Each call still materializes one full row group, so this helps only when individual row groups fit in memory.
Dead Ends
Common approaches that don't work:
- 65% fail: Increasing the heap size. This only postpones the problem; row groups have no hard upper bound and may still exceed the enlarged heap, causing another OOM.
- 80% fail: Changing the shuffle partition setting. This setting adjusts shuffle partitions, not Parquet row group sizes; it does not affect how row groups are read.
- 75% fail: Changing the compression codec. Compression reduces file size on disk, but row groups are decompressed in memory, so the memory footprint remains the same.