OutOfMemoryError data resource_error ai_generated true

Parquet row group size mismatch causes memory error during read

ID: data/parquet-row-group-size-mismatch

Fix Rate: 86%
Confidence: 84%
Evidence: 1
First Seen: 2024-02-28

Version Compatibility

Version             Status   Introduced   Deprecated   Notes
Apache Spark 3.4    active
Apache Spark 3.5    active
pyarrow 13.0.0      active
pandas 2.1.4        active

Root Cause

Parquet files written with very large row group sizes (e.g., >1GB) cause memory exhaustion when read by systems that load entire row groups into memory, especially in memory-constrained environments like Spark executors or pandas.

Official Documentation

https://parquet.apache.org/docs/file-format/

Workarounds

  1. 90% success Rewrite the Parquet file with a smaller row group size. In pyarrow, `row_group_size` is a number of rows, not bytes: `import pyarrow.parquet as pq; table = pq.read_table('large.parquet'); pq.write_table(table, 'small.parquet', row_group_size=100000)`. Because `pq.read_table` loads the whole file, run this where enough memory is available to hold it.
  2. 85% success In Spark, cap the row group size when rewriting the data. Row group size is governed by the Parquet writer property `parquet.block.size` (in bytes) rather than by a `spark.sql.*` option: `spark.sparkContext._jsc.hadoopConfiguration().set('parquet.block.size', str(256 * 1024 * 1024))`, then repartition the data before writing.
  3. 88% success Read the Parquet file one row group at a time with pyarrow's `read_row_group` method. Open the file with `parquet_file = pq.ParquetFile('large.parquet')`, then iterate: `for i in range(parquet_file.metadata.num_row_groups): table = parquet_file.read_row_group(i)`. Each iteration still materializes one full row group, so this helps only when a single row group fits in memory; passing `columns=[...]` to `read_row_group` reduces the footprint further.

Dead Ends

Common approaches that don't work:

  1. 65% fail

    Increasing the heap only postpones the problem; row groups have no fixed upper bound and may still exceed the increased heap, causing OOM again.

  2. 80% fail

    Changing the shuffle partition setting adjusts shuffle partitions, not Parquet row group sizes; it does not affect how row groups are read.

  3. 75% fail

    Applying compression reduces file size on disk, but row groups are decompressed in memory, so the memory footprint at read time remains the same.