OutOfMemoryError data resource_error ai_generated true

Parquet row group size mismatch causes memory error during read

ID: data/parquet-row-group-size-mismatch

Fix rate: 86%
Confidence: 84%
Evidence count: 1
First seen: 2024-02-28

Version compatibility

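| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Spark 3.4 | active | | | |
| Apache Spark 3.5 | active | | | |
| pyarrow 13.0.0 | active | | | |
| pandas 2.1.4 | active | | | |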

Root cause analysis

Parquet files written with very large row groups (for example, over 1 GB each) can exhaust memory when read by systems that load an entire row group into memory at once, especially in memory-constrained environments such as Spark executors or pandas processes.

generic

Official documentation

https://parquet.apache.org/docs/file-format/

Solutions

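  1. Rewrite the Parquet file with a smaller row group size. In pyarrow, `row_group_size` is a maximum number of rows per row group, not a byte count: `import pyarrow.parquet as pq; table = pq.read_table('large.parquet'); pq.write_table(table, 'small.parquet', row_group_size=100000)`. Note that `pq.read_table` loads the whole file into memory, so run this on a machine with enough RAM or stream the file in batches instead.
  2. In Spark, set the Parquet writer's row group size before rewriting the data. The writer reads it from the Hadoop option `parquet.block.size` (in bytes): `spark.conf.set('parquet.block.size', 256 * 1024 * 1024)`. Then repartition the data and write it out again.
  3. Read the Parquet file one row group at a time with pyarrow's `read_row_group` method. Open the file with `parquet_file = pq.ParquetFile('large.parquet')`, then loop: `for i in range(parquet_file.metadata.num_row_groups): table = parquet_file.read_row_group(i)`. This still loads each row group whole, so it only helps when a single row group fits in memory; passing `columns=[...]` reduces the footprint further.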

Ineffective attempts

Common but ineffective approaches:

  1. 65% failure rate

    Increasing the heap only postpones the problem; row groups can grow without bound, so a large enough row group can still exceed the larger heap and cause another OOM.

  2. 80% failure rate

    Tuning shuffle partitions does not help: that setting adjusts shuffle partitions, not Parquet row group sizes, and it does not affect how row groups are read.

  3. 75% failure rate

    Changing compression does not help: compression reduces file size on disk, but row groups are decompressed in memory, so the memory footprint remains the same.