OutOfMemoryError data resource_error ai_generated true

Parquet row group size mismatch causes memory error during read

ID: data/parquet-row-group-size-mismatch

Fix rate: 86%

Confidence: 84%

Evidence count: 1

First seen: 2024-02-28

Version compatibility

Version	Status	Introduced	Deprecated	Notes
Apache Spark 3.4	active	—	—	—
Apache Spark 3.5	active	—	—	—
pyarrow 13.0.0	active	—	—	—
pandas 2.1.4	active	—	—	—

Root cause analysis

Parquet files written with very large row groups (for example, more than 1 GB each) can exhaust memory when read by systems that load an entire row group at once. The problem is most visible in memory-constrained environments such as Spark executors or pandas.

generic

Official documentation

https://parquet.apache.org/docs/file-format/

Solutions

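Rewrite the Parquet file with a smaller row group size. In pyarrow, `row_group_size` is a number of rows, not a byte size: `import pyarrow.parquet as pq; table = pq.read_table('large.parquet'); pq.write_table(table, 'small.parquet', row_group_size=100000)`. Note that `pq.read_table` loads the whole file into memory, so run the rewrite on a machine with enough memory.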

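Set the Parquet row group size (in bytes) when writing from Spark, and repartition the data before writing. The Parquet writer property that controls row group size is `parquet.block.size` (default 128 MB); `spark.sql.parquet.rowGroupSize` does not appear among Spark's documented settings. One way to pass the property is as a write option, where `df` and the output path are placeholders: `df.write.option('parquet.block.size', 256 * 1024 * 1024).parquet('output_path')`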

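Read the Parquet file one row group at a time using pyarrow's `read_row_group` method. Open the file with `parquet_file = pq.ParquetFile('large.parquet')`, then iterate: `for i in range(parquet_file.metadata.num_row_groups): table = parquet_file.read_row_group(i)`. Each call still loads one full row group, so this helps only when individual row groups fit in memory.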

Ineffective attempts

Common but ineffective approaches:

Failure rate: 65%
Increasing the heap only postpones the problem; row groups can grow without bound and may still exceed the larger heap, causing OOM again.
Failure rate: 80%
Changing the shuffle partition setting does not help: it adjusts shuffle partitions, not Parquet row group sizes, and does not affect how row groups are read.
Failure rate: 75%
Compression reduces file size on disk, but row groups are decompressed in memory, so the memory footprint remains the same.