Parquet row group size mismatch causes memory error during read
ID: data/parquet-row-group-size-mismatch
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Apache Spark 3.4 | active | — | — | — |
| Apache Spark 3.5 | active | — | — | — |
| pyarrow 13.0.0 | active | — | — | — |
| pandas 2.1.4 | active | — | — | — |
Root cause analysis
Parquet files written with very large row groups (for example, larger than 1 GB) can exhaust memory when read by systems that load an entire row group into memory at once, especially in memory-constrained environments such as Spark executors or pandas processes.
Official documentation
https://parquet.apache.org/docs/file-format/
Solutions
- Rewrite the Parquet file with a smaller row group size (`row_group_size` is a row count in pyarrow, and `read_table` loads the whole file into memory): `import pyarrow.parquet as pq; table = pq.read_table('large.parquet'); pq.write_table(table, 'small.parquet', row_group_size=100000)`
- When writing from Spark, bound the row group size with the Parquet writer property `parquet.block.size` (row group size in bytes): `spark.conf.set('parquet.block.size', 256 * 1024 * 1024)`, and repartition the data before writing
- Read the Parquet file one row group at a time using pyarrow's `read_row_group` method: `parquet_file = pq.ParquetFile('large.parquet')`, then `for i in range(parquet_file.metadata.num_row_groups): table = parquet_file.read_row_group(i)`
Failed attempts
Common but ineffective approaches:
- 65% failure rate: This only postpones the problem; row groups can grow unboundedly and may still exceed the increased heap, causing OOM again.
- 80% failure rate: This setting adjusts shuffle partitions, not Parquet row group sizes; it does not affect how row groups are read.
- 75% failure rate: Compression reduces file size on disk, but row groups are decompressed in memory; the memory footprint remains the same.