ParquetBloomFilterHashMismatch data data_error ai_generated partial

Parquet bloom filter hash mismatch: unexpected hash algorithm ID 0

ID: data/parquet-bloom-filter-corruption

Fix rate: 78%

Confidence: 85%

Evidence count: 1

First seen: 2024-03-15

Version compatibility

Version	Status	Introduced	Deprecated	Notes
Parquet 2.10+	active	—	—	—
Apache Arrow 12.0+	active	—	—	—
PyArrow 14.0+	active	—	—	—
Spark 3.5+	active	—	—	—

Root cause analysis

A Parquet file written by an older version of a library uses an unsupported or unregistered bloom filter hash algorithm. Newer readers strictly validate the algorithm ID, so they fail when they read the file.
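
The strict validation described above can be pictured with a minimal sketch. Everything in it is hypothetical: the registry `KNOWN_HASH_ALGORITHMS`, its ID-to-name mapping, the exception class, and the function name are illustrative assumptions, not the internals of any real reader.

```python
# Hypothetical registry of hash algorithm IDs a reader accepts.
# The mapping is an illustrative assumption, not a real reader's table.
KNOWN_HASH_ALGORITHMS = {1: "XXHASH"}


class UnsupportedBloomFilterHash(ValueError):
    """Raised when a bloom filter header names an unregistered hash algorithm."""


def validate_hash_algorithm(algorithm_id: int) -> str:
    """Return the algorithm name, or raise if the ID is not registered."""
    try:
        return KNOWN_HASH_ALGORITHMS[algorithm_id]
    except KeyError:
        raise UnsupportedBloomFilterHash(
            f"unexpected hash algorithm ID {algorithm_id}"
        ) from None
```

Because the ID is stored in the file itself, a check of this kind fails the same way on every read until the file is rewritten.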

generic

Official documentation

https://parquet.apache.org/docs/file-format/bloom-filter/

Solutions

Rewrite the Parquet file with a newer version of the writer library, one that registers algorithm ID 0 as a known algorithm. For PyArrow: `import pyarrow.parquet as pq; table = pq.read_table('file.parquet'); pq.write_table(table, 'file_fixed.parquet')`

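Use an older reader that does not validate the bloom filter algorithm ID. For Spark: downgrade to Spark 3.4 or earlier, then read and rewrite the file.

Strip the bloom filters from the Parquet file. First use parquet-tools to check which columns carry bloom filter metadata: `java -jar parquet-tools-1.12.3.jar meta file.parquet | grep bloom`. Then remove the bloom filter metadata pages with a custom script.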

Ineffective attempts

Common but ineffective approaches:

95% failure
The hash algorithm ID is an on-disk property of the Parquet file; updating the reader library does not change the file's content.
90% failure
If the original writer is still the old library, the regenerated file will have the same unsupported algorithm ID.
75% failure
This flag only disables bloom filter creation, not reading. The reader still attempts to parse existing bloom filters.