ParquetBloomFilterHashMismatch
data
data_error
ai_generated
partial
Parquet bloom filter hash mismatch: unexpected hash algorithm ID 0
ID: data/parquet-bloom-filter-corruption
Fix Rate: 78%
Confidence: 85%
Evidence: 1
First Seen: 2024-03-15
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Parquet 2.10+ | active | — | — | — |
| Apache Arrow 12.0+ | active | — | — | — |
| PyArrow 14.0+ | active | — | — | — |
| Spark 3.5+ | active | — | — | — |
Root Cause
The Parquet file was written by an older version of a library and records a bloom filter hash algorithm that the reader does not support or has not registered (here, algorithm ID 0). Newer readers that strictly validate algorithm IDs reject the file instead of skipping the bloom filter.
Official Documentation
https://parquet.apache.org/docs/file-format/bloom-filter/
Workarounds
- 85% success: Rewrite the Parquet file using a newer version of the writer library that registers algorithm ID 0 as a known algorithm. For PyArrow: `import pyarrow.parquet as pq; table = pq.read_table('file.parquet'); pq.write_table(table, 'file_fixed.parquet')`
- 70% success: Use an older reader that does not validate bloom filter algorithm IDs. For Spark: downgrade to Spark 3.4 or earlier, read the file, and rewrite it so that newer readers can open the result.
- 65% success: Strip the bloom filters from the Parquet file. First locate them with parquet-tools: `java -jar parquet-tools-1.12.3.jar meta file.parquet | grep bloom`. This command only lists bloom-filter-related metadata; removing the bloom filter metadata pages still requires a custom script.
Dead Ends
Common approaches that don't work:
- 95% fail: The hash algorithm ID is an on-disk property of the Parquet file; updating the reader library does not change the file's content.
- 90% fail: If the original writer is still the old library, the regenerated file will have the same unsupported algorithm ID.
- 75% fail: This flag only disables bloom filter creation, not reading. The reader still attempts to parse existing bloom filters.