data
data_error
ai_generated
partial
Parquet dictionary encoding collision: distinct string values map to the same dictionary key
ID: data/parquet-dictionary-encoding-collision
Fix rate: 70%
Confidence: 80%
Evidence count: 1
First seen: 2024-01-10
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| parquet-mr 1.12.3 | active | — | — | — |
| parquet-cpp 1.5.1 | active | — | — | — |
| pyarrow 12.0.0 | active | — | — | — |
Root cause analysis
In rare cases, a Parquet writer that uses dictionary encoding may assign the same dictionary key to string values that differ only in trailing whitespace or invisible Unicode characters. The suspected causes are hash collisions or normalization issues in the writer.
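To check whether a column contains the kind of values described above, the raw strings can be grouped by a canonical comparison key. The following is a minimal sketch using only the Python standard library. The helper names `canonical_key` and `find_near_duplicates` are hypothetical, and treating every Unicode format character (category `Cf`) as "invisible" is an assumption, not something this entry specifies.

```python
import unicodedata
from collections import defaultdict


def canonical_key(value: str) -> str:
    """Collapse a string to a comparison key (hypothetical helper).

    NFKC folds compatibility forms, format characters (category Cf, such as
    U+200B ZERO WIDTH SPACE) are dropped, and trailing whitespace is removed.
    """
    folded = unicodedata.normalize("NFKC", value)
    visible = "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
    return visible.rstrip()


def find_near_duplicates(values):
    """Return {key: sorted raw variants} for keys with more than one raw form."""
    groups = defaultdict(set)
    for value in values:
        if value is not None:  # skip missing values
            groups[canonical_key(value)].add(value)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


if __name__ == "__main__":
    sample = ["abc", "abc ", "abc\u200b", "xyz", None]
    for key, variants in find_near_duplicates(sample).items():
        print(repr(key), [repr(v) for v in variants])
```

Any key that comes back with more than one variant marks values worth inspecting before the column is written with dictionary encoding.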
Official documentation
https://parquet.apache.org/docs/file-format/data-pages/
Solutions
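- Normalize string values before writing: df['col'] = df['col'].str.strip().str.normalize('NFKC'). This removes leading and trailing whitespace and applies Unicode NFKC normalization. Note that NFKC folds compatibility forms (for example, a no-break space becomes a regular space) but does not remove zero-width characters such as U+200B. Equivalent example without the pandas string accessor, assuming the column has no missing values: import unicodedata; df['col'] = df['col'].apply(lambda x: unicodedata.normalize('NFKC', x.strip()))
- Increase the dictionary page size, or disable dictionary encoding for the affected column only: pyarrow.parquet.write_table(table, 'file.parquet', use_dictionary=['col1', 'col2']). List only the columns that should keep dictionary encoding and omit the problematic one.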
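The two solutions above can be sketched as small standard-library helpers. This is an illustration under stated assumptions, not a definitive implementation: `normalize_value` and `dictionary_columns` are hypothetical names, non-string values (including missing values) are passed through unchanged, and only trailing whitespace is stripped.

```python
import unicodedata


def normalize_value(value):
    """NFKC-normalize a string and drop trailing whitespace.

    Non-string values (None, NaN, numbers) are returned unchanged so the
    helper can be applied to a column that contains missing values.
    """
    if not isinstance(value, str):
        return value
    # Normalize first so compatibility spaces become plain spaces,
    # then strip whatever whitespace is left at the end.
    return unicodedata.normalize("NFKC", value).rstrip()


def dictionary_columns(all_columns, excluded):
    """Return the columns that should keep dictionary encoding.

    Hypothetical helper: the result is meant to be passed as the
    use_dictionary argument of pyarrow.parquet.write_table.
    """
    excluded = set(excluded)
    return [name for name in all_columns if name not in excluded]
```

With pandas, the first helper could be applied as df['col'] = df['col'].map(normalize_value). With pyarrow, the second could be used as pyarrow.parquet.write_table(table, 'file.parquet', use_dictionary=dictionary_columns(table.column_names, ['col3'])), where 'col3' stands in for the affected column.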
Ineffective attempts
Common but ineffective approaches:
-
90% failure rate
The collision is deterministic given the same input data and encoding settings, so it will reproduce.
-
50% failure rate
Fixes the collision but increases file size and can slow reads, potentially causing new issues with downstream systems that expect dictionary encoding.