data data_error ai_generated partial

Parquet dictionary encoding collision: distinct string values map to the same dictionary key

ID: data/parquet-dictionary-encoding-collision


Fix rate: 70%

Confidence: 80%

Evidence count: 1

First seen: 2024-01-10

Version compatibility

Version	Status	Introduced	Deprecated	Notes
parquet-mr 1.12.3	active	—	—	—
parquet-cpp 1.5.1	active	—	—	—
pyarrow 12.0.0	active	—	—	—

Root cause analysis

In rare cases, a Parquet writer that uses dictionary encoding may map distinct string values to the same dictionary key when the values differ only by trailing whitespace or invisible Unicode characters. The reported causes are hash collisions or normalization issues in the writer.
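Before changing the writer, it can help to check whether a column actually contains values of this kind. The following is a minimal sketch using only the standard library. The helper names `canonical_key` and `find_near_duplicates` are hypothetical. The definition of "differs only invisibly" is also an assumption made for illustration: NFKC normalization, removal of format characters (Unicode category Cf, which includes the zero-width space U+200B), and whitespace stripping.

```python
import unicodedata
from collections import defaultdict


def canonical_key(value: str) -> str:
    """Reduce a string to a comparison key (assumed definition, see lead-in)."""
    normalized = unicodedata.normalize("NFKC", value)
    # Drop format characters such as U+200B, which NFKC leaves in place.
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return visible.strip()


def find_near_duplicates(values):
    """Group distinct strings that share the same canonical key."""
    groups = defaultdict(set)
    for value in values:
        if isinstance(value, str):  # skip None, NaN and other non-strings
            groups[canonical_key(value)].add(value)
    # Keep only keys reached by more than one distinct original value.
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


sample = ["apple", "apple ", "app\u200ble", "banana"]
suspects = find_near_duplicates(sample)
for key, variants in suspects.items():
    print(repr(key), [repr(v) for v in variants])
```

Any key reported here marks a set of values worth inspecting after a write and read round trip.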

generic

Official documentation

https://parquet.apache.org/docs/file-format/data-pages/

Solutions

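Normalize string values before writing: df['col'] = df['col'].str.strip().str.normalize('NFKC'). This strips leading and trailing whitespace and applies Unicode NFKC normalization, so values that differ only in those respects become identical before they reach the writer. An equivalent form using the standard library: import unicodedata; df['col'] = df['col'].apply(lambda x: unicodedata.normalize('NFKC', x.strip())). The lambda form raises an error on missing values (None or NaN), so fill or skip nulls first.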
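A null-safe version of this normalization might look like the sketch below. The function name `normalize_value` is hypothetical, and passing non-string values through unchanged is an assumption about how missing data should be treated.

```python
import unicodedata


def normalize_value(value):
    """Strip surrounding whitespace and apply NFKC; leave non-strings untouched."""
    if not isinstance(value, str):
        # None and float NaN are returned as-is so missing values stay missing.
        return value
    return unicodedata.normalize("NFKC", value.strip())


raw = ["abc  ", "cafe\u0301", "caf\u00e9", None]
cleaned = [normalize_value(v) for v in raw]
print(cleaned)
```

With pandas, the same function can be applied per column, for example `df['col'] = df['col'].map(normalize_value)`. Note that this deliberately merges values that were distinct in the source data, so it is only appropriate when the whitespace or encoding differences carry no meaning.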

Increase the dictionary page size, or disable dictionary encoding for the affected column only. In pyarrow, use_dictionary accepts a list of the column names to dictionary-encode: pyarrow.parquet.write_table(table, 'file.parquet', use_dictionary=['col1', 'col2']). List only the unaffected columns and omit the problematic one.
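Building the column list by hand is error-prone on wide tables, so one option is to derive it from the table's schema. This is a sketch under stated assumptions: the helper names `dictionary_columns` and `write_without_dictionary` are hypothetical, and pyarrow is imported inside the writer function so the list-building logic can be used without it.

```python
def dictionary_columns(all_columns, excluded):
    """Return the columns that should keep dictionary encoding, in original order."""
    excluded = set(excluded)
    return [name for name in all_columns if name not in excluded]


def write_without_dictionary(table, path, excluded):
    """Write a pyarrow Table, dictionary-encoding every column except `excluded`."""
    import pyarrow.parquet as pq  # imported here so the helper above has no dependency

    pq.write_table(
        table,
        path,
        # A list enables dictionary encoding only for the named columns.
        use_dictionary=dictionary_columns(table.column_names, excluded),
    )


print(dictionary_columns(["id", "name", "city"], ["name"]))
```

A call such as `write_without_dictionary(table, 'file.parquet', excluded=['name'])` would then write `name` without dictionary encoding and leave the other columns dictionary-encoded.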

Ineffective attempts

Common but ineffective approaches:

Failure rate: 90%
The collision is deterministic given the same input data and encoding settings, so it will reproduce.
Failure rate: 50%
Fixes the collision but increases file size and slows reads, potentially causing new issues with downstream systems that expect dictionary encoding.