data data_error ai_generated partial

Parquet dictionary encoding collision — distinct string values map to same dictionary key

ID: data/parquet-dictionary-encoding-collision


70% Fix Rate

80% Confidence

1 Evidence

2024-01-10 First Seen

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
parquet-mr 1.12.3	active	—	—	—
parquet-cpp 1.5.1	active	—	—	—
pyarrow 12.0.0	active	—	—	—

Root Cause

In rare cases, a Parquet writer using dictionary encoding may map distinct string values to the same dictionary key. The affected values differ only by trailing whitespace or invisible Unicode characters, and the collision is attributed to hash collisions or normalization issues in the writer.
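
Before changing writer settings, it can help to check whether a column actually contains values of the kind described above. The following is a minimal sketch using only the Python standard library; the helper names `canonical` and `find_lookalike_groups` are hypothetical, and treating Unicode category `Cf` (format characters such as U+200B) as "invisible" is an assumption that may be too broad or too narrow for a given dataset.

```python
import unicodedata
from collections import defaultdict


def canonical(value: str) -> str:
    """NFKC-normalize, drop format characters (category Cf), trim whitespace."""
    normalized = unicodedata.normalize("NFKC", value)
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return visible.strip()


def find_lookalike_groups(values):
    """Return groups of distinct strings that share one canonical form."""
    groups = defaultdict(set)
    for v in values:
        if isinstance(v, str):  # skip None and other non-string entries
            groups[canonical(v)].add(v)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Illustrative data: a trailing space and a zero-width space (U+200B)
sample = ["alpha", "alpha ", "al\u200bpha", "beta", None]
suspects = find_lookalike_groups(sample)
for key, members in suspects.items():
    # unicode_escape makes the invisible characters visible in the output
    print(repr(key), [m.encode("unicode_escape").decode() for m in members])
```

Any group reported here is a candidate for the collision; an empty result suggests the column is not affected by this particular pattern.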

Official Documentation

https://parquet.apache.org/docs/file-format/data-pages/

Workarounds

85% success Normalize string values before writing. This trims whitespace and applies Unicode NFKC normalization, so values that differ only in those respects become identical before they reach the writer. Note that str.strip() removes leading as well as trailing whitespace (use str.rstrip() to trim the end only), and that the plain-Python variant must skip missing values or it will raise.
```
import unicodedata

# pandas string methods; missing values pass through unchanged
df['col'] = df['col'].str.strip().str.normalize('NFKC')

# Plain-Python form; non-string values (None, NaN) are left as-is
df['col'] = df['col'].apply(
    lambda x: unicodedata.normalize('NFKC', x.strip()) if isinstance(x, str) else x
)
```
75% success Increase the dictionary page size limit, or disable dictionary encoding for the affected column only. In pyarrow, use_dictionary accepts a list of the columns to dictionary-encode, so omit the problematic column from that list.
```
import pyarrow.parquet as pq

# Dictionary-encode col1 and col2 only; the problematic column is left out
pq.write_table(table, 'file.parquet', use_dictionary=['col1', 'col2'])

# Alternatively, raise the dictionary page size limit (bytes; value is illustrative)
pq.write_table(table, 'file.parquet', dictionary_pagesize_limit=8 * 1024 * 1024)
```
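
Because `use_dictionary` lists the columns to keep rather than the ones to exclude, wide tables make it easy to drop the wrong column by hand. A small sketch of deriving that list follows; `dictionary_columns` is a hypothetical helper, and the column names are placeholders rather than names from any real schema.

```python
def dictionary_columns(all_columns, excluded):
    """Return the columns that keep dictionary encoding, in their original order."""
    excluded = set(excluded)
    unknown = excluded - set(all_columns)
    if unknown:
        # Fail loudly on typos instead of silently encoding every column
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    return [name for name in all_columns if name not in excluded]


columns = ["col1", "col2", "problem_col"]  # e.g. table.column_names in pyarrow
keep = dictionary_columns(columns, ["problem_col"])
print(keep)

# The result would then be passed to the writer, for example:
# pq.write_table(table, "file.parquet", use_dictionary=keep)
```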

Dead Ends

Common approaches that don't work:

90% fail
The collision is deterministic given the same input data and encoding settings, so it will reproduce.
50% fail
Fixes the collision but increases file size and slows reads, potentially causing new issues with downstream systems that expect dictionary encoding.