data data_error ai_generated partial

Parquet dictionary encoding collision — distinct string values map to same dictionary key

ID: data/parquet-dictionary-encoding-collision

Fix Rate: 70%
Confidence: 80%
Evidence: 1
First Seen: 2024-01-10

Version Compatibility

Version              Status    Introduced    Deprecated    Notes
parquet-mr 1.12.3    active
parquet-cpp 1.5.1    active
pyarrow 12.0.0       active

Root Cause

In rare cases, Parquet writers using dictionary encoding may produce collisions when string values differ only by trailing whitespace or invisible Unicode characters, due to hash collisions or normalization issues.
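To check whether a column actually contains values of this kind before writing, a rough pre-flight scan can group strings that become identical once normalized. The sketch below uses only the Python standard library; the helper names `canonical` and `find_lookalike_groups` are hypothetical, and treating Unicode category `Cf` (format characters such as the zero-width space) as "invisible" is an assumption, not something the Parquet format defines.

```python
import unicodedata
from collections import defaultdict


def canonical(value: str) -> str:
    """Return a comparison key: NFKC form, format characters dropped, whitespace stripped."""
    normalized = unicodedata.normalize("NFKC", value)
    # Category "Cf" covers invisible format characters such as U+200B (assumption:
    # this is what "invisible Unicode characters" means for the data at hand).
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return visible.strip()


def find_lookalike_groups(values):
    """Map each comparison key to the distinct raw strings that share it (two or more)."""
    groups = defaultdict(set)
    for value in values:
        groups[canonical(value)].add(value)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


# Illustrative data: trailing space, zero-width space, and composed vs. decomposed "é".
sample = ["alpha", "alpha ", "al\u200bpha", "beta", "caf\u00e9", "cafe\u0301"]
lookalikes = find_lookalike_groups(sample)

for key, raw_values in lookalikes.items():
    print(repr(key), "<-", [repr(v) for v in raw_values])
```

A non-empty result does not prove that a writer will produce a collision; it only flags the values that the root cause above describes as risky.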



Official Documentation

https://parquet.apache.org/docs/file-format/data-pages/

Workarounds

  1. (85% success) Normalize string values before writing: df['col'] = df['col'].str.strip().str.normalize('NFKC'). This strips leading and trailing whitespace and applies Unicode NFKC normalization. An equivalent form without the .str accessor: import unicodedata; df['col'] = df['col'].apply(lambda x: unicodedata.normalize('NFKC', x.strip()))
  2. (75% success) Increase the dictionary page size, or disable dictionary encoding for the affected column only: pyarrow.parquet.write_table(table, 'file.parquet', use_dictionary=['col1', 'col2']). The list names only the columns that should keep dictionary encoding, so the problematic column is omitted from it.
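The two workarounds above can be combined roughly as follows. This is a minimal sketch, not a definitive fix: the helper names (`normalize_value`, `dictionary_columns`, `write_with_selective_dictionary`) and the `problem_columns` argument are hypothetical, and the code assumes the affected columns hold Python strings or None. Only `unicodedata.normalize`, `pyarrow.Table.from_pydict`, and the `use_dictionary` parameter of `pyarrow.parquet.write_table` are real library calls.

```python
import unicodedata


def normalize_value(value):
    """Strip surrounding whitespace and apply NFKC; pass None (null) through unchanged."""
    if value is None:
        return None
    return unicodedata.normalize("NFKC", value.strip())


def dictionary_columns(all_columns, problem_columns):
    """Return the columns that should keep dictionary encoding, preserving order."""
    excluded = set(problem_columns)
    return [name for name in all_columns if name not in excluded]


def write_with_selective_dictionary(columns, path, problem_columns):
    """Normalize the problem columns, then write with dictionary encoding disabled for them.

    `columns` is a dict of column name -> list of values (hypothetical input shape).
    """
    # Imported here so the pure-Python helpers above work without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    cleaned = {
        name: [normalize_value(v) for v in values] if name in problem_columns else values
        for name, values in columns.items()
    }
    table = pa.Table.from_pydict(cleaned)
    # Passing a list enables dictionary encoding only for the listed columns.
    pq.write_table(
        table,
        path,
        use_dictionary=dictionary_columns(table.column_names, problem_columns),
    )
```

A call might look like `write_with_selective_dictionary({"id": [1, 2], "label": ["a ", "a"]}, "file.parquet", ["label"])`. Note that NFKC also folds compatibility characters (for example, the ligature "ﬁ" becomes "fi"), so normalization changes the stored values and is only appropriate when that is acceptable for the data.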


Dead Ends

Common approaches that don't work:

  1. 90% fail

    The collision is deterministic given the same input data and encoding settings, so it will reproduce.

  2. 50% fail

    Fixes the collision but increases file size and can degrade read performance, potentially causing new issues for downstream systems that expect dictionary encoding.