# Parquet dictionary encoding collision — distinct string values map to same dictionary key

- **ID:** `data/parquet-dictionary-encoding-collision`
- **Domain:** data
- **Category:** data_error
- **Verification:** ai_generated
- **Fix Rate:** 70%

## Root Cause

In rare cases, Parquet writers using dictionary encoding may produce collisions when string values differ only by trailing whitespace or invisible Unicode characters, due to hash collisions or normalization issues.

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| parquet-mr 1.12.3 | active | — | — |
| parquet-cpp 1.5.1 | active | — | — |
| pyarrow 12.0.0 | active | — | — |

## Workarounds

1. **Normalize string values before writing: df['col'] = df['col'].str.strip().str.normalize('NFKC'). This removes trailing whitespace and normalizes Unicode. Example: import unicodedata; df['col'] = df['col'].apply(lambda x: unicodedata.normalize('NFKC', x.strip()))** (85% success)
   ```
   Normalize string values before writing: df['col'] = df['col'].str.strip().str.normalize('NFKC'). This removes trailing whitespace and normalizes Unicode. Example: import unicodedata; df['col'] = df['col'].apply(lambda x: unicodedata.normalize('NFKC', x.strip()))
   ```
2. **Increase dictionary page size or disable dictionary for the affected column only: pyarrow.parquet.write_table(table, 'file.parquet', use_dictionary=['col1', 'col2']) — omit the problematic column from dictionary encoding.** (75% success)
   ```
   Increase dictionary page size or disable dictionary for the affected column only: pyarrow.parquet.write_table(table, 'file.parquet', use_dictionary=['col1', 'col2']) — omit the problematic column from dictionary encoding.
   ```

## Dead Ends

- **** — The collision is deterministic given the same input data and encoding settings, so it will reproduce. (90% fail)
- **** — Fixes collision but increases file size and read performance, potentially causing new issues with downstream systems expecting dictionary encoding. (50% fail)
