# Parquet Dictionary Encoding Collision: Different String Values Mapped to the Same Dictionary Key

- **ID:** `data/parquet-dictionary-encoding-collision`
- **Domain:** data
- **Category:** data_error
- **Verification level:** ai_generated
- **Fix rate:** 70%

## Root Cause

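In rare cases, a Parquet writer that uses dictionary encoding may, because of a hash collision or a normalization issue, map string values that differ only in trailing whitespace or invisible Unicode characters to the same dictionary key.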
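To make the failure mode concrete, the following minimal sketch uses only the Python standard library to find values that differ solely by surrounding whitespace or by Unicode representation. The helper name `find_near_duplicates` and the sample values are hypothetical. NFKC followed by `strip()` is assumed as the comparison key because that is the normalization this entry recommends.

```python
import unicodedata
from collections import defaultdict


def find_near_duplicates(values):
    """Group distinct strings that become identical after NFKC + strip()."""
    groups = defaultdict(set)
    for value in values:
        key = unicodedata.normalize("NFKC", value).strip()
        groups[key].add(value)
    # Keep only keys reached by more than one distinct raw value.
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


# Hypothetical sample column: "abc", "abc" with a trailing space,
# and "abc" followed by a no-break space (U+00A0).
sample = ["abc", "abc ", "abc\u00a0", "xyz"]
suspects = find_near_duplicates(sample)
for key, raw_values in suspects.items():
    print(repr(key), [repr(v) for v in raw_values])
```

Running a check like this on a column before writing shows which raw values are candidates for a collision. Note that NFKC does not remove zero-width characters such as U+200B, so those would need separate handling.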

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| parquet-mr 1.12.3 | active | — | — |
| parquet-cpp 1.5.1 | active | — | — |
| pyarrow 12.0.0 | active | — | — |

## Solutions

1. Normalize string values before writing: `df['col'] = df['col'].str.strip().str.normalize('NFKC')`. This strips leading and trailing whitespace and applies Unicode NFKC normalization. An equivalent form using the standard library, assuming the column has no missing values: `import unicodedata; df['col'] = df['col'].apply(lambda x: unicodedata.normalize('NFKC', x.strip()))`.
2. Increase the dictionary page size, or disable dictionary encoding for the affected column only: `pyarrow.parquet.write_table(table, 'file.parquet', use_dictionary=['col1', 'col2'])`. List every column except the problematic one so that it is excluded from dictionary encoding.
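The two solutions above can be combined, as in the following sketch. It assumes the input is a plain dict mapping column names to lists of strings. The helper names `normalize_value`, `dictionary_columns`, and `write_table_safely` are hypothetical; only `pyarrow.table` and `pyarrow.parquet.write_table` with its `use_dictionary` parameter are real pyarrow calls. Normalization changes the stored values, so it is only appropriate when surrounding whitespace and compatibility forms carry no meaning in the data.

```python
import unicodedata


def normalize_value(value):
    """Apply NFKC normalization and strip surrounding whitespace; pass None through."""
    if value is None:
        return None
    return unicodedata.normalize("NFKC", value).strip()


def dictionary_columns(all_columns, problem_columns):
    """Return the columns that should keep dictionary encoding."""
    excluded = set(problem_columns)
    return [name for name in all_columns if name not in excluded]


def write_table_safely(columns, path, problem_columns=()):
    """Write a dict of column name -> list of strings to a Parquet file.

    pyarrow is imported lazily so the helpers above work without it.
    """
    import pyarrow as pa
    import pyarrow.parquet as pq

    cleaned = {
        name: [normalize_value(v) for v in values]
        for name, values in columns.items()
    }
    table = pa.table(cleaned)
    pq.write_table(
        table,
        path,
        # Only the listed columns are dictionary encoded; the rest use plain encoding.
        use_dictionary=dictionary_columns(table.column_names, problem_columns),
    )
```

A call such as `write_table_safely({"col1": ["a ", "a"], "col2": ["x", "y"]}, "file.parquet", problem_columns=["col1"])` would write `col2` with dictionary encoding and `col1` without it.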

## Failed Attempts

- ****: The collision is deterministic given the same input data and encoding settings, so it will reproduce. (90% failure rate)
- ****: Fixes the collision but increases file size and slows reads, potentially causing new issues with downstream systems that expect dictionary encoding. (50% failure rate)
