# KeyError: 'tokenizer_vocab_size' not found in model config

- **ID:** `llm/tokenizer-vocab-mismatch`
- **Domain:** llm
- **Category:** config_error
- **Verification level:** ai_generated
- **Fix rate:** 90%

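## Root Cause

When fine-tuning or loading a model, the tokenizer configuration file is missing the 'tokenizer_vocab_size' key. This usually happens because a mismatched tokenizer was used or the Hugging Face model card is incomplete.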
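As a quick diagnostic, the sketch below lists every vocabulary-related key in a local `config.json`, which shows whether the size is stored under a different name. It assumes the model has been downloaded to a local directory; the helper name `find_vocab_keys` is hypothetical. Configs for Llama- and Mistral-style models usually store the size under `vocab_size`, so that is the key most likely to appear.

```python
import json
from pathlib import Path


def find_vocab_keys(model_dir):
    """Return the top-level config.json entries whose key mentions 'vocab'."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"no config.json found in {model_dir}")
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    # Only top-level keys are inspected; nested sections are ignored here.
    return {key: value for key, value in config.items() if "vocab" in key.lower()}
```

An empty result suggests the directory holds an incomplete or non-standard config rather than a renamed key.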

## Version Compatibility

| Version or model | Status | Introduced | Deprecated |
|------|------|------|------|
| transformers==4.35.0 | active | — | — |
| transformers==4.38.0 | active | — | — |
| llama-2-7b-hf | active | — | — |
| mistral-7b-v0.1 | active | — | — |
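To see which row of the table applies to a given environment, the installed library version can be read with the standard library alone. This is a minimal sketch; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package="transformers"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_version()` in the environment that raised the error returns a string such as `4.38.0` that can be compared against the table.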

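## Solutions

1. Load the tokenizer separately and set the config manually:
   ```python
   from transformers import AutoTokenizer

   # 'model_name' is a placeholder; `model` is assumed to be loaded already
   tokenizer = AutoTokenizer.from_pretrained('model_name')
   model.config.vocab_size = len(tokenizer)
   ```
2. Use a different model variant that ships its tokenizer configuration (for example, prefer the '-hf' variants on Hugging Face).
3. Download the complete model directory from Hugging Face, including the tokenizer files, instead of using a partial or cached copy.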
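The first fix can be sketched as a small helper. Note that assigning `config.vocab_size` by itself does not change the shape of the embedding matrix, so this sketch uses `resize_token_embeddings`, a method on Transformers `PreTrainedModel` that resizes the embeddings and updates `config.vocab_size` together. The helper name `sync_vocab_size` is hypothetical, and growing the table only when the tokenizer is larger is a design assumption, not something the original entry prescribes.

```python
def sync_vocab_size(model, tokenizer):
    """Grow the model's input embeddings to cover the tokenizer's vocabulary.

    Returns the number of embedding rows after the check.
    """
    needed = len(tokenizer)
    rows = model.get_input_embeddings().weight.shape[0]
    # An embedding table larger than the tokenizer is harmless; a smaller one
    # causes index errors, so only the "too small" case is resized.
    if needed > rows:
        model.resize_token_embeddings(needed)
        rows = model.get_input_embeddings().weight.shape[0]
    return rows


# Hypothetical usage ('model_name' is a placeholder, as in the entry above):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained('model_name')
#   model = AutoModelForCausalLM.from_pretrained('model_name')
#   sync_vocab_size(model, tokenizer)
```

Newly added embedding rows are freshly initialized, so they only become meaningful after fine-tuning.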

## Failed Attempts

- The `tokenizer_vocab_size` value must match the actual tokenizer vocabulary size; an arbitrary value will cause embedding dimension mismatches or runtime errors. (90% failure rate)
- The error is a configuration issue with the model, not a library installation problem. (95% failure rate)