llm config_error ai_generated true

KeyError: 'tokenizer_vocab_size' not found in model config

ID: llm/tokenizer-vocab-mismatch

Fix rate: 90%

Confidence: 87%

Evidence count: 1

First seen: 2023-09-01

Version compatibility

Version	Status	Introduced	Deprecated	Notes
transformers==4.35.0	active	—	—	—
transformers==4.38.0	active	—	—	—
llama-2-7b-hf	active	—	—	—
mistral-7b-v0.1	active	—	—	—

Root cause analysis

When fine-tuning or loading a model, the tokenizer configuration file is missing the 'tokenizer_vocab_size' key. This usually happens because the tokenizer does not match the model, or because the model repository obtained from Hugging Face is incomplete.
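
One way to check this diagnosis is to list which vocabulary-related keys a config file actually contains. The sketch below is a minimal, hypothetical helper: the name `find_vocab_keys` is illustrative and not part of any library, and it assumes the config is a JSON file with a top-level object, as `config.json` and `tokenizer_config.json` usually are.

```python
import json


def find_vocab_keys(config_path):
    """Return the top-level entries of a JSON config whose key mentions 'vocab'.

    Hypothetical helper for inspection only; it does not modify the file.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Keep only keys such as 'vocab_size' so a missing or misnamed key is easy to spot.
    return {key: value for key, value in config.items() if "vocab" in key.lower()}
```

Calling `find_vocab_keys` on the model's `config.json` and on its `tokenizer_config.json` shows whether the expected key is absent or simply stored under a different name.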

generic

Official documentation

https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM

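Solutions

Load the tokenizer separately and set the vocabulary size on the config manually, where `model` is the already loaded model: `from transformers import AutoTokenizer; tokenizer = AutoTokenizer.from_pretrained('model_name'); model.config.vocab_size = len(tokenizer)`

Use a different model variant that includes the tokenizer configuration (for example, prefer the '-hf' variants on Hugging Face).

Download the complete model directory from Hugging Face (including the tokenizer files) rather than using a partial or cached version.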
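
For the last solution, it can help to verify that a local model directory is complete before loading it, and to re-download it if not. The following is a rough sketch under stated assumptions: the file names in `REQUIRED_FILES` and `VOCAB_FILES` are typical for Hugging Face repositories but vary by model, and the helper names `missing_model_files` and `download_full_snapshot` are hypothetical. The download step uses `snapshot_download` from `huggingface_hub`, with `force_download=True` to bypass a possibly stale cache.

```python
from pathlib import Path

# Assumed file names; real repositories differ, so adjust these per model.
REQUIRED_FILES = ("config.json", "tokenizer_config.json")
VOCAB_FILES = ("tokenizer.json", "tokenizer.model", "vocab.json")


def missing_model_files(model_dir):
    """Return a list describing expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # At least one vocabulary file is expected; which one depends on the tokenizer type.
    if not any((root / name).is_file() for name in VOCAB_FILES):
        missing.append("one of: " + ", ".join(VOCAB_FILES))
    return missing


def download_full_snapshot(repo_id, local_dir):
    """Download every file of a repository into local_dir, ignoring cached copies."""
    # Imported lazily so the directory check works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir, force_download=True)
```

A possible workflow is to call `missing_model_files` on the directory first and call `download_full_snapshot` only when the returned list is not empty.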

Failed attempts

Common but ineffective approaches:

90% failure rate
The value must match the actual tokenizer vocabulary size; an arbitrary value will cause embedding dimension mismatches or runtime errors.
95% failure rate
The error is a configuration issue with the model, not a library installation problem.
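
Since the vocabulary size has to match the real tokenizer, one option is to derive it from the tokenizer instead of typing a number. The sketch below is a minimal illustration, assuming the model itself loads successfully; `align_vocab_size` and `load_aligned` are hypothetical helper names. It uses `resize_token_embeddings`, which in transformers resizes the embedding matrix and updates `config.vocab_size` together, and it only grows the embeddings, because some checkpoints deliberately pad the embedding matrix beyond the tokenizer size.

```python
def align_vocab_size(model, tokenizer):
    """Grow the embeddings when the tokenizer has more tokens than the config declares.

    Returns the vocabulary size recorded in the model config afterwards.
    """
    tokenizer_size = len(tokenizer)
    if tokenizer_size > model.config.vocab_size:
        # Resizes the embedding matrix and keeps config.vocab_size consistent with it.
        model.resize_token_embeddings(tokenizer_size)
    return model.config.vocab_size


def load_aligned(model_name):
    """Load a causal LM and its tokenizer, then align their vocabulary sizes."""
    # Imported lazily so align_vocab_size can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    align_vocab_size(model, tokenizer)
    return model, tokenizer
```

Compared with assigning `model.config.vocab_size` directly, resizing through the model avoids a config value that disagrees with the actual embedding shape.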