llm
config_error
ai_generated
true
KeyError: 'tokenizer_vocab_size' not found in model config for fine-tuning
ID: llm/tokenizer-vocab-mismatch-fine-tune
Fix rate: 85%
Confidence: 88%
Evidence count: 1
First seen: 2024-01-20
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| transformers 4.36.0 | active | — | — | — |
| transformers 4.37.0 | active | — | — | — |
| PyTorch 2.1.0 | active | — | — | — |
Root cause analysis
When a pre-trained model is loaded for fine-tuning with a custom tokenizer, the tokenizer's vocabulary size does not match the size of the model's embedding layer.
Official documentation
https://huggingface.co/docs/transformers/training
Solutions
- Resize the model's token embeddings to match the tokenizer before training: model.resize_token_embeddings(len(tokenizer))
- Use the default tokenizer that ships with the pre-trained model instead of a custom one
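The resize step can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes a tiny, randomly initialised BertModel so that it runs without downloading a checkpoint, and `custom_tokenizer_len` is a hypothetical stand-in for `len(tokenizer)` from a real custom tokenizer. The helper name `align_embeddings` is also invented for this example.

```python
from transformers import BertConfig, BertModel


def align_embeddings(model, tokenizer_len):
    """Resize the input embeddings only if they differ from the tokenizer length.

    Hypothetical helper for illustration; returns the resulting number of rows.
    """
    current_rows = model.get_input_embeddings().weight.shape[0]
    if current_rows != tokenizer_len:
        # resize_token_embeddings resizes the embedding matrix and
        # updates model.config.vocab_size to the new value.
        model.resize_token_embeddings(tokenizer_len)
    return model.get_input_embeddings().weight.shape[0]


# Assumption: a small random model standing in for a real pre-trained checkpoint.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)

# Assumption: stands in for len(tokenizer) of a custom tokenizer with 5 extra tokens.
custom_tokenizer_len = 105

new_rows = align_embeddings(model, custom_tokenizer_len)
print(new_rows, model.config.vocab_size)
```

With a real checkpoint, the same call would typically follow `from_pretrained` and any `tokenizer.add_tokens(...)` step. Rows added by the resize are newly initialised, so they only become useful after fine-tuning.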
Failed attempts
Common approaches that do not work:
- Setting tokenizer_vocab_size manually in the config to match the tokenizer size (fails 95% of the time). The embedding weights keep the shape they were created with; changing that shape requires model.resize_token_embeddings, not a config edit.
- Reinstalling the transformers package (fails 80% of the time). The error is configuration-related, not installation-related.
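To illustrate why a config edit alone does not help, the sketch below changes the vocabulary size on the config of an already-built model and then inspects the embedding matrix. It assumes a tiny, randomly initialised BertModel and uses the standard `vocab_size` attribute as a stand-in for the `tokenizer_vocab_size` key named in the error message; both choices are assumptions made for the example.

```python
from transformers import BertConfig, BertModel

# Assumption: a small random model standing in for a real pre-trained checkpoint.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)

# Failed attempt: overwrite the size in the config after the model exists.
model.config.vocab_size = 105

# The embedding matrix was allocated at construction time and is unchanged.
rows_after_config_edit = model.get_input_embeddings().weight.shape[0]
print(rows_after_config_edit)  # 100
```

The config now claims 105 tokens while the embedding matrix still has 100 rows, so the mismatch remains.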