llm
config_error
ai_generated
true
KeyError: 'tokenizer_vocab_size' not found in model config for fine-tuning
ID: llm/tokenizer-vocab-mismatch-fine-tune
85% Fix Rate
88% Confidence
1 Evidence
2024-01-20 First Seen
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| transformers 4.36.0 | active | — | — | — |
| transformers 4.37.0 | active | — | — | — |
| PyTorch 2.1.0 | active | — | — | — |
Root Cause
The tokenizer's vocabulary size does not match the size of the model's embedding layer. This typically happens when a pre-trained model is loaded for fine-tuning with a custom tokenizer.
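To confirm the mismatch described above before changing anything, the sizes can be compared directly. The following is a minimal sketch, not an official diagnostic: `vocab_size_report` is a hypothetical helper name, and it assumes `model` is a transformers `PreTrainedModel` (which provides `get_input_embeddings()` and a `config`) and `tokenizer` is a transformers tokenizer (where `len(tokenizer)` counts added tokens as well).

```python
def vocab_size_report(model, tokenizer):
    """Compare the tokenizer size with the model's embedding and config sizes.

    Hypothetical helper for illustration; it only reads values and changes nothing.
    """
    tokenizer_size = len(tokenizer)  # base vocabulary plus any added tokens
    embedding_rows = model.get_input_embeddings().num_embeddings
    config_vocab_size = getattr(model.config, "vocab_size", None)
    return {
        "tokenizer_size": tokenizer_size,
        "embedding_rows": embedding_rows,
        "config_vocab_size": config_vocab_size,
        # Token ids beyond the last embedding row cannot be looked up.
        "needs_resize": tokenizer_size > embedding_rows,
    }
```

If `needs_resize` is true, the tokenizer can emit ids that have no row in the embedding matrix, which is the condition the workarounds address.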
Official Documentation
https://huggingface.co/docs/transformers/training
Workarounds
- Resize the token embeddings before training: model.resize_token_embeddings(len(tokenizer)) (95% success)
- Use the default tokenizer that ships with the pre-trained model instead of a custom one (80% success)
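The first workaround can be sketched as a small helper. This is an illustration under stated assumptions rather than a definitive implementation: `grow_embeddings_to_tokenizer` is a hypothetical name, and the helper only grows the matrix, leaving it untouched when it is already at least as large as the tokenizer (some checkpoints pad the embedding matrix beyond the tokenizer size).

```python
def grow_embeddings_to_tokenizer(model, tokenizer):
    """Resize the model's token embeddings if the tokenizer has outgrown them.

    Hypothetical helper; returns the number of embedding rows after the call.
    """
    target = len(tokenizer)  # includes tokens added via tokenizer.add_tokens(...)
    current = model.get_input_embeddings().num_embeddings
    if target > current:
        # New rows are freshly initialized and still need to be trained.
        model.resize_token_embeddings(target)
    return model.get_input_embeddings().num_embeddings


# Typical use (assumes the transformers package; the checkpoint name is a placeholder):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("my-org/my-checkpoint")
# tokenizer.add_tokens(["<custom>"])
# model = AutoModelForCausalLM.from_pretrained("my-org/my-checkpoint")
# grow_embeddings_to_tokenizer(model, tokenizer)
```

Call the helper after all custom tokens have been added and before the training loop or `Trainer` is constructed, so that the optimizer sees the resized parameters.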
Dead Ends
Common approaches that don't work:
- Setting tokenizer_vocab_size manually in the config to match the tokenizer size (95% fail). Editing the config does not change the shape of the already-loaded embedding weights; resizing them requires model.resize_token_embeddings(), not a config change.
- Reinstalling the transformers package (80% fail). The error is configuration-related, not installation-related.