llm config_error ai_generated true

KeyError: 'tokenizer_vocab_size' not found in model config for fine-tuning

ID: llm/tokenizer-vocab-mismatch-fine-tune

Fix Rate: 85%

Confidence: 88%

Evidence: 1

First Seen: 2024-01-20

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
transformers 4.36.0	active	—	—	—
transformers 4.37.0	active	—	—	—
PyTorch 2.1.0	active	—	—	—

The tokenizer's vocabulary size does not match the size of the model's embedding layer when a pre-trained model is loaded for fine-tuning with a custom tokenizer.

generic

95% success Resize the model's token embeddings to match the tokenizer before training:
```
model.resize_token_embeddings(len(tokenizer))
```
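The resize step can be sketched end to end as follows. This is a minimal illustration, not the entry's verified fix: it assumes a tiny, randomly initialised GPT-2 model so it runs offline, and `tokenizer_len` is a hypothetical stand-in for `len(tokenizer)` after custom tokens have been added. The helper name `align_embeddings` is also an assumption.

```python
from transformers import GPT2Config, GPT2LMHeadModel


def align_embeddings(model, tokenizer_len):
    """Resize the token embeddings only if they differ from the tokenizer length."""
    current_rows = model.get_input_embeddings().weight.shape[0]
    if current_rows != tokenizer_len:
        # Allocates a new embedding matrix and updates model.config.vocab_size.
        model.resize_token_embeddings(tokenizer_len)
    return model.get_input_embeddings().weight.shape[0]


# Tiny random model (assumption: stands in for a real pre-trained checkpoint).
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
model = GPT2LMHeadModel(config)

# Hypothetical value of len(tokenizer) after adding 10 custom tokens.
tokenizer_len = 110
new_rows = align_embeddings(model, tokenizer_len)
```

With a real checkpoint, the call would be `align_embeddings(model, len(tokenizer))`, made after any `tokenizer.add_tokens(...)` calls and before training starts. The rows added by the resize are newly initialised, so they only become useful after fine-tuning.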
80% success Use the default tokenizer that ships with the pre-trained model instead of a custom one.
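One way to apply this is to load the tokenizer and the model from the same checkpoint and check that they agree. The sketch below assumes a causal language model; `load_matched_pair` is a hypothetical helper name, and the function needs network or cache access to the checkpoint when it is called.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_matched_pair(checkpoint):
    """Load a model and its own tokenizer from the same checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    rows = model.get_input_embeddings().weight.shape[0]
    # Some checkpoints pad the embedding matrix, so rows may exceed len(tokenizer).
    # The failing case is a tokenizer that can emit ids the model has no row for.
    if len(tokenizer) > rows:
        raise ValueError(
            f"tokenizer has {len(tokenizer)} tokens but the model has only {rows} embedding rows"
        )
    return model, tokenizer
```

For example, `model, tokenizer = load_matched_pair("gpt2")` would load both parts from the same source; the checkpoint name here is only an illustration.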

Common approaches that don't work:

Setting tokenizer_vocab_size manually in the config to match the tokenizer size 95% fail
The embedding weights are already allocated at the pre-trained size, so editing a config value does not resize them. Resizing requires model.resize_token_embeddings().
Reinstalling the transformers package 80% fail
The error comes from a configuration mismatch, not from a broken installation.
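The first failing approach can be checked directly. The sketch below assumes a tiny, randomly initialised GPT-2 model and uses the standard `vocab_size` config attribute in place of the `tokenizer_vocab_size` key named in the error, which is an assumption about where the size is stored. It compares the embedding shape after a config edit and after a real resize.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model (assumption: stands in for a real pre-trained checkpoint).
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
model = GPT2LMHeadModel(config)

rows_before = model.get_input_embeddings().weight.shape[0]

# Editing the config changes metadata only; the weight tensor is untouched.
model.config.vocab_size = 110
rows_after_config_edit = model.get_input_embeddings().weight.shape[0]

# Restore the config, then resize through the supported method.
model.config.vocab_size = rows_before
model.resize_token_embeddings(110)
rows_after_resize = model.get_input_embeddings().weight.shape[0]
```

Only the last step changes the number of embedding rows, which is why a config edit alone leaves the mismatch in place.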