Tags: llm, config_error · AI-generated: true

KeyError: 'tokenizer_vocab_size' not found in model config for fine-tuning

ID: llm/tokenizer-vocab-mismatch-fine-tune

Fix Rate: 85%
Confidence: 88%
Evidence: 1
First Seen: 2024-01-20

Version Compatibility

Version               Status   Introduced   Deprecated   Notes
transformers 4.36.0   active   -            -            -
transformers 4.37.0   active   -            -            -
PyTorch 2.1.0         active   -            -            -

Root Cause

The tokenizer's vocabulary size does not match the size of the model's embedding layer. This happens when a pre-trained model is loaded for fine-tuning with a custom tokenizer whose vocabulary differs from the one the model was trained with.
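To make the mismatch concrete, here is a minimal sketch using a bare PyTorch embedding table rather than a full model. The sizes (100 rows on the model side, 120 tokenizer entries) and the helper name has_vocab_mismatch are assumptions chosen for illustration, and the IndexError shown is only one way such a mismatch can surface on CPU.

```python
import torch

# Model side: an embedding table with 100 rows (illustrative size).
embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=16)

# Tokenizer side: a hypothetical custom tokenizer with 120 entries.
tokenizer_size = 120


def has_vocab_mismatch(embedding_rows, tokenizer_len):
    """Return True when the tokenizer and the embedding table disagree in size."""
    return embedding_rows != tokenizer_len


mismatch = has_vocab_mismatch(embedding.num_embeddings, tokenizer_size)

# A token id the tokenizer can emit but the embedding table has no row for.
out_of_range_id = torch.tensor([tokenizer_size - 1])
try:
    embedding(out_of_range_id)
    lookup_failed = False
except IndexError:
    # On CPU, looking up an id beyond the last row raises IndexError.
    lookup_failed = True

print(mismatch, lookup_failed)
```

Running a check like this before training is cheaper than discovering the problem partway through the first batch.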



Official Documentation

https://huggingface.co/docs/transformers/training

Workarounds

  1. Resize the model's token embeddings to the tokenizer's size before training: model.resize_token_embeddings(len(tokenizer)) (95% success)
  2. Use the default tokenizer that ships with the pre-trained model instead of a custom one (80% success)
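The first workaround can be sketched as follows. To keep the example runnable without downloading weights, it builds a tiny randomly initialised GPT-2 model and uses a plain list as a stand-in for the custom tokenizer; the helper name sync_embeddings and all sizes are assumptions for illustration, not part of the transformers API. In a real fine-tuning script the model and tokenizer would typically come from from_pretrained.

```python
from transformers import GPT2Config, GPT2LMHeadModel


def sync_embeddings(model, tokenizer):
    """Resize the model's token embeddings to len(tokenizer) if they differ."""
    target = len(tokenizer)  # for a real tokenizer this includes added tokens
    current = model.get_input_embeddings().weight.shape[0]
    if current != target:
        # Resizes the embedding matrix and updates model.config.vocab_size.
        model.resize_token_embeddings(target)
    return current, target


# Tiny randomly initialised model (illustrative sizes, no download needed).
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
model = GPT2LMHeadModel(config)

# Stand-in for a custom tokenizer with 120 entries; only len() is used here.
fake_tokenizer = [f"tok{i}" for i in range(120)]

old_size, new_size = sync_embeddings(model, fake_tokenizer)
print(old_size, new_size, model.config.vocab_size)
```

Rows added by the resize are newly initialised, so they only become useful after fine-tuning on data that contains the new tokens.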


Dead Ends

Common approaches that don't work:

  1. Setting tokenizer_vocab_size manually in the config to match the tokenizer size (95% fail)

    The embedding layer's weight matrix keeps its original shape; changing its size requires model.resize_token_embeddings, not a config edit.

  2. Reinstalling the transformers package (80% fail)

    The error is configuration-related, not installation-related.
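The first dead end can be checked with a small experiment. This sketch assumes a tiny randomly initialised GPT-2 model and uses the standard vocab_size config field as a stand-in for the tokenizer_vocab_size key named in the error; all sizes are illustrative.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny randomly initialised model (illustrative sizes, no download needed).
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
model = GPT2LMHeadModel(config)

rows_before = model.get_input_embeddings().weight.shape[0]

# Dead end: edit the config value only, after the model has been built.
model.config.vocab_size = 120

# The embedding matrix was allocated at construction time and is unchanged.
rows_after_config_edit = model.get_input_embeddings().weight.shape[0]

print(rows_before, rows_after_config_edit)
```

The config now claims a larger vocabulary than the weights actually hold, which hides the mismatch instead of fixing it.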