Tags: llm, config_error; ai_generated: true

KeyError: 'tokenizer_vocab_size' not found in model config

ID: llm/tokenizer-vocab-mismatch

Fix Rate: 90%
Confidence: 87%
Evidence: 1
First Seen: 2023-09-01

Version Compatibility

Version                Status   Introduced   Deprecated   Notes
transformers==4.35.0   active
transformers==4.38.0   active
llama-2-7b-hf          active
mistral-7b-v0.1        active

Root Cause

When fine-tuning or loading a model, the tokenizer configuration file is missing the 'tokenizer_vocab_size' key, often due to using a mismatched tokenizer or an incomplete model card from Hugging Face.
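
One way to confirm the root cause is to inspect the model's `config.json` directly. The following is a minimal sketch, assuming the model has been downloaded to a local directory; the helper name `report_vocab_keys` and the list of candidate keys are illustrative assumptions, not part of any library.

```python
import json
from pathlib import Path

# Assumption for illustration: 'tokenizer_vocab_size' is the key named in the
# error message, and 'vocab_size' is the key Hugging Face configs usually carry.
CANDIDATE_KEYS = ("tokenizer_vocab_size", "vocab_size")


def report_vocab_keys(model_dir):
    """Return {key: value or None} for each candidate key in config.json."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    # A value of None means the key is absent from the file.
    return {key: config.get(key) for key in CANDIDATE_KEYS}
```

A result where `tokenizer_vocab_size` is `None` but `vocab_size` is set would suggest that the calling code is asking for a key the checkpoint never defined, rather than that the download is corrupt.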

generic

Official Documentation

https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM

Workarounds

  1. 95% success: Load the tokenizer separately and set the config manually: `from transformers import AutoTokenizer; tokenizer = AutoTokenizer.from_pretrained('model_name'); model.config.vocab_size = len(tokenizer)`. Here `model` is the model object you have already loaded. Assigning `config.vocab_size` does not resize the weights; if the embedding matrix itself is too small, call `model.resize_token_embeddings(len(tokenizer))`, which also updates `config.vocab_size`.
  2. 85% success: Use a different model variant that includes the tokenizer config (e.g., prefer '-hf' variants from Hugging Face).
  3. 90% success: Download the full model directory, including tokenizer files, from Hugging Face instead of using a partial or cached version.
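
The first workaround can be sketched as a small helper. This is a hedged sketch rather than a definitive fix: `sync_vocab_size` and `load_and_sync` are hypothetical names, and the helper only grows the embeddings when the tokenizer has more entries than the model config reports. It relies on `resize_token_embeddings`, which resizes the embedding matrix and updates `config.vocab_size`.

```python
def sync_vocab_size(model, tokenizer):
    """Grow the model's embeddings if the tokenizer has more tokens.

    Returns True when a resize was performed, False otherwise.
    """
    target = len(tokenizer)
    if target <= model.config.vocab_size:
        # Equal or smaller tokenizer: leave the model as it is.
        return False
    # Resizes the embedding matrix and updates model.config.vocab_size.
    model.resize_token_embeddings(target)
    return True


def load_and_sync(model_name):
    """Load a causal LM and its tokenizer, then align their vocabulary sizes."""
    # Imported lazily so sync_vocab_size can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    sync_vocab_size(model, tokenizer)
    return model, tokenizer
```

The helper deliberately does not shrink the model: some checkpoints pad their embedding matrix beyond the tokenizer size, so a smaller tokenizer is not necessarily an error. For the third workaround, `huggingface_hub.snapshot_download` fetches a whole repository, which is one way to avoid working from a partial copy.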

Dead Ends

Common approaches that don't work:

  1. 90% fail: Setting the key to an arbitrary value. The value must match the actual tokenizer vocabulary size; an arbitrary value will cause embedding dimension mismatches or runtime errors.

  2. 95% fail: Treating the error as a library installation problem. The error is a configuration issue with the model, not a library installation problem.
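
To see why an arbitrary vocabulary size fails, consider a toy stand-in for an embedding matrix. This is a plain-Python illustration under made-up numbers, not the behaviour of any particular framework; `build_embedding_table`, `lookup`, and both sizes are hypothetical.

```python
def build_embedding_table(vocab_size, dim=4):
    """Toy embedding matrix: one row of `dim` floats per token id."""
    return [[0.0] * dim for _ in range(vocab_size)]


def lookup(table, token_ids):
    """Fetch the embedding row for each token id."""
    return [table[token_id] for token_id in token_ids]


real_vocab_size = 32000  # hypothetical size reported by the tokenizer
arbitrary_value = 30000  # hypothetical hand-picked config value

# The table is built from the hand-picked value, so it has too few rows.
table = build_embedding_table(arbitrary_value)
highest_token_id = real_vocab_size - 1

try:
    lookup(table, [highest_token_id])
    lookup_failed = False
except IndexError:
    # The tokenizer can emit ids the undersized table has no row for.
    lookup_failed = True
```

A real framework would fail in its own way, but the cause is the same: the tokenizer can produce ids that the embedding matrix does not cover.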