llm config_error ai_generated true

KeyError: 'tokenizer_vocab_size' not found in model config for fine-tuning

ID: llm/tokenizer-vocab-mismatch-fine-tune

Fix rate: 85%
Confidence: 88%
Evidence count: 1
First seen: 2024-01-20

Version compatibility

Version               Status   Introduced   Deprecated   Notes
transformers 4.36.0   active
transformers 4.37.0   active
PyTorch 2.1.0         active

Root cause analysis

When a pre-trained model is loaded for fine-tuning with a custom tokenizer, the tokenizer's vocabulary size does not match the size of the model's embedding layer.
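A quick way to confirm this root cause is to compare the three numbers involved before training starts. The sketch below is a minimal, hedged example: `vocab_report` is a hypothetical helper name, and it assumes the model follows the usual Hugging Face conventions (`get_input_embeddings()` returning an embedding with `num_embeddings`, and a config that normally exposes `vocab_size`).

```python
def vocab_report(model, tokenizer):
    """Collect the sizes that should agree before fine-tuning.

    `model` is assumed to behave like a transformers PreTrainedModel and
    `tokenizer` like a transformers tokenizer; nothing here is specific
    to one architecture.
    """
    # Number of rows actually allocated in the input embedding matrix.
    embedding_rows = model.get_input_embeddings().num_embeddings
    # Hugging Face configs normally call this field `vocab_size`;
    # fall back to None instead of raising if it is absent.
    config_vocab = getattr(model.config, "vocab_size", None)
    # len(tokenizer) counts tokens added after pre-training as well.
    tokenizer_len = len(tokenizer)
    return {
        "embedding_rows": embedding_rows,
        "config_vocab_size": config_vocab,
        "tokenizer_len": tokenizer_len,
        # Token ids beyond the embedding matrix cannot be looked up.
        "tokenizer_exceeds_embeddings": tokenizer_len > embedding_rows,
    }

# Typical use with transformers (not executed here; names are placeholders):
#   model = AutoModelForCausalLM.from_pretrained(checkpoint)
#   tokenizer = AutoTokenizer.from_pretrained(custom_tokenizer_path)
#   print(vocab_report(model, tokenizer))
```

If `tokenizer_exceeds_embeddings` is true, the custom tokenizer can emit ids the model has no embedding row for, which matches the mismatch described above.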

generic

Official documentation

https://huggingface.co/docs/transformers/training

Solutions

  1. Resize the model's token embeddings to the tokenizer's length before training: model.resize_token_embeddings(len(tokenizer))
  2. Use the default tokenizer that ships with the pre-trained model instead of a custom one
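The first solution can be wrapped in a small guard so the resize only happens when the sizes actually differ. This is a sketch under stated assumptions, not a definitive implementation: `sync_embeddings` is a hypothetical helper name, while `get_input_embeddings()` and `resize_token_embeddings()` are the transformers model methods the solution refers to.

```python
def sync_embeddings(model, tokenizer):
    """Resize the model's token embeddings to len(tokenizer) if they differ.

    Returns (rows_before, rows_after) so the caller can log what changed.
    """
    rows_before = model.get_input_embeddings().num_embeddings
    target = len(tokenizer)  # includes any tokens added to the tokenizer
    if rows_before != target:
        # Resizes the embedding matrix; rows for new tokens are newly
        # initialized and still need to be learned during fine-tuning.
        model.resize_token_embeddings(target)
    rows_after = model.get_input_embeddings().num_embeddings
    return rows_before, rows_after

# Typical use with transformers (not executed here; names are placeholders):
#   model = AutoModelForCausalLM.from_pretrained(checkpoint)
#   tokenizer = AutoTokenizer.from_pretrained(custom_tokenizer_path)
#   before, after = sync_embeddings(model, tokenizer)
```

Call it after the tokenizer is fully set up (all custom tokens added) and before the trainer or optimizer is built, so the optimizer sees the resized parameters.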

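Ineffective attempts

Common approaches that do not work:

  1. Setting tokenizer_vocab_size manually in the config to match the tokenizer size (95% failure rate)

    Editing the config does not change the embedding weights that are already loaded; they must be resized with model.resize_token_embeddings, not through a config change.

  2. Reinstalling the transformers package (80% failure rate)

    The error is configuration-related, not installation-related.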