ValueError:令牌索引序列长度超过指定的最大序列长度 — tiktoken与transformers不匹配
ValueError: Token indices sequence length is longer than the specified maximum sequence length — tiktoken vs transformers mismatch
ID: llm/tokenizer-encoding-mismatch-between-libraries
版本兼容性
| 版本 | 状态 | 引入 | 弃用 | 备注 |
|---|---|---|---|---|
| tiktoken==0.6.0 | active | — | — | — |
| transformers==4.38.0 | active | — | — | — |
| torch==2.2.0 | active | — | — | — |
| gpt-4-1106-preview | active | — | — | — |
| llama-2-7b-chat-hf | active | — | — | — |
根因分析
不同的分词库(tiktoken与Hugging Face transformers)对相同文本产生不同的令牌计数,导致在API之间切换时出现上下文窗口违规。
English
Different tokenization libraries (tiktoken vs Hugging Face transformers) produce different token counts for the same text, leading to context window violations when switching between APIs.
官方文档
https://github.com/openai/tiktoken解决方案
-
始终使用相同的分词库进行计数和编码。对于OpenAI模型,专门使用tiktoken;对于Hugging Face模型,使用transformers中的AutoTokenizer。
-
通过对样本运行两个分词器并应用校正因子(例如,将transformers计数乘以1.05作为安全边际)来校准令牌计数。
无效尝试
常见但无效的做法:
-
80% 失败
Using the same max_length parameter for both libraries without recalibration will cause truncation or errors.
-
90% 失败
Assuming tiktoken and transformers tokenizers are interchangeable for the same model (e.g., gpt-4) leads to incorrect token budget calculations.
-
85% 失败
Simply increasing max_length in transformers doesn't solve the mismatch because the tokenizer itself counts differently.