llm type_error ai_generated true

ValueError: Token indices sequence length is longer than the specified maximum sequence length — tiktoken vs transformers mismatch

ID: llm/tokenizer-encoding-mismatch-between-libraries

Fix rate: 88%

Confidence: 86%

Evidence count: 1

First seen: 2024-01-20

Version compatibility

Version	Status	Introduced	Deprecated	Notes
tiktoken==0.6.0	active	—	—	—
transformers==4.38.0	active	—	—	—
torch==2.2.0	active	—	—	—
gpt-4-1106-preview	active	—	—	—
llama-2-7b-chat-hf	active	—	—	—

Root cause analysis

Different tokenization libraries (tiktoken vs Hugging Face transformers) produce different token counts for the same text, because each model family ships its own vocabulary. A prompt budgeted with one library can therefore exceed the context window of a model counted by the other, which causes context window violations when switching between APIs.
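
One way to see the mismatch is to run the same text through each counter and compare the results. The sketch below is a minimal illustration, not part of either library: `compare_token_counts`, `count_spread`, and the stand-in counters are hypothetical names. In practice each callable would wrap a real tokenizer, for example `len(tiktoken.get_encoding("cl100k_base").encode(text))` on the OpenAI side, or `len(tokenizer.encode(text))` for a tokenizer loaded with `AutoTokenizer.from_pretrained`.

```python
from typing import Callable, Dict

TokenCounter = Callable[[str], int]


def compare_token_counts(text: str, counters: Dict[str, TokenCounter]) -> Dict[str, int]:
    """Count tokens for the same text with every supplied counter."""
    return {name: count(text) for name, count in counters.items()}


def count_spread(counts: Dict[str, int]) -> int:
    """Difference between the largest and the smallest reported count."""
    return max(counts.values()) - min(counts.values())


# Stand-in counters so the sketch runs without downloading any vocabulary.
# They are NOT real tokenizers; swap in tiktoken / transformers callables.
demo_counters: Dict[str, TokenCounter] = {
    "whitespace_words": lambda text: len(text.split()),
    "four_chars_per_token": lambda text: -(-len(text) // 4),  # ceiling division
}

counts = compare_token_counts("Tokenizers rarely agree on length.", demo_counters)
print(counts, "spread:", count_spread(counts))
```

A non-zero spread on representative prompts is the signal that a budget computed with one counter cannot be reused for the other.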

generic

Official documentation

https://github.com/openai/tiktoken

Solutions

Always use the same tokenization library for both counting and encoding. For OpenAI models, use tiktoken exclusively; for Hugging Face models, use AutoTokenizer from transformers.
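
A minimal sketch of that rule follows. It assumes that a model name starting with `gpt-` means an OpenAI model and that anything else is a Hugging Face repository id; that prefix check, and the helper names `make_token_counter` and `check_budget`, are illustrative assumptions rather than an API of either library. The libraries are imported lazily so that only the one actually needed has to be installed.

```python
from typing import Callable

TokenCounter = Callable[[str], int]


def make_token_counter(model_name: str) -> TokenCounter:
    """Build a counter backed by the tokenizer that belongs to the model."""
    if model_name.startswith("gpt-"):  # assumption: prefix identifies OpenAI models
        import tiktoken

        encoding = tiktoken.encoding_for_model(model_name)
        return lambda text: len(encoding.encode(text))

    from transformers import AutoTokenizer

    # Downloads the tokenizer files on first use; gated repos need credentials.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # encode() includes special tokens by default, so they count toward the total.
    return lambda text: len(tokenizer.encode(text))


def check_budget(
    text: str,
    count_tokens: TokenCounter,
    max_tokens: int,
    reserved_for_output: int = 0,
) -> int:
    """Return the prompt token count, or raise if it cannot fit the window."""
    used = count_tokens(text)
    if used + reserved_for_output > max_tokens:
        raise ValueError(
            f"prompt uses {used} tokens plus {reserved_for_output} reserved, "
            f"which exceeds the limit of {max_tokens}"
        )
    return used
```

Usage would look like `check_budget(prompt, make_token_counter("gpt-4-1106-preview"), max_tokens=limit)`, where `limit` is whatever context size applies to the target model. The point of the design is that the counter and the limit always travel together with the model they belong to.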

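Calibrate token counts by running both tokenizers on a sample of representative text and applying a correction factor (for example, multiplying the transformers count by 1.05 as a safety margin).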
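
The calibration step could be sketched as below. This is one possible interpretation under stated assumptions: it takes the worst-case ratio between the two counters over the samples and then multiplies by the safety margin, whereas an average ratio would be an equally plausible choice. `calibrate_factor` and `estimate_target_tokens` are hypothetical helper names, and the counters are plain callables so that any tokenizer can be plugged in.

```python
import math
from typing import Callable, Iterable

TokenCounter = Callable[[str], int]


def calibrate_factor(
    samples: Iterable[str],
    source_count: TokenCounter,
    target_count: TokenCounter,
    safety_margin: float = 1.05,
) -> float:
    """Worst-case target/source count ratio over the samples, times a margin."""
    ratios = []
    for text in samples:
        source_tokens = source_count(text)
        if source_tokens == 0:
            continue  # skip samples that would divide by zero
        ratios.append(target_count(text) / source_tokens)
    if not ratios:
        raise ValueError("no usable samples to calibrate on")
    return max(ratios) * safety_margin


def estimate_target_tokens(text: str, source_count: TokenCounter, factor: float) -> int:
    """Estimate the target tokenizer's count from the source count, rounded up."""
    return math.ceil(source_count(text) * factor)
```

An estimate produced this way is only as good as the samples; text that differs from them (other languages, code, long numbers) can exceed the factor, so the exact count from the target model's own tokenizer remains the safer check when it is available.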

Failed attempts

Common but ineffective approaches:

80% failure rate
Using the same max_length parameter for both libraries without recalibration will cause truncation or errors.
90% failure rate
Assuming tiktoken and transformers tokenizers are interchangeable for the same model (e.g., gpt-4) leads to incorrect token budget calculations.
85% failure rate
Simply increasing max_length in transformers doesn't solve the mismatch because the tokenizer itself counts differently.