llm type_error ai_generated true

ValueError: Token indices sequence length is longer than the specified maximum sequence length — tiktoken vs transformers mismatch

ID: llm/tokenizer-encoding-mismatch-between-libraries

Fix Rate: 88%

Confidence: 86%

Evidence: 1

First Seen: 2024-01-20

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
tiktoken==0.6.0	active	—	—	—
transformers==4.38.0	active	—	—	—
torch==2.2.0	active	—	—	—
gpt-4-1106-preview	active	—	—	—
llama-2-7b-chat-hf	active	—	—	—

Root Cause

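tiktoken and Hugging Face transformers tokenizers generally produce different token counts for the same text, because each model family ships its own vocabulary and encoding rules (for example, the encoding tiktoken selects for gpt-4-1106-preview versus the tokenizer bundled with llama-2-7b-chat-hf). A token budget computed with one library and then enforced against a model that uses the other can therefore exceed the context window when switching between APIs.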

generic

Official Documentation

https://github.com/openai/tiktoken

Workarounds

95% success: Count tokens with the same tokenizer that the target model uses to encode its input. For OpenAI models, use tiktoken exclusively; for Hugging Face models, use AutoTokenizer from transformers.
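A minimal sketch of this workaround, assuming `tiktoken` and `transformers` are installed and that the Hugging Face model id you pass is one you have access to. The helper names (`tiktoken_encoder`, `hf_encoder`, `count_tokens`, `fits`) are hypothetical and not part of either library; the idea is only that one encoder is chosen per model and reused for every count.

```python
from typing import Callable, Sequence

# An encoder maps text to a sequence of token ids.
Encoder = Callable[[str], Sequence[int]]


def tiktoken_encoder(model_name: str) -> Encoder:
    """Encoder for an OpenAI model name (hypothetical helper)."""
    import tiktoken  # imported lazily so the sketch loads without it

    return tiktoken.encoding_for_model(model_name).encode


def hf_encoder(model_id: str) -> Encoder:
    """Encoder for a Hugging Face model id (hypothetical helper)."""
    from transformers import AutoTokenizer  # imported lazily

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Include special tokens so the count matches what the model receives.
    return lambda text: tokenizer.encode(text, add_special_tokens=True)


def count_tokens(text: str, encode: Encoder) -> int:
    """Count tokens with the encoder that belongs to the target model."""
    return len(encode(text))


def fits(text: str, encode: Encoder, max_tokens: int) -> bool:
    """True if the text stays within the chosen token budget."""
    return count_tokens(text, encode) <= max_tokens
```

For example, `fits(prompt, tiktoken_encoder("gpt-4-1106-preview"), limit)` checks an OpenAI prompt against a budget `limit` of your choosing. The first call to either factory may download tokenizer files.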
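80% success: If the target model's tokenizer cannot be run at request time, calibrate instead: run representative samples through both tokenizers, measure the ratio between their counts, and apply a correction factor with a safety margin (e.g., multiply the transformers count by 1.05). The result is an estimate only, since the ratio varies with language and content.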
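The calibration step could be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: `calibration_factor` and `estimate_tokens` are hypothetical names, the encoders are any callables that turn text into tokens (for instance the `encode` methods of a tiktoken encoding and a transformers tokenizer), and taking the worst observed ratio before adding the margin is a conservative choice made here rather than something the entry specifies.

```python
import math
from typing import Callable, Iterable, Sequence

Encoder = Callable[[str], Sequence]


def calibration_factor(
    samples: Iterable[str],
    reference: Encoder,
    target: Encoder,
    margin: float = 0.05,
) -> float:
    """Worst observed target/reference count ratio, padded by `margin`."""
    ratios = []
    for text in samples:
        reference_count = len(reference(text))
        if reference_count == 0:
            continue  # skip samples that give no basis for a ratio
        ratios.append(len(target(text)) / reference_count)
    if not ratios:
        raise ValueError("no usable samples to calibrate on")
    return max(ratios) * (1.0 + margin)


def estimate_tokens(text: str, reference: Encoder, factor: float) -> int:
    """Estimate the target tokenizer's count from the reference count."""
    return math.ceil(len(reference(text)) * factor)
```

Calibrate on text that resembles production traffic; a factor measured on English prose may underestimate counts for code or for other languages.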

Dead Ends

Common approaches that don't work:

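80% fail
Using the same max_length value with both libraries without recalibration causes truncation or length errors, because the same number of tokens covers a different amount of text in each tokenizer.
90% fail
Assuming tiktoken and transformers counts are interchangeable (e.g., budgeting a prompt with the gpt-4 encoding and then sending it to a model loaded through transformers) leads to incorrect token budget calculations.
85% fail
Simply increasing max_length in transformers doesn't solve the mismatch: the tokenizer still counts differently, and the model's context window does not grow with the parameter.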