# ValueError: Token indices sequence length is longer than the specified maximum sequence length — tiktoken vs transformers mismatch

- **ID:** `llm/tokenizer-encoding-mismatch-between-libraries`
- **Domain:** llm
- **Category:** type_error
- **Verification:** ai_generated
- **Fix Rate:** 88%

## Root Cause

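tiktoken and Hugging Face transformers load different vocabularies and merge rules, so the same text encodes to a different number of tokens in each. For example, tiktoken resolves `gpt-4` to the `cl100k_base` encoding, while `llama-2-7b-chat-hf` ships its own SentencePiece-based tokenizer with a 32,000-entry vocabulary. A token budget computed with one library therefore does not hold for a model served through the other, and the context window is exceeded when switching between APIs. transformers emits this message, normally as a logged warning rather than a raised exception, when an encoded sequence is longer than the tokenizer's `model_max_length`.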
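
The gap is easy to measure directly. The sketch below counts the same text with each library; the helper names (`tiktoken_counter`, `hf_counter`, `compare_counts`) are hypothetical, and the Hub ID `meta-llama/Llama-2-7b-chat-hf` is assumed to be the checkpoint meant by `llama-2-7b-chat-hf` (it is gated and requires access approval).

```python
from typing import Callable, Dict

TokenCounter = Callable[[str], int]


def tiktoken_counter(model: str = "gpt-4") -> TokenCounter:
    """Build a counter from tiktoken's encoding for an OpenAI model name."""
    import tiktoken  # lazy import: only needed if this counter is used

    enc = tiktoken.encoding_for_model(model)
    return lambda text: len(enc.encode(text))


def hf_counter(model_id: str) -> TokenCounter:
    """Build a counter from a Hugging Face checkpoint's own tokenizer."""
    from transformers import AutoTokenizer  # lazy import, as above

    tok = AutoTokenizer.from_pretrained(model_id)
    # add_special_tokens=False counts the text only, without BOS/EOS markers.
    return lambda text: len(tok.encode(text, add_special_tokens=False))


def compare_counts(text: str, counters: Dict[str, TokenCounter]) -> Dict[str, int]:
    """Return the token count each named counter reports for the same text."""
    return {name: count(text) for name, count in counters.items()}


# Example wiring (downloads tokenizer files on first use):
# counts = compare_counts(
#     "Tokenizers rarely agree on long, technical prompts.",
#     {
#         "tiktoken/gpt-4": tiktoken_counter("gpt-4"),
#         "transformers/llama-2": hf_counter("meta-llama/Llama-2-7b-chat-hf"),
#     },
# )
```

The two counts will generally differ, and the size of the difference depends on the text (code, non-English text, and unusual whitespace tend to diverge more than plain English).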

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| tiktoken==0.6.0 | active | — | — |
| transformers==4.38.0 | active | — | — |
| torch==2.2.0 | active | — | — |
| gpt-4-1106-preview | active | — | — |
| llama-2-7b-chat-hf | active | — | — |

## Workarounds

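1. **Count and encode with the tokenizer the target model actually uses, and never mix libraries within one budget calculation. For OpenAI models, use tiktoken exclusively (`tiktoken.encoding_for_model`); for Hugging Face models, use `AutoTokenizer.from_pretrained` from transformers with the model's own checkpoint.** (95% success)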
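2. **If one tokenizer's count must stand in for the other's, calibrate it: run a representative sample through both tokenizers, measure the ratio between the counts, and apply that ratio as a correction factor with a safety margin (e.g., multiply the transformers count by 1.05).** (80% success)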
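
The calibration in workaround 2 can be sketched as follows. This is a minimal illustration, not a prescribed method: `calibration_factor` and `estimate_target_tokens` are hypothetical names, the counters are any callables that return a token count (for instance, built on tiktoken or `AutoTokenizer`), and taking the worst-case ratio rather than the mean is an assumption chosen to err on the safe side.

```python
import math
from typing import Callable, Iterable

TokenCounter = Callable[[str], int]


def calibration_factor(
    samples: Iterable[str],
    source_count: TokenCounter,
    target_count: TokenCounter,
    margin: float = 1.05,
) -> float:
    """Worst-case target/source token ratio over the samples, times a margin."""
    ratios = []
    for text in samples:
        src = source_count(text)
        if src > 0:  # skip samples the source tokenizer counts as empty
            ratios.append(target_count(text) / src)
    if not ratios:
        raise ValueError("need at least one non-empty sample to calibrate")
    return max(ratios) * margin


def estimate_target_tokens(text: str, source_count: TokenCounter, factor: float) -> int:
    """Estimate the target tokenizer's count from the source tokenizer's count."""
    return math.ceil(source_count(text) * factor)
```

The factor is only as good as the sample: calibrate on text that resembles production input, and recalibrate when the language or content type changes. When the target tokenizer is available at runtime, counting with it directly (workaround 1) avoids the estimate altogether.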

## Dead Ends

- **Using the same `max_length` parameter for both libraries without recalibration.** This causes truncation or errors. (80% fail)
- **Assuming tiktoken and transformers tokenizers are interchangeable for the same model (e.g., gpt-4).** This leads to incorrect token budget calculations. (90% fail)
- **Simply increasing `max_length` in transformers.** This doesn't solve the mismatch, because the tokenizer itself counts differently. (85% fail)
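
What these dead ends have in common is a limit expressed in one tokenizer's tokens and enforced with another. A hedged alternative is to trim the prompt with the target model's own `encode`/`decode` pair. In the sketch below, `truncate_to_budget` and its parameters are hypothetical, and the commented wiring assumes Llama 2's 4,096-token context window.

```python
from typing import Callable, List


def truncate_to_budget(
    text: str,
    encode: Callable[[str], List],
    decode: Callable[[List], str],
    context_window: int,
    reserved_for_output: int = 0,
) -> str:
    """Trim text so its token count, per the target tokenizer, fits the budget."""
    budget = context_window - reserved_for_output
    if budget <= 0:
        raise ValueError("reserved_for_output leaves no room for the prompt")
    ids = encode(text)
    if len(ids) <= budget:
        return text  # already fits; leave the text untouched
    return decode(ids[:budget])  # keep the first `budget` tokens


# Hypothetical wiring for a Hugging Face model:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# prompt = truncate_to_budget(
#     long_text,
#     encode=lambda t: tok.encode(t, add_special_tokens=False),
#     decode=tok.decode,
#     context_window=4096,
#     reserved_for_output=512,
# )
```

Special tokens and chat templates add tokens on top of the raw text, and decoding followed by re-encoding does not always reproduce the exact same count, so it is prudent to leave some headroom in `reserved_for_output`.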
