# ValueError: Encountered unknown token ID 100000 in cached sequence. Tokenizer vocabulary mismatch.

- **ID:** `llm/tokenizer-mismatch-cache-miss`
- **Domain:** llm
- **Category:** data_error
- **Verification:** ai_generated
- **Fix Rate:** 80%

## Root Cause

The cached tokenized sequences were produced by a different tokenizer version, or by a different model's tokenizer, so they contain token IDs (here, 100000) that do not exist in the current tokenizer's vocabulary.
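
A quick way to confirm this diagnosis is to scan the cache for IDs that fall outside the current vocabulary. The following is a minimal sketch, not part of any library: `find_out_of_vocab_ids` and the sample data are hypothetical, and the bound of 32000 assumes the Llama 2 vocabulary size. With a Hugging Face tokenizer, `len(tokenizer)` would typically supply the bound.

```python
from typing import Iterable, List, Tuple


def find_out_of_vocab_ids(
    cached_sequences: Iterable[List[int]], vocab_size: int
) -> List[Tuple[int, int, int]]:
    """Return (sequence_index, position, token_id) for every ID outside [0, vocab_size)."""
    bad = []
    for seq_idx, seq in enumerate(cached_sequences):
        for pos, token_id in enumerate(seq):
            if not 0 <= token_id < vocab_size:
                bad.append((seq_idx, pos, token_id))
    return bad


# Hypothetical cached data; 32000 is assumed as the current vocabulary size.
cached = [[1, 15043, 3186], [1, 100000, 2]]
problems = find_out_of_vocab_ids(cached, vocab_size=32000)
print(problems)  # [(1, 1, 100000)]
```

This check only catches IDs that are out of range. A cache built with a different tokenizer of the same vocabulary size would pass it while still decoding to the wrong text.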

## Version Compatibility

| Component | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| transformers 4.35.0 | active | — | — |
| tokenizers 0.14.0 | active | — | — |
| vLLM 0.3.0 | active | — | — |
| Llama 2 70B | active | — | — |
| Mistral 7B | active | — | — |

## Workarounds

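1. **Invalidate the cache and regenerate it with the current tokenizer. For Hugging Face datasets, call `dataset.cleanup_cache_files()` to delete stale cache files, re-run tokenization with the current tokenizer, save the result to a new path, and reload it with `load_from_disk(new_path)`.** (90% success)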
2. **If using vLLM or a similar inference engine, disable token caching and accept slightly slower inference. This entry lists the environment variable `VLLM_USE_PREEMPTIVE_MODE=1` for that purpose; confirm that your vLLM version recognizes it before relying on it.** (75% success)
3. **Pin the tokenizer version by specifying `tokenizers==0.14.0` in requirements.txt, and use the same model snapshot (e.g., `model_name_or_path='meta-llama/Llama-2-7b-hf'`) for both cache generation and inference.** (85% success)
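
Workarounds 1 and 3 can be combined by stamping each cache with a fingerprint of the tokenizer that produced it, then regenerating whenever the fingerprint no longer matches. The sketch below is one possible approach under stated assumptions, not an API of `datasets`, `tokenizers`, or vLLM: the helper names, the marker file `tokenizer_fingerprint.txt`, and the toy vocabularies are all hypothetical. With a Hugging Face tokenizer, `tokenizer.get_vocab()` could supply the token-to-ID mapping.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict


def tokenizer_fingerprint(name: str, version: str, vocab: Dict[str, int]) -> str:
    """Hash the tokenizer identity together with its full token-to-ID mapping."""
    payload = json.dumps(
        {"name": name, "version": version, "vocab": vocab},
        sort_keys=True,
        ensure_ascii=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_is_valid(cache_dir: Path, fingerprint: str) -> bool:
    """True only if the cache was stamped with the same fingerprint."""
    marker = cache_dir / "tokenizer_fingerprint.txt"
    return marker.exists() and marker.read_text().strip() == fingerprint


def stamp_cache(cache_dir: Path, fingerprint: str) -> None:
    """Record the fingerprint next to the cached sequences after (re)generation."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / "tokenizer_fingerprint.txt").write_text(fingerprint)


# Demonstration with toy vocabularies (hypothetical values).
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "tokenized"
    old = tokenizer_fingerprint("example-model", "0.14.0", {"<s>": 0, "hello": 1})
    stamp_cache(cache_dir, old)
    new = tokenizer_fingerprint(
        "example-model", "0.15.0", {"<s>": 0, "hello": 1, "world": 2}
    )
    same_tokenizer_ok = cache_is_valid(cache_dir, old)
    changed_tokenizer_ok = cache_is_valid(cache_dir, new)
    print(same_tokenizer_ok)     # True
    print(changed_tokenizer_ok)  # False
```

Hashing the vocabulary as well as the version string means the cache is also rejected when the model changes but the pinned library version does not.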

## Dead Ends

- **Clearing and regenerating the cache without aligning the tokenizer and model versions**: The cache will be regenerated using the same mismatched tokenizer, and the error will reappear when the tokenizer or model is updated again. (60% fail)
- **Mapping unknown token IDs to `[UNK]`**: Replacing unknown IDs with `[UNK]` discards semantic meaning and degrades model output quality, and the tokenizer may still reject the IDs during decoding. (80% fail)
- **Switching between the fast and slow tokenizer implementations**: The error is not related to fast vs. slow tokenizer mode; it is a vocabulary mismatch that persists across both modes. (90% fail)
