# ValueError: Unable to load tokenizer cache from /home/user/.cache/huggingface/hub (file is corrupted or truncated)

- **ID:** `llm/tokenizer-cache-corruption-multiprocessing`
- **Domain:** llm
- **Category:** resource_error
- **Verification level:** ai_generated
- **Fix rate:** 88%

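## Root cause

Multiple processes (for example, DataLoader workers with `num_workers > 1`) download or write to the Hugging Face tokenizer cache at the same time. The resulting race condition corrupts the cached tokenizer files.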
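
To confirm that truncation is what you are seeing, it can help to check which cached files no longer parse. The following is a minimal sketch, assuming the damaged files are JSON files such as `tokenizer.json`; the helper name `find_unparseable_json` is hypothetical and not part of any Hugging Face library.

```python
import json
from pathlib import Path
from typing import List


def find_unparseable_json(cache_dir: str) -> List[Path]:
    """Return every *.json file under cache_dir that fails to parse."""
    damaged = []
    for path in sorted(Path(cache_dir).rglob("*.json")):
        if not path.is_file():
            continue  # skip directories and dangling symlinks
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # A truncated or partially written file ends up here.
            damaged.append(path)
    return damaged
```

Calling `find_unparseable_json("/home/user/.cache/huggingface/hub")` lists candidate files to inspect. It only detects damage; it does not address the concurrent writes that cause it.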

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| transformers==4.38.0 | active | — | — |
| tokenizers==0.15.2 | active | — | — |
| torch==2.2.0 | active | — | — |

## Solutions

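1. Serialize cache writes with a local lock file so that only one process at a time downloads or writes the tokenizer files. The original recipe also sets `HF_HUB_ENABLE_HF_TRANSFER=1`; that variable only switches downloads to the optional `hf_transfer` backend, which must be installed separately, and does not prevent concurrent writes by itself, so the lock is the part that matters. Example:

   ```python
   from filelock import FileLock  # third-party: pip install filelock
   from transformers import AutoTokenizer

   # Every process must use the same lock path for the writes to be serialized.
   lock = FileLock('/tmp/hf_cache.lock')
   with lock:
       tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
   ```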
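2. Set `num_workers=0` on the DataLoader to disable multiprocessing for tokenization. This forces all tokenization to run in the main process, at the cost of data-loading throughput.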
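3. Pre-download the tokenizer before spawning workers: run `AutoTokenizer.from_pretrained()` once in the main process before creating the DataLoader. This ensures the cache is populated before any worker accesses it.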
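
The pre-download approach in fix 3 can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the names `load_cached_tokenizer` and `LazyTokenizedTexts` are hypothetical, and the sketch assumes the main process has already called `AutoTokenizer.from_pretrained()` once. Workers then load with `local_files_only=True`, so they read the cache but never download into it.

```python
from typing import Any, Callable, Sequence


def load_cached_tokenizer(model_name: str) -> Any:
    """Load a tokenizer from the local cache only; never download."""
    # Imported lazily so the module itself does not require transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name, local_files_only=True)


class LazyTokenizedTexts:
    """Map-style dataset that creates its tokenizer on first access.

    Hypothetical helper: the tokenizer is built inside whichever process
    first calls __getitem__, so each DataLoader worker gets its own copy.
    """

    def __init__(
        self,
        texts: Sequence[str],
        model_name: str,
        loader: Callable[[str], Any] = load_cached_tokenizer,
    ) -> None:
        self.texts = list(texts)
        self.model_name = model_name
        self._loader = loader
        self._tokenizer = None  # filled in lazily, once per process

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, index: int) -> Any:
        if self._tokenizer is None:
            self._tokenizer = self._loader(self.model_name)
        return self._tokenizer(self.texts[index])
```

In a training script, the main process would first run `AutoTokenizer.from_pretrained('bert-base-uncased')` to fill the cache, then pass `LazyTokenizedTexts(texts, 'bert-base-uncased')` to the DataLoader with `num_workers` greater than zero. Padding and batch collation are left out of the sketch.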

## Failed attempts

- **Setting `TOKENIZERS_PARALLELISM=false` in the environment:** This only disables tokenizer parallelism within a single process. It does not prevent cache corruption when multiple processes write to the same cache directory. (70% failure rate)
- **Manually deleting the cache file and re-running:** The corruption recurs as long as the root cause (concurrent writes) is not addressed, so the cache is corrupted again on the next run with multiple workers. (90% failure rate)
- **Increasing `HF_HUB_DOWNLOAD_TIMEOUT`:** The problem is concurrent writes, not timeouts. A longer timeout does not prevent race conditions. (95% failure rate)