ValueError: Could not load tokenizer cache from /home/user/.cache/huggingface/hub — file is corrupted or truncated
ID: llm/tokenizer-cache-corruption-multiprocessing
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| transformers==4.38.0 | active | — | — | — |
| tokenizers==0.15.2 | active | — | — | — |
| torch==2.2.0 | active | — | — | — |
Root cause analysis
Multiple processes (e.g., DataLoader workers with num_workers > 1) concurrently download or write to the Hugging Face tokenizer cache. The resulting race condition leaves the cached tokenizer files corrupted or truncated.
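To make the failure mode concrete, the sketch below simulates a reader that opens a cache file while a writer is still halfway through it. It uses only the standard library and a made-up JSON payload; the file name tokenizer.json and the helper names read_status and simulate_partial_write are illustrative assumptions, not part of transformers or huggingface_hub.

```python
import json
import os
import tempfile


def read_status(path):
    """Return 'ok' if the file parses as JSON, otherwise 'truncated'."""
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
        return "ok"
    except json.JSONDecodeError:
        return "truncated"


def simulate_partial_write():
    """Write a JSON file in two halves and read it in between.

    The read in the middle stands in for a second process that opens the
    cache file while the first process is still writing it.
    """
    payload = json.dumps({"vocab": {"hello": 0, "world": 1}})  # hypothetical content
    path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
    half = len(payload) // 2

    with open(path, "w", encoding="utf-8") as f:
        f.write(payload[:half])
        f.flush()                         # first half is now visible to other readers
        mid_write = read_status(path)     # what a concurrent reader would see
        f.write(payload[half:])

    after_write = read_status(path)       # writer finished; file is complete
    return mid_write, after_write
```

Calling simulate_partial_write() returns the status seen during the write and the status seen after it. The real cache involves downloads and several files, so this is only an analogy for why a worker can observe a truncated file.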
Official documentation
https://huggingface.co/docs/huggingface_hub/en/guides/cache#cache-corruption
Solutions
-
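Set the environment variable HF_HUB_ENABLE_HF_TRANSFER=1 and serialize cache writes with a local lock file. Note that HF_HUB_ENABLE_HF_TRANSFER only switches downloads to the hf_transfer backend (which must be installed separately); the file lock is what prevents concurrent writes. Example:

    import os
    # Must be set before huggingface_hub is imported; requires the hf_transfer package
    os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'

    import filelock
    from transformers import AutoTokenizer

    # Guard the cache directory with a file lock
    lock = filelock.FileLock('/tmp/hf_cache.lock')
    with lock:
        tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
-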
Set num_workers=0 on the DataLoader to disable multiprocessing for tokenization, forcing all tokenization to run in the main process.
-
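Pre-download the tokenizer before spawning workers: run AutoTokenizer.from_pretrained() once in the main process, before creating the DataLoader. This ensures the cache is populated before any worker accesses it.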
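The locking and pre-download approaches above can be combined. The following is a minimal sketch, assuming a POSIX system (it uses fcntl.flock from the standard library rather than the filelock package) and the model name bert-base-uncased from the example above. The helper names cache_lock, load_serialized, and load_tokenizer are hypothetical, and the default lock path under the system temp directory is an arbitrary choice.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def cache_lock(lock_path):
    """Hold an exclusive advisory lock (POSIX flock) for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until any other holder releases
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def load_serialized(load_fn, lock_path=None):
    """Run load_fn while holding the lock, so one process at a time touches the cache."""
    if lock_path is None:
        lock_path = os.path.join(tempfile.gettempdir(), "hf_cache.lock")
    with cache_lock(lock_path):
        return load_fn()


def load_tokenizer(name="bert-base-uncased", local_only=False):
    """Load a tokenizer; with local_only=True, read from the cache without downloading."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(name, local_files_only=local_only)


# Intended use (not executed here):
#   1. In the main process, before building the DataLoader:
#        tokenizer = load_serialized(load_tokenizer)
#   2. In each worker, if it must load its own copy:
#        tokenizer = load_serialized(lambda: load_tokenizer(local_only=True))
```

Warming the cache in step 1 means workers only ever read files that are already complete. The lock in step 2 is then a second line of defense rather than the main protection.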
Failed attempts
Common but ineffective approaches:
-
Setting TOKENIZERS_PARALLELISM=false in the environment
70% failure rate
This only disables tokenizer parallelism within a single process; it does not prevent cache corruption when multiple processes write to the same cache directory.
-
Manually deleting the cache file and re-running
90% failure rate
The corruption will recur if the root cause (concurrent writes) is not addressed. The cache will be corrupted again on the next run with multiple workers.
-
Increasing HF_HUB_DOWNLOAD_TIMEOUT
95% failure rate
Timeout is not the issue; the issue is concurrent writes. Increasing timeout does not prevent race conditions.