llm resource_error ai_generated true

ValueError: Could not load tokenizer cache from /home/user/.cache/huggingface/hub — file is corrupted or truncated

ID: llm/tokenizer-cache-corruption-multiprocessing

Fix Rate: 88%
Confidence: 85%
Evidence: 1
First Seen: 2024-01-10

Version Compatibility

Version               Status  Introduced  Deprecated  Notes
transformers==4.38.0  active  -           -           -
tokenizers==0.15.2    active  -           -           -
torch==2.2.0          active  -           -           -

Root Cause

Multiple processes (e.g., DataLoader workers with num_workers > 1) concurrently download or write to the Hugging Face tokenizer cache, causing race conditions that corrupt the cached tokenizer files.
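
The failure mode can be illustrated with a small toy, offered only as a sketch: it is a deterministic, single-process simulation of two uncoordinated writers sharing one file, not a reproduction of what huggingface_hub does internally. The function name `interleaved_write` and the payloads are hypothetical.

```python
import os
import tempfile


def interleaved_write(path, payload_a, payload_b):
    """Simulate two writers touching the same file with no coordination."""
    half = len(payload_a) // 2

    # Writer A opens the file and writes the first half of its payload.
    writer_a = open(path, "w", encoding="utf-8")
    writer_a.write(payload_a[:half])
    writer_a.flush()

    # Writer B opens the same path (which truncates it) and writes everything.
    writer_b = open(path, "w", encoding="utf-8")
    writer_b.write(payload_b)
    writer_b.close()

    # Writer A resumes at its old offset and overwrites the tail of B's data.
    writer_a.write(payload_a[half:])
    writer_a.close()

    with open(path, encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
    print(interleaved_write(target, "A" * 16, "B" * 16))
```

The file ends up holding a mix of both payloads that neither writer intended, which is the kind of result a loader would report as corrupted or truncated.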



Official Documentation

https://huggingface.co/docs/huggingface_hub/en/guides/cache#cache-corruption

Workarounds

  1. 95% success
    Serialize cache writes with a file lock that every process loading the tokenizer shares. Setting the environment variable HF_HUB_ENABLE_HF_TRANSFER=1 is optional: it only switches downloads to the hf_transfer backend, which must be installed separately, and adds no locking of its own. The lock is what prevents concurrent writes. Example:
    
    import os
    # Optional: faster downloads via hf_transfer (pip install hf_transfer).
    # Must be set before transformers/huggingface_hub is imported.
    os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'

    import filelock
    from transformers import AutoTokenizer

    # Every process must use the same lock path for the loads to be serialized.
    lock = filelock.FileLock('/tmp/hf_cache.lock')
    with lock:
        tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
  2. 85% success
    Set num_workers=0 in the DataLoader to disable multiprocessing, so that all tokenization happens in the main process and no worker processes touch the cache.
  3. 90% success
    Pre-download the tokenizer before spawning workers: run AutoTokenizer.from_pretrained() once in the main process before creating the DataLoader. This ensures the cache is populated before workers access it.
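
The pre-download workaround can be sketched as a small helper. This is a minimal sketch under stated assumptions, not a definitive implementation: `prepare_worker_loader` is a hypothetical name, and `load` is assumed to behave like AutoTokenizer.from_pretrained, that is, to accept a model name plus keyword arguments including `local_files_only`.

```python
from functools import partial
from typing import Any, Callable


def prepare_worker_loader(load: Callable[..., Any], name: str) -> Callable[[], Any]:
    """Warm the cache in the parent process and return a loader for workers.

    `load` is assumed to have the shape of AutoTokenizer.from_pretrained:
    a model name followed by keyword arguments.
    """
    # Parent process: the only call that is allowed to download and write.
    load(name)
    # Workers: reuse the files already on disk instead of downloading again.
    return partial(load, name, local_files_only=True)
```

In a real script this would likely be called as prepare_worker_loader(AutoTokenizer.from_pretrained, 'bert-base-uncased') before the DataLoader is constructed, with the returned callable invoked inside each worker. Passing local_files_only=True should keep workers from starting their own downloads, so they only read what the parent already wrote.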


Dead Ends

Common approaches that don't work:

  1. Setting TOKENIZERS_PARALLELISM=false in the environment 70% fail

    This only disables tokenizer parallelism within a single process; it does not prevent cache corruption when multiple processes write to the same cache directory.

  2. Manually deleting the cache file and re-running 90% fail

    Deleting the file clears the current corruption but not its cause. Unless the concurrent writes are addressed, the cache will be corrupted again on the next run with multiple workers.

  3. Increasing HF_HUB_DOWNLOAD_TIMEOUT 95% fail

    Timeout is not the issue; the issue is concurrent writes. Increasing timeout does not prevent race conditions.