SIGSEGV pytorch system_error ai_generated true

RuntimeError: DataLoader worker (pid 12345) received signal 11 (Segmentation fault). Possible causes: shared memory exhaustion or corrupted shared memory files in /dev/shm.

ID: pytorch/dataloader-worker-segfault-shm

Also available as: JSON · Markdown · 中文
85%Fix Rate
87%Confidence
1Evidence
2023-02-14First Seen

Version Compatibility

VersionStatusIntroducedDeprecatedNotes
PyTorch 1.10.0 active
PyTorch 2.0.0 active
Linux kernel 5.15 active
Ubuntu 20.04 active
Ubuntu 22.04 active
Docker containers active

Root Cause

DataLoader workers use shared memory (via /dev/shm) for zero-copy data transfer; when /dev/shm is full (e.g., due to large num_workers, large batch sizes, or other processes), workers crash with a segmentation fault.

generic

中文

DataLoader 工作进程使用共享内存(通过 /dev/shm)进行零拷贝数据传输;当 /dev/shm 已满(例如,由于大量工作进程、大批量大小或其他进程)时,工作进程会因段错误而崩溃。

Official Documentation

https://pytorch.org/docs/stable/data.html#multi-process-data-loading

Workarounds

  1. 85% success Reduce the number of DataLoader workers: DataLoader(dataset, batch_size=64, num_workers=4, ...). Start with num_workers=2 and increase gradually.
    Reduce the number of DataLoader workers: DataLoader(dataset, batch_size=64, num_workers=4, ...). Start with num_workers=2 and increase gradually.
  2. 95% success Increase the size of /dev/shm by remounting with a larger size: sudo mount -o remount,size=16G /dev/shm. Alternatively, in Docker, use --shm-size=16g flag.
    Increase the size of /dev/shm by remounting with a larger size: sudo mount -o remount,size=16G /dev/shm. Alternatively, in Docker, use --shm-size=16g flag.
  3. 80% success Use multiprocessing_context='spawn' in DataLoader and avoid shared memory by setting pin_memory=False and prefetch_factor=2: DataLoader(..., multiprocessing_context='spawn', pin_memory=False, prefetch_factor=2)
    Use multiprocessing_context='spawn' in DataLoader and avoid shared memory by setting pin_memory=False and prefetch_factor=2: DataLoader(..., multiprocessing_context='spawn', pin_memory=False, prefetch_factor=2)

中文步骤

  1. Reduce the number of DataLoader workers: DataLoader(dataset, batch_size=64, num_workers=4, ...). Start with num_workers=2 and increase gradually.
  2. Increase the size of /dev/shm by remounting with a larger size: sudo mount -o remount,size=16G /dev/shm. Alternatively, in Docker, use --shm-size=16g flag.
  3. Use multiprocessing_context='spawn' in DataLoader and avoid shared memory by setting pin_memory=False and prefetch_factor=2: DataLoader(..., multiprocessing_context='spawn', pin_memory=False, prefetch_factor=2)

Dead Ends

Common approaches that don't work:

  1. Increase num_workers to speed up data loading 95% fail

    More workers consume more shared memory, exacerbating the exhaustion problem and causing more frequent crashes.

  2. Set pin_memory=False in DataLoader 70% fail

    While this reduces shared memory usage, it may not be sufficient if /dev/shm is already full from other processes or large batch sizes.

  3. Restart the system to clear /dev/shm 60% fail

    This is a temporary fix; the problem recurs when the training runs again with the same configuration.