SIGSEGV
pytorch
system_error
ai_generated
true
RuntimeError: DataLoader worker (pid 12345) received signal 11 (Segmentation fault). Possible causes: shared memory exhaustion or corrupted shared memory files in /dev/shm.
ID: pytorch/dataloader-worker-segfault-shm
Fix Rate: 85%
Confidence: 87%
Evidence: 1
First Seen: 2023-02-14
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| PyTorch 1.10.0 | active | — | — | — |
| PyTorch 2.0.0 | active | — | — | — |
| Linux kernel 5.15 | active | — | — | — |
| Ubuntu 20.04 | active | — | — | — |
| Ubuntu 22.04 | active | — | — | — |
| Docker containers | active | — | — | — |
Root Cause
DataLoader workers pass batches to the main process through shared memory backed by /dev/shm, which avoids copying tensor data between processes. When /dev/shm fills up (for example, because of a high num_workers, large batch sizes, or other processes using it), a worker can crash with a segmentation fault.
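To confirm that exhaustion is the cause, it helps to look at how full /dev/shm actually is. The following is a minimal sketch using only the Python standard library; the helper names `shm_usage` and `format_gib` are illustrative and not part of PyTorch.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the filesystem mounted at `path`."""
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


def format_gib(n_bytes):
    """Format a byte count as GiB with two decimals."""
    return f"{n_bytes / 2**30:.2f} GiB"
```

Calling `shm_usage()` before training and again while the workers are running shows whether free space trends toward zero. On a machine without /dev/shm, pass a different path.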
Official Documentation
https://pytorch.org/docs/stable/data.html#multi-process-data-loading
Workarounds
-
85% success: Reduce the number of DataLoader workers, e.g. DataLoader(dataset, batch_size=64, num_workers=2, ...). Start with num_workers=2 and increase gradually.
-
95% success: Increase the size of /dev/shm by remounting it with a larger size: sudo mount -o remount,size=16G /dev/shm. In Docker, pass the --shm-size=16g flag to docker run instead.
-
80% success: Use multiprocessing_context='spawn' in DataLoader and limit memory pressure by setting pin_memory=False and keeping prefetch_factor at 2 (the default when num_workers > 0): DataLoader(..., multiprocessing_context='spawn', pin_memory=False, prefetch_factor=2)
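The advice to start with few workers and increase gradually can be turned into a rough sizing rule. The sketch below assumes that each worker keeps up to `prefetch_factor` batches in flight (as the PyTorch DataLoader documentation describes) and that those batches dominate shared memory use; the function name, the `headroom` fraction, and the `cap` are hypothetical choices for illustration, not PyTorch settings.

```python
def max_workers_for_budget(free_bytes, batch_bytes, prefetch_factor=2,
                           headroom=0.5, cap=8):
    """Estimate how many workers fit into a share of the free shared memory.

    free_bytes:      free space in /dev/shm, in bytes
    batch_bytes:     approximate size of one collated batch, in bytes
    prefetch_factor: batches each worker loads in advance
    headroom:        fraction of free space the loader may use (assumption)
    cap:             upper limit on the result (assumption)
    """
    per_worker = prefetch_factor * batch_bytes
    if per_worker <= 0:
        raise ValueError("batch_bytes and prefetch_factor must be positive")
    budget = int(free_bytes * headroom)
    # A result of 0 means: load in the main process (num_workers=0).
    return max(0, min(cap, budget // per_worker))
```

For example, with 1 GiB free and 64 MiB batches the estimate is 4 workers, while a 64 MiB /dev/shm (the Docker default size) yields 0, which suggests raising --shm-size before enabling workers at all. The result is only a starting point; actual usage depends on the dataset and collate function.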
Dead Ends
Common approaches that don't work:
-
Increase num_workers to speed up data loading
95% fail
More workers consume more shared memory, exacerbating the exhaustion problem and causing more frequent crashes.
-
Set pin_memory=False in DataLoader
70% fail
While this reduces shared memory usage, it may not be sufficient if /dev/shm is already full from other processes or large batch sizes.
-
Restart the system to clear /dev/shm
60% fail
This is a temporary fix; the problem recurs when the training runs again with the same configuration.
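Because a restart only clears /dev/shm temporarily, it can be more useful to see what is filling it. The following sketch lists the largest regular files in a directory using only the standard library; `largest_files` is an illustrative helper, and whether a given file is stale or still in use has to be judged separately.

```python
import os


def largest_files(directory="/dev/shm", top_n=10):
    """Return up to `top_n` (size_in_bytes, name) pairs, largest first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            try:
                # Only count regular files; skip subdirectories and symlinks.
                if entry.is_file(follow_symlinks=False):
                    size = entry.stat(follow_symlinks=False).st_size
                    entries.append((size, entry.name))
            except OSError:
                # The file may vanish while a worker is exiting; ignore it.
                continue
    entries.sort(reverse=True)
    return entries[:top_n]
```

Running this while training is idle shows whether other processes, or files left behind by crashed workers, are holding space that the DataLoader needs.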