SIGSEGV pytorch system_error ai_generated true

RuntimeError: DataLoader worker (pid 12345) received signal 11 (Segmentation fault). Possible causes: shared memory exhaustion or corrupted shared memory files in /dev/shm.

ID: pytorch/dataloader-worker-segfault-shm


Fix Rate: 85%

Confidence: 87%

Evidence: 1

First Seen: 2023-02-14

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
PyTorch 1.10.0	active	—	—	—
PyTorch 2.0.0	active	—	—	—
Linux kernel 5.15	active	—	—	—
Ubuntu 20.04	active	—	—	—
Ubuntu 22.04	active	—	—	—
Docker containers	active	—	—	—

Root Cause

DataLoader workers pass batches to the main process through shared memory backed by /dev/shm, which avoids copying tensor data between processes. When /dev/shm fills up (for example, because of a high num_workers, large batch sizes, or other processes using the same mount), workers crash with a segmentation fault.
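
To confirm that /dev/shm is the bottleneck, it can help to check how full it is before and during training. The following is a minimal sketch using only the Python standard library; the function name `shm_usage` and its default path are illustrative and not part of PyTorch.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) for the filesystem at `path`, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib
```

Calling `shm_usage()` on a Linux host returns the three values for /dev/shm. A free value close to zero while workers are running would be consistent with the root cause described above.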



Official Documentation

https://pytorch.org/docs/stable/data.html#multi-process-data-loading

Workarounds

85% success Reduce the number of DataLoader workers. Start with num_workers=2 and increase gradually (for example, to 4) as long as the workers remain stable.
```
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=64, num_workers=2)  # dataset is your own Dataset; raise num_workers gradually
```
95% success Increase the size of /dev/shm by remounting it with a larger limit. In Docker, pass the --shm-size flag when starting the container instead.
```
sudo mount -o remount,size=16G /dev/shm
docker run --shm-size=16g ...
```
80% success Start workers with multiprocessing_context='spawn', and reduce memory pressure by setting pin_memory=False and keeping prefetch_factor=2.
```
# multiprocessing_context and prefetch_factor both require num_workers > 0
loader = DataLoader(dataset, num_workers=2, multiprocessing_context='spawn', pin_memory=False, prefetch_factor=2)
```
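
The worker-count and prefetch settings above both come down to keeping the batches that workers hold in flight within the space available in /dev/shm. The sketch below is a rough sizing aid, not a PyTorch API. It assumes that about num_workers × prefetch_factor batches live in shared memory at once and that each batch occupies `batch_bytes`. The helper names `max_workers_for_shm` and `loader_kwargs`, the 50% headroom, and the cap of 8 workers are all illustrative choices.

```python
def max_workers_for_shm(free_bytes, batch_bytes, prefetch_factor=2,
                        headroom=0.5, cap=8):
    """Largest worker count whose in-flight batches fit in the shm budget.

    Assumes about num_workers * prefetch_factor batches sit in shared
    memory at once (a heuristic, not an exact accounting).
    """
    if batch_bytes <= 0 or prefetch_factor <= 0:
        raise ValueError("batch_bytes and prefetch_factor must be positive")
    budget = free_bytes * headroom  # leave room for other users of /dev/shm
    workers = int(budget // (batch_bytes * prefetch_factor))
    return max(0, min(cap, workers))


def loader_kwargs(free_bytes, batch_bytes, batch_size=64):
    """Build keyword arguments intended for torch.utils.data.DataLoader."""
    workers = max_workers_for_shm(free_bytes, batch_bytes)
    kwargs = {"batch_size": batch_size, "num_workers": workers,
              "pin_memory": False}
    if workers > 0:
        # These two options are only accepted when worker processes are used.
        kwargs["multiprocessing_context"] = "spawn"
        kwargs["prefetch_factor"] = 2
    return kwargs
```

The result could be passed as `DataLoader(dataset, **loader_kwargs(free, batch_bytes))`, with `free` taken from `shutil.disk_usage("/dev/shm").free`. A result of `num_workers=0` means the budget cannot hold even one worker's prefetched batches, which suggests enlarging /dev/shm first.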


Dead Ends

Common approaches that don't work:

Increase num_workers to speed up data loading (95% fail)
Each additional worker holds its own prefetched batches in shared memory, so this worsens the exhaustion and makes crashes more frequent.
Set pin_memory=False in DataLoader (70% fail)
This may lower memory usage, but it is not sufficient if /dev/shm is already filled by other processes or by large batches.
Restart the system to clear /dev/shm (60% fail)
This only helps temporarily: the problem recurs as soon as training runs again with the same configuration.
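
Before restarting, it may be worth checking what is occupying /dev/shm, since files left behind by crashed runs or other processes can be inspected individually. The sketch below lists the largest entries. The function name `largest_shm_files` is illustrative, and it only reports files without deleting anything.

```python
import os


def largest_shm_files(path="/dev/shm", top=10):
    """Return up to `top` (size_bytes, name) pairs, largest first."""
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            try:
                if entry.is_file(follow_symlinks=False):
                    size = entry.stat(follow_symlinks=False).st_size
                    entries.append((size, entry.name))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    entries.sort(reverse=True)
    return entries[:top]
```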