RuntimeError: DataLoader worker (pid 12345) received signal 11 (Segmentation fault). Possible causes: shared memory exhaustion or corrupted shared memory files in /dev/shm.
ID: pytorch/dataloader-worker-segfault-shm
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| PyTorch 1.10.0 | active | — | — | — |
| PyTorch 2.0.0 | active | — | — | — |
| Linux kernel 5.15 | active | — | — | — |
| Ubuntu 20.04 | active | — | — | — |
| Ubuntu 22.04 | active | — | — | — |
| Docker containers | active | — | — | — |
Root cause analysis
DataLoader workers pass batches to the main process through shared memory backed by /dev/shm. When /dev/shm fills up (for example, because of a high num_workers, large batch sizes, or other processes using the same mount), workers crash with a segmentation fault.
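To make the root cause concrete, the sketch below compares the free space on /dev/shm with a rough estimate of the batch data that can sit in shared memory at once. The helper names (`shm_report`, `estimate_inflight_bytes`) and the example workload are hypothetical, and the estimate assumes every worker holds `prefetch_factor` complete batches at the same time, so treat it as an upper-bound heuristic rather than an exact figure.

```python
import os
import shutil


def shm_report(path="/dev/shm"):
    """Return (total, used, free) bytes for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


def estimate_inflight_bytes(batch_bytes, num_workers, prefetch_factor=2):
    """Rough upper bound on batch bytes waiting in shared memory.

    Assumption: each worker has `prefetch_factor` finished batches queued.
    """
    return batch_bytes * num_workers * prefetch_factor


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    # Hypothetical workload: 64 float32 images of shape 3x224x224 per batch.
    batch_bytes = 64 * 3 * 224 * 224 * 4
    needed = estimate_inflight_bytes(batch_bytes, num_workers=8)
    total, used, free = shm_report()
    print(f"/dev/shm: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")
    print(f"estimated in-flight batch data: {needed / 2**20:.0f} MiB")
```

If the estimate approaches or exceeds the free space, lowering num_workers or the batch size, or enlarging /dev/shm, is likely to be necessary.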
Official documentation
https://pytorch.org/docs/stable/data.html#multi-process-data-loading
Solutions
- Reduce the number of DataLoader workers: DataLoader(dataset, batch_size=64, num_workers=2, ...). Start with num_workers=2 and increase gradually while watching /dev/shm usage.
- Increase the size of /dev/shm by remounting it with a larger limit: sudo mount -o remount,size=16G /dev/shm. In Docker, pass the --shm-size=16g flag instead (the Docker default is only 64 MB).
- Use multiprocessing_context='spawn' in DataLoader and reduce memory pressure by setting pin_memory=False and keeping prefetch_factor at 2 (the default; each worker loads this many batches in advance): DataLoader(..., multiprocessing_context='spawn', pin_memory=False, prefetch_factor=2)
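The DataLoader settings from the list above can be collected in one place, as in the following minimal sketch. The function name `conservative_loader_kwargs` is hypothetical, and the code only builds a dictionary of keyword arguments so that it runs without PyTorch installed; the commented usage shows how it would be passed to `DataLoader`.

```python
def conservative_loader_kwargs(num_workers=2, batch_size=64):
    """Build DataLoader keyword arguments that keep shared-memory use modest."""
    kwargs = {
        "batch_size": batch_size,
        "num_workers": num_workers,
        "pin_memory": False,
    }
    if num_workers > 0:
        # DataLoader rejects multiprocessing_context when num_workers=0, and
        # newer PyTorch releases may also reject an explicit prefetch_factor
        # in that case, so both are set only for multi-process loading.
        kwargs["multiprocessing_context"] = "spawn"
        kwargs["prefetch_factor"] = 2  # batches loaded in advance per worker
    return kwargs


# Usage (requires torch and a dataset object, neither shown here):
#   from torch.utils.data import DataLoader
#   loader = DataLoader(dataset, **conservative_loader_kwargs(num_workers=2))
```

With the 'spawn' start method the dataset must be picklable, and the training script needs an `if __name__ == "__main__":` guard around the code that creates and iterates the loader.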
Ineffective attempts
Common but ineffective approaches:
- Increase num_workers to speed up data loading
  Failure rate: 95%
  More workers consume more shared memory, which worsens the exhaustion and makes crashes more frequent.
- Set pin_memory=False in DataLoader
  Failure rate: 70%
  Pinned memory is page-locked host RAM rather than /dev/shm space, so this does little for shared memory usage and is not sufficient when /dev/shm is already full from other processes or large batch sizes.
- Restart the system to clear /dev/shm
  Failure rate: 60%
  This is only a temporary fix; the problem recurs as soon as training runs again with the same configuration.
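Before resorting to a restart, it can help to see what is actually occupying /dev/shm. The sketch below lists the largest regular files in a directory; the helper name `largest_entries` is hypothetical, and the reported sizes are apparent file sizes, which may overstate real usage for sparse files.

```python
import os


def largest_entries(path="/dev/shm", top=10):
    """Return up to `top` (size_bytes, name) pairs for regular files, largest first."""
    sizes = []
    with os.scandir(path) as entries:
        for entry in entries:
            try:
                if entry.is_file(follow_symlinks=False):
                    size = entry.stat(follow_symlinks=False).st_size
                    sizes.append((size, entry.name))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:top]


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    for size, name in largest_entries():
        print(f"{size / 2**20:8.1f} MiB  {name}")
```

Files that belong to processes no longer running are candidates for manual cleanup, while large files owned by other live processes point to contention that a restart will not fix for long.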