SIGSEGV pytorch system_error ai_generated true

RuntimeError: DataLoader worker (pid 12345) received signal 11 (Segmentation fault). Possible causes: shared memory exhaustion or corrupted shared memory files in /dev/shm.

ID: pytorch/dataloader-worker-segfault-shm

Fix rate: 85%

Confidence: 87%

Evidence count: 1

First seen: 2023-02-14

Version compatibility

Version	Status	Introduced	Deprecated	Notes
PyTorch 1.10.0	active	—	—	—
PyTorch 2.0.0	active	—	—	—
Linux kernel 5.15	active	—	—	—
Ubuntu 20.04	active	—	—	—
Ubuntu 22.04	active	—	—	—
Docker containers	active	—	—	—

Root cause analysis

DataLoader workers use shared memory (via /dev/shm) for zero-copy data transfer to the main process. When /dev/shm fills up (for example, because of a high num_workers, large batch sizes, or other processes using it), workers crash with a segmentation fault.
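
To make the exhaustion condition concrete, the following is a rough back-of-the-envelope estimate in Python. It assumes that each prefetched batch occupies roughly its raw tensor size in /dev/shm and that num_workers × prefetch_factor batches are in flight at once. The function name and the example workload are illustrative and not part of PyTorch.

```python
def estimate_shm_bytes(num_workers, prefetch_factor, batch_size, bytes_per_sample):
    """Rough upper bound on shared memory held by prefetched batches.

    Assumption: every prefetched batch sits in shared memory at roughly its
    raw size, and num_workers * prefetch_factor batches are in flight.
    """
    batches_in_flight = num_workers * prefetch_factor
    return batches_in_flight * batch_size * bytes_per_sample


# Hypothetical workload: batches of 64 float32 images of shape 3x224x224.
bytes_per_sample = 3 * 224 * 224 * 4
needed = estimate_shm_bytes(num_workers=8, prefetch_factor=2,
                            batch_size=64, bytes_per_sample=bytes_per_sample)
print(f"~{needed / 2**20:.0f} MiB of /dev/shm for prefetched batches")
```

Under these assumptions the example workload needs several hundred MiB, which is why a small /dev/shm can fill up quickly. Real usage also depends on the dataset's collate function and on other processes using /dev/shm.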

generic

Official documentation

https://pytorch.org/docs/stable/data.html#multi-process-data-loading

Solutions

Reduce the number of DataLoader workers, for example DataLoader(dataset, batch_size=64, num_workers=4, ...). If crashes persist, start from num_workers=2 and increase gradually.
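
The "start low and increase gradually" advice can be sketched as a small search loop. This is only an illustration: try_workers is a hypothetical caller-supplied callable that would build a DataLoader with the given num_workers, pull a few batches, and let the RuntimeError from a crashed worker propagate. A stand-in is used here so the sketch runs without PyTorch.

```python
def find_stable_num_workers(try_workers, start=2, limit=8):
    """Return the largest worker count in [start, limit] that ran cleanly.

    try_workers(n) is a hypothetical callable supplied by the caller. It
    should raise RuntimeError when a worker crashes. Returns 0 if even
    `start` workers fail.
    """
    best = 0
    for n in range(start, limit + 1):
        try:
            try_workers(n)
        except RuntimeError:
            break  # a worker died; keep the last count that worked
        best = n
    return best


# Stand-in for a real trial run: pretend anything above 4 workers crashes.
def fake_trial(n):
    if n > 4:
        raise RuntimeError("DataLoader worker received signal 11")


print(find_stable_num_workers(fake_trial))
```

In real use, try_workers might iterate over a handful of batches from DataLoader(dataset, batch_size=64, num_workers=n). A short trial may not reproduce exhaustion that only builds up later in training, so treat the result as a starting point.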

Increase the size of /dev/shm by remounting it with a larger limit: sudo mount -o remount,size=16G /dev/shm. In Docker, where /dev/shm defaults to 64 MB, pass the --shm-size=16g flag to docker run instead.
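
Before and after resizing, it can help to check how much space /dev/shm actually has. A minimal sketch using only the Python standard library follows; the helper name shm_usage is illustrative.

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the filesystem at `path`,
    or None if the path does not exist (e.g. on non-Linux systems)."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


stats = shm_usage()
if stats is None:
    print("/dev/shm not found on this system")
else:
    total, used, free = stats
    print(f"/dev/shm: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")
```

Running this inside the training container or job shows whether the new size took effect and how much headroom is left while training runs.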

Use multiprocessing_context='spawn' in DataLoader and limit memory pressure by setting pin_memory=False and prefetch_factor=2, which keeps each worker to two batches in flight: DataLoader(..., multiprocessing_context='spawn', pin_memory=False, prefetch_factor=2)
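
One way to keep these settings consistent is to build the keyword arguments in a small helper. This is a sketch rather than a PyTorch API: the helper name is made up, and the worker-only options are added only when num_workers > 0 because DataLoader can raise a ValueError for them in single-process mode.

```python
def conservative_loader_kwargs(num_workers=2):
    """Build DataLoader keyword arguments that keep memory pressure modest.

    multiprocessing_context and prefetch_factor are only included when
    num_workers > 0, since they apply to multi-process loading only.
    """
    kwargs = {"num_workers": num_workers, "pin_memory": False}
    if num_workers > 0:
        kwargs["multiprocessing_context"] = "spawn"
        kwargs["prefetch_factor"] = 2
    return kwargs


print(conservative_loader_kwargs(2))

# Usage sketch (requires torch; `dataset` is a placeholder):
#     from torch.utils.data import DataLoader
#     if __name__ == "__main__":  # spawned workers re-import the main module
#         loader = DataLoader(dataset, batch_size=64,
#                             **conservative_loader_kwargs(2))
```

With the 'spawn' start method the dataset and any collate function must be picklable, and the training entry point should sit under the `if __name__ == "__main__":` guard.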

Ineffective attempts

Common but ineffective approaches:

Increase num_workers to speed up data loading (95% failure rate)
More workers consume more shared memory, exacerbating the exhaustion problem and causing more frequent crashes.
Set pin_memory=False in DataLoader (70% failure rate)
While this reduces shared memory usage, it may not be sufficient if /dev/shm is already full from other processes or large batch sizes.
Restart the system to clear /dev/shm (60% failure rate)
This is a temporary fix; the problem recurs when the training runs again with the same configuration.