# RuntimeError: DataLoader 工作进程（pid 12345）收到信号 11（段错误）。可能原因：共享内存耗尽或 /dev/shm 中的共享内存文件损坏。

- **ID:** `pytorch/dataloader-worker-segfault-shm`
- **领域:** pytorch
- **类别:** system_error
- **错误码:** `SIGSEGV`
- **验证级别:** ai_generated
- **修复率:** 85%

## 根因

DataLoader 工作进程使用共享内存（通过 /dev/shm）进行零拷贝数据传输；当 /dev/shm 已满（例如，由于大量工作进程、大批量大小或其他进程）时，工作进程会因段错误而崩溃。

## 版本兼容性

| 版本 | 状态 | 引入 | 弃用 |
|------|------|------|------|
| PyTorch 1.10.0 | active | — | — |
| PyTorch 2.0.0 | active | — | — |
| Linux kernel 5.15 | active | — | — |
| Ubuntu 20.04 | active | — | — |
| Ubuntu 22.04 | active | — | — |
| Docker containers | active | — | — |

## 解决方案

1. ```
   Reduce the number of DataLoader workers: DataLoader(dataset, batch_size=64, num_workers=4, ...). Start with num_workers=2 and increase gradually.
   ```
2. ```
   Increase the size of /dev/shm by remounting with a larger size: sudo mount -o remount,size=16G /dev/shm. Alternatively, in Docker, use --shm-size=16g flag.
   ```
3. ```
   Use multiprocessing_context='spawn' in DataLoader and avoid shared memory by setting pin_memory=False and prefetch_factor=2: DataLoader(..., multiprocessing_context='spawn', pin_memory=False, prefetch_factor=2)
   ```

## 无效尝试

- **Increase num_workers to speed up data loading** — More workers consume more shared memory, exacerbating the exhaustion problem and causing more frequent crashes. (95% 失败率)
- **Set pin_memory=False in DataLoader** — While this reduces shared memory usage, it may not be sufficient if /dev/shm is already full from other processes or large batch sizes. (70% 失败率)
- **Restart the system to clear /dev/shm** — This is a temporary fix; the problem recurs when the training runs again with the same configuration. (60% 失败率)