SIGSEGV pytorch system_error ai_generated true

RuntimeError: DataLoader worker (pid 12345) received signal 11 (Segmentation fault). Possible causes: shared memory exhaustion or corrupted shared memory files in /dev/shm.

ID: pytorch/dataloader-worker-segfault-shm

Fix rate: 85%
Confidence: 87%
Evidence count: 1
First seen: 2023-02-14

Version compatibility

Version              Status   Introduced   Deprecated   Notes
PyTorch 1.10.0       active
PyTorch 2.0.0        active
Linux kernel 5.15    active
Ubuntu 20.04         active
Ubuntu 22.04         active
Docker containers    active

Root cause analysis

DataLoader workers use shared memory (via /dev/shm) to pass batches to the main process without copying; when /dev/shm is full (e.g., due to a high num_workers, large batch sizes, or other processes using it), workers crash with a segmentation fault.
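
To confirm that /dev/shm is actually the bottleneck, its usage can be checked before and during training. The following is a minimal sketch using only the Python standard library; the helper names shm_usage and format_gib are illustrative, not part of PyTorch.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the filesystem backing `path`."""
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


def format_gib(n_bytes):
    """Format a byte count as GiB with two decimals."""
    return f"{n_bytes / 2**30:.2f} GiB"


# Example usage on a Linux host or inside a container:
#   total, used, free = shm_usage()
#   print("total:", format_gib(total), "free:", format_gib(free))
```

If the free value falls toward zero while the DataLoader is running, shared memory exhaustion is a plausible cause of the crash.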

generic

Official documentation

https://pytorch.org/docs/stable/data.html#multi-process-data-loading

Solutions

  1. Reduce the number of DataLoader workers. Start with num_workers=2 and increase gradually: DataLoader(dataset, batch_size=64, num_workers=2, ...).
  2. Increase the size of /dev/shm by remounting it with a larger size: sudo mount -o remount,size=16G /dev/shm. Alternatively, in Docker, use the --shm-size=16g flag.
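  3. Use multiprocessing_context='spawn' in DataLoader, keep prefetch_factor at 2 (the effective default when num_workers > 0) so each worker queues at most two batches, and leave pin_memory=False: DataLoader(..., multiprocessing_context='spawn', pin_memory=False, prefetch_factor=2). This lowers shared memory pressure but does not remove it, because worker batches still pass through /dev/shm.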
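
The solutions above can be combined into a rough sizing step. The sketch below rests on a simplifying assumption that is not taken from the PyTorch documentation: that shared memory in flight is at most num_workers × prefetch_factor × batch_size × bytes per sample. The helper names (estimate_inflight_bytes, max_workers_for_budget, loader_kwargs) and the default cap of 16 workers are hypothetical. The returned dictionary uses only DataLoader arguments named in this entry.

```python
def estimate_inflight_bytes(num_workers, prefetch_factor, batch_size, sample_bytes):
    """Rough upper bound (an assumption) on batch data queued by all workers."""
    return num_workers * prefetch_factor * batch_size * sample_bytes


def max_workers_for_budget(shm_budget_bytes, prefetch_factor, batch_size,
                           sample_bytes, upper=16):
    """Largest worker count whose estimated in-flight data fits the budget."""
    per_worker = prefetch_factor * batch_size * sample_bytes
    if per_worker <= 0:
        raise ValueError("prefetch_factor, batch_size and sample_bytes must be positive")
    return max(0, min(upper, shm_budget_bytes // per_worker))


def loader_kwargs(num_workers, batch_size, prefetch_factor=2):
    """Build keyword arguments for torch.utils.data.DataLoader."""
    kwargs = {"batch_size": batch_size, "num_workers": num_workers,
              "pin_memory": False}
    if num_workers > 0:
        # These two options are only accepted with multi-process loading.
        kwargs["multiprocessing_context"] = "spawn"
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


# Example: float32 images of shape 3x224x224, batch size 64, 1 GiB of /dev/shm.
sample_bytes = 3 * 224 * 224 * 4
workers = max_workers_for_budget(2**30, 2, 64, sample_bytes)
config = loader_kwargs(workers, batch_size=64)
# With PyTorch installed: loader = DataLoader(dataset, **config)
```

Under the same assumption, a 64 MiB /dev/shm (Docker's default size) yields zero workers for this example, which is consistent with the advice to raise --shm-size in containers.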

Ineffective attempts

Common but ineffective approaches:

  1. Increase num_workers to speed up data loading (95% failure rate)

    More workers consume more shared memory, exacerbating the exhaustion problem and causing more frequent crashes.

  2. Set pin_memory=False in DataLoader (70% failure rate)

    Pinned memory is page-locked host RAM rather than /dev/shm, so disabling it has little effect when /dev/shm is already full from other processes or large batch sizes.

  3. Restart the system to clear /dev/shm (60% failure rate)

    This is a temporary fix; the problem recurs when the training runs again with the same configuration.
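
Before restarting, it can help to see what is occupying /dev/shm, since leftover files from other processes will fill it again. The following is a minimal sketch using only the standard library; largest_files is an illustrative helper, and it only sees files that are still visible in the directory.

```python
import os


def largest_files(directory="/dev/shm", limit=10):
    """Return up to `limit` (size_in_bytes, name) pairs, largest first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            try:
                if entry.is_file(follow_symlinks=False):
                    size = entry.stat(follow_symlinks=False).st_size
                    entries.append((size, entry.name))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    entries.sort(reverse=True)
    return entries[:limit]


# Example usage on Linux:
#   for size, name in largest_files():
#       print(f"{size / 2**20:10.1f} MiB  {name}")
```

Large files that persist after training has stopped are candidates for cleanup, though whether they are safe to delete depends on which process owns them.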