CUDA_ERROR_NOT_INITIALIZED pytorch config_error ai_generated true

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

ID: pytorch/dataset-worker-fork-cuda

Fix rate: 90%
Confidence: 90%
Evidence count: 1
First seen: 2023-01-10

Version compatibility

Version          Status   Introduced   Deprecated   Notes
PyTorch 1.13.0   active
PyTorch 2.0.0    active
CUDA 11.7        active
CUDA 12.1        active
Ubuntu 20.04     active
Ubuntu 22.04     active

Root cause analysis

On Linux, the default 'fork' start method creates child processes that inherit a copy of the parent's CUDA context. CUDA does not support re-initialization in a forked process, so once the parent has initialized CUDA, any DataLoader worker that tries to use CUDA raises this error.
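
The condition described above can be checked from the parent process before any workers are created. The following is a minimal sketch and not part of PyTorch: `fork_after_cuda_init` and `check_parent_process` are hypothetical helper names, and the check assumes the DataLoader uses Python's default start method (no `multiprocessing_context` argument is passed).

```python
import multiprocessing as mp


def fork_after_cuda_init(start_method, cuda_initialized):
    """Return True when workers would be forked from a CUDA-initialized parent.

    Hypothetical helper: it only encodes the condition, it does not fix it.
    """
    return bool(start_method == "fork" and cuda_initialized)


def check_parent_process():
    """Print a diagnostic about the current process (hypothetical helper)."""
    # torch is imported here so the pure helper above has no dependencies.
    import torch

    # get_start_method() also fixes the default context as a side effect,
    # so a later mp.set_start_method(...) call needs force=True.
    method = mp.get_start_method()
    initialized = torch.cuda.is_initialized()

    if fork_after_cuda_init(method, initialized):
        print("CUDA is initialized and workers will be forked: "
              "any CUDA call inside a worker is expected to fail.")
    else:
        print(f"start method: {method!r}, CUDA initialized: {initialized}")
```

Calling `check_parent_process()` just before constructing the DataLoader shows whether the parent is already in the state that leads to the RuntimeError.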

generic

Official documentation

https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing

Solutions

  1. Set the start method to 'spawn' at the beginning of the script: import multiprocessing as mp; mp.set_start_method('spawn', force=True). Spawned workers start in a fresh Python interpreter and do not inherit the parent's CUDA context.
  2. Keep CUDA out of the DataLoader workers: have the Dataset and collate_fn return CPU tensors, use pin_memory=True with num_workers>0, and move each batch to the GPU in the main process inside the training loop.
  3. Wrap the training code in an if __name__ == '__main__': block and call mp.set_start_method('spawn') before any CUDA calls. The guard is required because spawned child processes re-import the main module. Example: if __name__ == '__main__': mp.set_start_method('spawn'); train()
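
The solutions above can be sketched together as follows. This is an illustration under stated assumptions, not a definitive implementation: `loader_kwargs` and `train` are hypothetical names, the dataset is synthetic, and the model is a placeholder. Instead of the global `mp.set_start_method` call, the sketch requests spawned workers per loader through the DataLoader's `multiprocessing_context` argument, which serves the same purpose.

```python
def loader_kwargs(num_workers):
    """Build DataLoader keyword arguments that request spawned workers.

    Hypothetical helper, not a PyTorch API.
    """
    kwargs = {"num_workers": num_workers, "pin_memory": True}
    if num_workers > 0:
        # DataLoader rejects multiprocessing_context when num_workers == 0.
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs


def train(num_workers=2):
    # Imported here so loader_kwargs() stays usable without a PyTorch install.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Synthetic CPU-only data: the workers never touch CUDA.
    dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
    loader = DataLoader(dataset, batch_size=16, **loader_kwargs(num_workers))

    model = torch.nn.Linear(8, 2).to(device)
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for features, labels in loader:
        # Batches are moved to the GPU here, in the main process.
        features = features.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
```

`train()` should still be called from inside the `if __name__ == '__main__':` guard described in item 3, since spawned workers re-import the main module and would otherwise re-run the training call.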

Failed attempts

Common approaches that do not work:

  1. Set num_workers=0 in DataLoader to disable multiprocessing (60% failure rate)

    This eliminates parallelism entirely, significantly slowing down data loading, especially for large datasets or heavy preprocessing.

  2. Move CUDA operations to after DataLoader workers are created (95% failure rate)

    The error occurs because workers inherit the CUDA context from the parent; moving operations does not change the inheritance problem.

  3. Use torch.cuda.set_device(0) inside worker_init_fn (90% failure rate)

    Setting the device after fork does not resolve the CUDA re-initialization issue; the context is already corrupted.