cudaErrorDriverNotReady (804) cuda runtime_error ai_generated partial

CUDA error: driver is in a state that is invalid for the requested operation (cudaErrorDriverNotReady)

ID: cuda/cuda-error-driver-unloading

Fix rate: 78%
Confidence: 85%
Evidence count: 1
First seen: 2024-03-15

Version compatibility

Version         Status    Introduced    Deprecated    Notes
CUDA 11.8       active
CUDA 12.0       active
CUDA 12.1       active
CUDA 12.2       active
PyTorch 2.1.0   active
PyTorch 2.2.0   active

Root cause analysis

The CUDA driver is being unloaded, or has already been partially unloaded, because of a race condition during multi-threaded application shutdown. This typically happens when cudaDeviceReset() is called while other threads still hold CUDA contexts.

generic

Official documentation

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html

Solutions

  1. Ensure every thread has finished its CUDA work and released its resources before cudaDeviceReset() is called, for example by guarding the reset with a thread-safe reference counter. In Python with PyTorch, join any worker threads first, then drop references to GPU objects before emptying the allocator cache: `import torch; torch.cuda.synchronize(); del model; torch.cuda.empty_cache()`
  2. Avoid calling cudaDeviceReset() in multi-threaded environments; instead, rely on the driver to clean up contexts at process exit. In C++, remove explicit `cudaDeviceReset()` calls from destructors or atexit handlers.
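  3. Check the return value of the reset call and ignore a failure that occurs during shutdown. cudaDeviceReset() reports errors through its `cudaError_t` return value and does not throw C++ exceptions, so a try-catch block has no effect: `cudaError_t err = cudaDeviceReset(); if (err != cudaSuccess) { /* ignore during shutdown */ }`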
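
The reference-counting idea in solution 1 could look roughly like the following. This is a minimal sketch, not a definitive implementation: `GpuShutdownGuard` and `run_demo` are hypothetical names that are not part of CUDA, and the reset is passed in as a callback so the example compiles without the CUDA toolkit. In a real program the callback would wrap `cudaDeviceReset()` and check its return value.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not part of CUDA): counts threads that are still
// using the GPU and runs the reset callback only after the count is zero.
class GpuShutdownGuard {
public:
    // Called by a worker before it touches the GPU. Returns false once
    // shutdown has begun, in which case the worker must skip its GPU work.
    bool acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (shutting_down_) return false;
        ++active_;
        return true;
    }

    // Called by a worker after it has finished all of its GPU work.
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(active_ > 0);
            --active_;
        }
        idle_.notify_all();
    }

    // Refuses new workers, waits until no worker holds the guard, then
    // runs `reset` (assumed here to stand in for cudaDeviceReset()).
    void shutdown(const std::function<void()>& reset) {
        std::unique_lock<std::mutex> lock(mutex_);
        shutting_down_ = true;
        idle_.wait(lock, [this] { return active_ == 0; });
        reset();
    }

private:
    std::mutex mutex_;
    std::condition_variable idle_;
    int active_ = 0;
    bool shutting_down_ = false;
};

// Demo with a stand-in reset: returns how many workers were still inside
// their simulated GPU section at the moment the reset callback ran.
int run_demo(int num_workers) {
    GpuShutdownGuard guard;
    std::atomic<int> in_flight{0};
    int seen_at_reset = -1;

    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i) {
        workers.emplace_back([&guard, &in_flight] {
            if (!guard.acquire()) return;  // shutdown already started
            ++in_flight;
            // Simulated GPU work.
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
            --in_flight;
            guard.release();
        });
    }

    guard.shutdown([&] { seen_at_reset = in_flight.load(); });
    for (auto& t : workers) t.join();
    return seen_at_reset;
}
```

Workers that call `acquire()` after shutdown has started are turned away instead of racing with the reset, which is the property that a plain device synchronization does not provide.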

Ineffective attempts

Common but ineffective approaches:

  1. 70% failure rate

    The error occurs during shutdown, so restarting only delays the issue; the race condition persists on subsequent shutdowns.

  2. 60% failure rate

    Synchronization does not guarantee that all threads have released their contexts; the driver may still be in an invalid state if other threads are mid-operation.