cudaErrorDriverNotReady (804)
cuda
runtime_error
ai_generated
partial
CUDA error: driver is in a state that is invalid for the requested operation (cudaErrorDriverNotReady)
ID: cuda/cuda-error-driver-unloading
Fix rate: 78%
Confidence: 85%
Evidence count: 1
First seen: 2024-03-15
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 11.8 | active | — | — | — |
| CUDA 12.0 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
| CUDA 12.2 | active | — | — | — |
| PyTorch 2.1.0 | active | — | — | — |
| PyTorch 2.2.0 | active | — | — | — |
Root cause analysis
The CUDA driver is being unloaded, or has already been partially unloaded, because of a race condition during shutdown of a multi-threaded application. This typically happens when cudaDeviceReset() is called while other threads still hold CUDA contexts.
Official documentation
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
Solutions
-
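Ensure every thread has finished using CUDA and released its contexts before cudaDeviceReset() is called, for example by guarding the reset with a thread-safe reference counter. In Python with PyTorch, drop references to GPU objects and drain pending work before exit: `import torch; del model; torch.cuda.synchronize(); torch.cuda.empty_cache()`. Calls such as `torch.cuda.reset_peak_memory_stats()` only reset statistics counters and do not release anything.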
-
Avoid calling cudaDeviceReset() in multi-threaded environments; instead, rely on the driver to clean up contexts at process exit. In C++, remove explicit `cudaDeviceReset()` calls from destructors or atexit handlers.
-
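Check the return code of the reset call and ignore the error if it occurs during shutdown. The CUDA runtime API reports failures through `cudaError_t` return values rather than C++ exceptions, so a try-catch block will not intercept them: `cudaError_t err = cudaDeviceReset(); if (err != cudaSuccess) { /* ignore during shutdown */ }`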
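The thread-safe reference counter mentioned in the first solution could be sketched roughly as follows. This is a minimal illustration, not a library API: `DeviceGuard` and its methods are hypothetical names, and `teardown` is a placeholder callback for whatever final cleanup the application performs. The sketch uses no GPU calls so that it runs anywhere.

```python
import threading


class DeviceGuard:
    """Counts active GPU users and runs teardown once the last one leaves.

    Hypothetical helper for illustration; `teardown` stands in for the
    application's final cleanup (for example, a device reset).
    """

    def __init__(self, teardown):
        self._teardown = teardown
        self._lock = threading.Lock()
        self._users = 0
        self._closing = False
        self._done = False

    def acquire(self):
        """Register a user; refuse new users once shutdown was requested."""
        with self._lock:
            if self._closing:
                return False
            self._users += 1
            return True

    def release(self):
        """Unregister a user; run teardown if shutdown is pending and nobody is left."""
        with self._lock:
            self._users -= 1
            run = self._closing and self._users == 0 and not self._done
            if run:
                self._done = True
        if run:
            self._teardown()  # called outside the lock, at most once

    def request_shutdown(self):
        """Mark the guard as closing; tear down immediately only if idle."""
        with self._lock:
            self._closing = True
            run = self._users == 0 and not self._done
            if run:
                self._done = True
        if run:
            self._teardown()
```

Each worker would call `acquire()` before touching the device and `release()` when finished, while the shutdown path calls `request_shutdown()` instead of resetting directly. In a PyTorch program, `teardown` might call `torch.cuda.synchronize()` followed by `torch.cuda.empty_cache()`.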
Ineffective attempts
Common but ineffective approaches:
-
70% failure rate
The error occurs during shutdown, so restarting only delays the issue; the race condition persists on subsequent shutdowns.
-
60% failure rate
Synchronization does not guarantee that all threads have released their contexts; the driver may still be in an invalid state if other threads are mid-operation.
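Because synchronizing the device does not prove that other threads have stopped, one alternative is to wait for the threads themselves before any teardown runs. The sketch below assumes a simple worker-pool layout; `run_workers`, `process`, and `teardown` are hypothetical names, with `process` standing in for per-item GPU work and `teardown` for the final cleanup step.

```python
import queue
import threading


def run_workers(num_workers, work_items, process, teardown):
    """Process items on worker threads, then tear down after all threads exit."""
    tasks = queue.Queue()
    for item in work_items:
        tasks.put(item)

    def worker():
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left: this thread stops touching the device
            process(item)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker thread to exit, not just for queued work
    teardown()  # runs once, on the calling thread, after all joins
```

Joining the threads gives a stronger guarantee than a device-level synchronize, since no worker can be mid-operation once `join()` has returned for all of them.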