cudaErrorDriverNotReady (804)
cuda
runtime_error
ai_generated
partial
CUDA error: driver is in a state that is invalid for the requested operation (cudaErrorDriverNotReady)
ID: cuda/cuda-error-driver-unloading
Fix Rate: 78%
Confidence: 85%
Evidence: 1
First Seen: 2024-03-15
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 11.8 | active | — | — | — |
| CUDA 12.0 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
| CUDA 12.2 | active | — | — | — |
| PyTorch 2.1.0 | active | — | — | — |
| PyTorch 2.2.0 | active | — | — | — |
Root Cause
The CUDA driver is in the process of being unloaded or has been partially unloaded due to a race condition in multi-threaded application shutdown, often when cudaDeviceReset() is called while other threads still hold CUDA contexts.
Official Documentation
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
Workarounds
- 85% success: Ensure every thread has finished its CUDA work and released its context before cudaDeviceReset() is called, for example by guarding the reset with a thread-safe reference counter. In Python with PyTorch, drop references first and only then flush the allocator: `import torch; del model; torch.cuda.synchronize(); torch.cuda.empty_cache()`. These calls release cached memory; they do not destroy the CUDA context, and resetting memory statistics has no effect on this error.
- 90% success: Avoid calling cudaDeviceReset() in multi-threaded applications; rely on the driver to clean up contexts at process exit instead. In C++, remove explicit `cudaDeviceReset()` calls from destructors and atexit handlers.
- 75% success: Tolerate a failed reset during shutdown. cudaDeviceReset() reports failure through its cudaError_t return value and does not throw a C++ exception, so a try-catch around it has no effect. Check the return code instead and ignore it if the process is already shutting down: `cudaError_t err = cudaDeviceReset(); if (err != cudaSuccess) { /* ignore during shutdown */ }`
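The "thread-safe reference counter" in the first workaround can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not an API from CUDA or PyTorch: `TeardownGate` and `demo` are hypothetical names, and the `teardown` callback stands in for whatever reset or cleanup call the application would otherwise make at shutdown.

```python
import threading


class TeardownGate:
    """Hypothetical helper: run a teardown callback once, after all users leave."""

    def __init__(self, teardown):
        self._teardown = teardown      # assumption: wraps the real reset/cleanup call
        self._cond = threading.Condition()
        self._users = 0                # threads currently doing GPU work
        self._closed = False           # set once shutdown has been requested

    def __enter__(self):
        with self._cond:
            if self._closed:
                raise RuntimeError("GPU teardown already started")
            self._users += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._cond:
            self._users -= 1
            self._cond.notify_all()    # wake close() if it is waiting
        return False                   # never swallow exceptions from the body

    def close(self, timeout=None):
        """Refuse new users, wait for current ones, then run teardown once."""
        with self._cond:
            if self._closed:
                return False
            self._closed = True
            drained = self._cond.wait_for(lambda: self._users == 0, timeout)
        if drained:
            self._teardown()
        # On timeout the teardown is skipped, leaving cleanup to process exit.
        return drained


def demo():
    events = []
    gate = TeardownGate(lambda: events.append("teardown"))
    started = threading.Barrier(4)     # three workers plus the main thread

    def worker(i):
        with gate:
            started.wait()             # ensure every worker is inside the gate
            events.append(f"work-{i}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
    for t in threads:
        t.start()
    started.wait()
    gate.close()                       # blocks until all three workers have left
    for t in threads:
        t.join()
    return events
```

Each worker wraps its GPU section in `with gate:`, and the shutdown path calls `gate.close()` instead of resetting the device directly. If `close()` times out, the sketch skips the teardown entirely, which matches the second workaround of leaving cleanup to process exit.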
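For the PyTorch variant of the first workaround, the order of operations matters: a tensor or module that is still referenced cannot be freed by `torch.cuda.empty_cache()`. The sketch below shows one possible ordering. The function name `release_cuda_memory` and the `owner`/`attr` parameters are hypothetical, and the function degrades to a no-op when PyTorch or a CUDA device is not present.

```python
import gc


def release_cuda_memory(owner, attr="model"):
    """Drop owner.<attr>, then flush PyTorch's CUDA caching allocator if possible.

    Returns True only when the CUDA cleanup calls were actually made.
    """
    # Drop the reference first; cached blocks still in use cannot be released.
    if hasattr(owner, attr):
        delattr(owner, attr)
    gc.collect()

    try:
        import torch
    except ImportError:
        return False                   # PyTorch is not installed
    if not torch.cuda.is_available():
        return False                   # no usable CUDA device

    torch.cuda.synchronize()           # wait for queued work on the current device
    torch.cuda.empty_cache()           # hand unused cached blocks back to the driver
    return True
```

This only releases memory held by the caching allocator. It does not tear down the CUDA context, so it should be read as tidy shutdown hygiene rather than a guaranteed fix for the race itself.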
Dead Ends
Common approaches that don't work:
- 70% fail: The error occurs during shutdown, so restarting only delays the issue; the race condition persists on subsequent shutdowns.
- 60% fail: Synchronization does not guarantee that all threads have released their contexts; the driver may still be in an invalid state if other threads are mid-operation.
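The second dead end comes down to the difference between waiting for the device and waiting for the threads. A rough sketch of the latter, in plain Python, is shown here; `stop_and_join` and `run_demo` are hypothetical names, and the worker loop merely stands in for real GPU work.

```python
import threading
import time


def stop_and_join(stop_event, threads, timeout=5.0):
    """Signal workers to stop, wait up to `timeout` seconds, return stragglers."""
    stop_event.set()
    deadline = time.monotonic() + timeout
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))
    # Any thread still alive may be mid-operation; cleanup is not safe yet.
    return [t for t in threads if t.is_alive()]


def run_demo():
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            stop.wait(0.01)            # placeholder for one unit of GPU work

    threads = [threading.Thread(target=worker) for _ in range(3)]
    for t in threads:
        t.start()
    return stop_and_join(stop, threads)
```

An empty result means every worker has exited and explicit cleanup can proceed. A non-empty result is a signal to skip the explicit reset and let process exit handle it.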