cudaErrorDriverNotReady (804) cuda runtime_error ai_generated partial

CUDA error: driver is in a state that is invalid for the requested operation (cudaErrorDriverNotReady)

ID: cuda/cuda-error-driver-unloading


Fix Rate: 78%

Confidence: 85%

Evidence: 1

First Seen: 2024-03-15

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
CUDA 11.8	active	—	—	—
CUDA 12.0	active	—	—	—
CUDA 12.1	active	—	—	—
CUDA 12.2	active	—	—	—
PyTorch 2.1.0	active	—	—	—
PyTorch 2.2.0	active	—	—	—

Root Cause

The CUDA driver is being unloaded, or has already been partially unloaded, because of a race condition during multi-threaded application shutdown. This often happens when one thread calls cudaDeviceReset() while other threads still hold CUDA contexts.

Official Documentation

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html

Workarounds

85% success Ensure every thread has finished its CUDA work and released its context before cudaDeviceReset() is called, for example by guarding the reset with a thread-safe reference counter so that only the last user performs it. In Python with PyTorch, drop references to GPU objects first, then synchronize and release cached memory: `import torch; del model; torch.cuda.synchronize(); torch.cuda.empty_cache()`. Note that this frees PyTorch's cached allocations but does not by itself destroy the CUDA context.
90% success Avoid calling cudaDeviceReset() in multi-threaded environments; instead, rely on the driver to clean up contexts at process exit. In C++, remove explicit `cudaDeviceReset()` calls from destructors or atexit handlers.
75% success Check the status returned by the reset call and ignore a failure if it occurs during shutdown. The CUDA runtime API reports errors through `cudaError_t` return values rather than C++ exceptions, so a try-catch only helps if a wrapper converts error codes into exceptions: `cudaError_t err = cudaDeviceReset(); if (err != cudaSuccess) { /* ignore during shutdown */ }`
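The thread-safe reference counter mentioned in the first workaround could look roughly like the sketch below. `GpuUseGuard` and `run_workers` are hypothetical names introduced for illustration, not part of CUDA, and the teardown callback stands in for the single `cudaDeviceReset()` call so the example compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper, not part of CUDA: counts the threads that still use
// the GPU and runs the teardown callback when the last user leaves.
class GpuUseGuard {
public:
    explicit GpuUseGuard(std::function<void()> teardown)
        : teardown_(std::move(teardown)) {}

    // Call before a thread issues any CUDA work.
    void acquire() {
        std::lock_guard<std::mutex> lock(mu_);
        assert(!torn_down_ && "acquire() after teardown");
        ++users_;
    }

    // Call once a thread has finished all of its CUDA work.
    void release() {
        std::lock_guard<std::mutex> lock(mu_);
        assert(users_ > 0 && "release() without matching acquire()");
        if (--users_ == 0) {
            torn_down_ = true;
            teardown_();  // a real program would call cudaDeviceReset() here
        }
    }

private:
    std::mutex mu_;
    int users_ = 0;
    bool torn_down_ = false;
    std::function<void()> teardown_;
};

// Starts `workers` threads that each hold the guard while "working" and
// returns how many times the teardown callback ran.
int run_workers(int workers) {
    int teardown_calls = 0;
    GpuUseGuard guard([&teardown_calls] { ++teardown_calls; });

    guard.acquire();  // the main thread holds a reference for the whole run
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) {
        pool.emplace_back([&guard] {
            guard.acquire();
            // placeholder for per-thread CUDA work
            guard.release();
        });
    }
    for (auto& t : pool) t.join();
    guard.release();  // last user: the teardown callback runs here
    return teardown_calls;
}
```

The main thread takes the first reference and gives up the last one, so the count cannot reach zero while a worker is still starting up. Whether a reset is needed at all is a separate question; the second workaround suggests it usually is not.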

Dead Ends

Common approaches that don't work:

70% fail
Restarting the application: the error occurs during shutdown, so a restart only delays it, and the race condition recurs on the next shutdown.
60% fail
Adding a synchronization call before the reset: synchronization does not guarantee that all threads have released their contexts, and the driver may still be in an invalid state if other threads are mid-operation.
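Because synchronization alone does not stop other threads from issuing work, one alternative is to signal every worker to stop and join it before any teardown runs. The sketch below illustrates that ordering under stated assumptions: `WorkerPool` and `ordered_shutdown` are hypothetical names, and the loop body is a placeholder for real CUDA calls.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical worker pool: each thread loops until it is asked to stop.
class WorkerPool {
public:
    explicit WorkerPool(int n) {
        for (int i = 0; i < n; ++i) {
            threads_.emplace_back([this] {
                while (!stop_.load(std::memory_order_acquire)) {
                    // placeholder for per-thread CUDA work
                    iterations_.fetch_add(1, std::memory_order_relaxed);
                    std::this_thread::yield();
                }
            });
        }
    }

    // Signal all workers, then wait until every one has left its loop.
    void shutdown() {
        stop_.store(true, std::memory_order_release);
        for (auto& t : threads_) {
            if (t.joinable()) t.join();
        }
    }

    // True once every worker thread has been joined.
    bool all_stopped() const {
        for (const auto& t : threads_) {
            if (t.joinable()) return false;
        }
        return true;
    }

    ~WorkerPool() { shutdown(); }

private:
    std::atomic<bool> stop_{false};
    std::atomic<long> iterations_{0};
    std::vector<std::thread> threads_;
};

// Returns true if all workers had stopped at the point where teardown
// would run.
bool ordered_shutdown(int workers) {
    WorkerPool pool(workers);
    pool.shutdown();
    bool safe = pool.all_stopped();
    // any explicit teardown (for example a single reset call) would go here,
    // after the join, never in a static destructor or atexit handler
    return safe;
}
```

Joining gives a stronger guarantee than a device-level synchronization call: after `shutdown()` returns, no worker can be in the middle of a CUDA operation.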