cudaErrorDriverNotReady (804)
Tags: cuda, runtime_error, ai_generated, partial

CUDA error: driver is in a state that is invalid for the requested operation (cudaErrorDriverNotReady)

ID: cuda/cuda-error-driver-unloading

Fix Rate: 78%
Confidence: 85%
Evidence: 1
First Seen: 2024-03-15

Version Compatibility

Version          Status    Introduced    Deprecated    Notes
CUDA 11.8        active
CUDA 12.0        active
CUDA 12.1        active
CUDA 12.2        active
PyTorch 2.1.0    active
PyTorch 2.2.0    active

Root Cause

The CUDA driver is being unloaded, or has already been partially unloaded, because of a race condition during multi-threaded application shutdown. This often happens when cudaDeviceReset() is called while other threads still hold CUDA contexts.

Official Documentation

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html

Workarounds

  1. 85% success Ensure every thread has finished using CUDA before cudaDeviceReset() is called, for example by tracking active users with a thread-safe reference counter. In Python with PyTorch, drop references and wait for pending work before shutdown: `import torch; del model; torch.cuda.synchronize(); torch.cuda.empty_cache()`. These calls release cached memory but do not destroy the CUDA context themselves.
  2. 90% success Avoid calling cudaDeviceReset() in multi-threaded environments; instead, rely on the driver to clean up contexts at process exit. In C++, remove explicit `cudaDeviceReset()` calls from destructors and atexit handlers.
  3. 75% success Check the status returned by the reset call and ignore a failure if it occurs during shutdown. cudaDeviceReset() reports failure through its cudaError_t return value and does not throw, so a try-catch will not intercept it: `cudaError_t err = cudaDeviceReset(); if (err != cudaSuccess) { /* ignore during shutdown */ }`
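
The thread-safe reference counter mentioned in the first workaround could look roughly like the sketch below. It is a minimal illustration in plain Python, not an API from CUDA or PyTorch: the class name `DeviceUsers` and the `teardown` callback are hypothetical, and `teardown` stands in for whatever actually performs the reset in a real program.

```python
import threading


class DeviceUsers:
    """Counts threads that still use the device and runs a teardown
    callback exactly once, after shutdown was requested and the last
    user has released the device.

    `teardown` is a hypothetical placeholder for the real reset step
    (for example, code that ends up calling cudaDeviceReset()).
    """

    def __init__(self, teardown):
        self._teardown = teardown
        self._lock = threading.Lock()
        self._count = 0
        self._closing = False
        self._done = False

    def acquire(self):
        """Register a user; refuse new users once shutdown has started."""
        with self._lock:
            if self._closing:
                raise RuntimeError("device is shutting down")
            self._count += 1

    def release(self):
        """Unregister a user; the last one out triggers the teardown."""
        with self._lock:
            self._count -= 1
            run = self._closing and self._count == 0 and not self._done
            if run:
                self._done = True
        if run:
            self._teardown()  # called outside the lock

    def request_shutdown(self):
        """Start shutdown; tear down now only if nobody is using the device."""
        with self._lock:
            self._closing = True
            run = self._count == 0 and not self._done
            if run:
                self._done = True
        if run:
            self._teardown()
```

Each worker would call `acquire()` before touching the GPU and `release()` in a `finally` block, while the main thread calls `request_shutdown()` instead of resetting the device directly. The reset then runs only after the last worker has released the device.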

Dead Ends

Common approaches that don't work:

  1. 70% fail Restarting the application. The error occurs during shutdown, so restarting only delays the issue; the race condition persists on subsequent shutdowns.
  2. 60% fail Synchronizing before the reset. Synchronization does not guarantee that all threads have released their contexts; the driver may still be in an invalid state if other threads are mid-operation.
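
Since synchronization alone does not show that other threads are finished, one alternative is to join every worker thread before any teardown runs. The sketch below illustrates that ordering in plain Python under stated assumptions: `run_workers_then_teardown`, `process`, and `teardown` are hypothetical names, with `process` standing in for per-job GPU work and `teardown` for the reset step.

```python
import queue
import threading


def run_workers_then_teardown(jobs, process, teardown, num_workers=2):
    """Run `process` on every job in worker threads, then call `teardown`
    only after every worker thread has been joined.

    `process` and `teardown` are hypothetical callbacks supplied by the
    caller; nothing here talks to CUDA directly.
    """
    work = queue.Queue()
    for job in jobs:
        work.put(job)

    def worker():
        # Drain the queue; exit when no jobs are left.
        while True:
            try:
                job = work.get_nowait()
            except queue.Empty:
                return
            process(job)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # after the joins, no worker can be mid-operation

    teardown()  # runs strictly after all workers have exited
```

The design choice is that the join, not a device-level synchronize, is what establishes that no thread can still issue work when the teardown starts.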