CUBLAS_STATUS_ALLOC_FAILED cuda runtime_error ai_generated partial

CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate_v2

ID: cuda/cublas-api-error-on-shutdown

Fix rate: 75%
Confidence: 85%
Evidence count: 1
First seen: 2023-03-15

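Version compatibility

| Version | Status | Introduced | Deprecated | Notes |
| --- | --- | --- | --- | --- |
| CUDA 11.8 | active | | | |
| CUDA 12.1 | active | | | |
| cuBLAS 11.11 | active | | | |
| cuBLAS 12.0 | active | | | |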

Root cause analysis

cuBLAS handle allocation fails, usually because of insufficient GPU memory or corrupted driver state. It is often triggered by rapid context creation/destruction, or by an earlier CUDA error that left the device in an inconsistent state.
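
To tell the two causes apart, it can help to look at free device memory and to flush pending GPU work before the first cuBLAS call. The following is a minimal sketch, assuming PyTorch is the framework in use; the helper name `gpu_memory_report` and the `min_free_mib` threshold are hypothetical and chosen only for illustration, not a documented cuBLAS requirement.

```python
def gpu_memory_report(min_free_mib=256):
    """Return a dict describing free GPU memory, or None if no CUDA device is usable.

    min_free_mib is an arbitrary illustrative threshold (an assumption),
    not a value taken from the cuBLAS documentation.
    """
    try:
        import torch  # assumption: PyTorch is the framework in use
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    # Wait for queued GPU work so that an earlier asynchronous CUDA error
    # is raised here instead of surfacing later inside cublasCreate.
    torch.cuda.synchronize()

    # mem_get_info() returns (free_bytes, total_bytes) for the current device.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    free_mib = free_bytes // (1024 * 1024)
    return {
        "free_mib": free_mib,
        "total_mib": total_bytes // (1024 * 1024),
        "low_memory": free_mib < min_free_mib,
    }


if __name__ == "__main__":
    print(gpu_memory_report())
```

If `low_memory` is true, memory pressure is the more likely cause. If `torch.cuda.synchronize()` itself raises, an earlier CUDA error has probably already left the context in a bad state.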

generic

Official documentation

https://docs.nvidia.com/cuda/cublas/index.html#cublascreate

Solutions

  1. Release cached GPU memory by calling `torch.cuda.empty_cache()` before new cuBLAS handles are created, then reinitialize the model in a fresh context (ideally a new process). Note that `torch.cuda.reset_peak_memory_stats()` only resets the allocator's peak-usage counters; it neither frees memory nor resets the device.
  2. Use `nvidia-smi` to identify all processes using the GPU, kill them, and restart the application. For persistent issues, reboot the machine to fully reset the GPU driver state.
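
The second step can be sketched as a small script that lists the processes currently holding GPU memory, so that they can be stopped deliberately. This is an illustration under assumptions: it relies on `nvidia-smi` being on `PATH` and on its `--query-compute-apps` CSV output, and the helper names `parse_compute_apps` and `list_gpu_processes` are hypothetical.

```python
import csv
import io
import shutil
import subprocess

# nvidia-smi query for compute processes, as CSV without header or units.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, process_name, used_mib) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, name, used = (f.strip() for f in fields)
        if not pid.isdigit():
            continue
        # used_memory may be reported as a non-numeric placeholder such as "[N/A]".
        used_mib = int(used) if used.isdigit() else None
        rows.append((int(pid), name, used_mib))
    return rows


def list_gpu_processes():
    """Return the compute processes on the GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for pid, name, used_mib in list_gpu_processes():
        print(f"pid={pid} name={name} used_mib={used_mib}")
```

The script only reports processes. Terminating them is left as a manual step (for example with `kill <pid>`), since stopping another user's job automatically is rarely what is wanted on a shared machine.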

Ineffective attempts

Common but ineffective approaches:

  1. 80% failure rate

    The previous CUDA context may still be alive, and residual allocations prevent new handle creation; a full GPU reset or process kill is needed.

  2. 90% failure rate

    The error is not about insufficient memory for tensors but about handle allocation; larger batch sizes exacerbate memory pressure.

  3. 70% failure rate

    The issue is often runtime state corruption, not a missing library; driver version mismatch can cause other errors, but this specific error persists.