CUBLAS_STATUS_ALLOC_FAILED cuda resource_error ai_generated true

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate_v2

ID: cuda/cublas-alloc-failed-internal

Fix Rate: 78%
Confidence: 82%
Evidence: 1
First Seen: 2023-05-20

Version Compatibility

Version     Status   Introduced   Deprecated   Notes
CUDA 11.8   active
CUDA 12.0   active
CUDA 12.1   active

Root Cause

cuBLAS library failed to allocate internal memory, typically due to insufficient GPU memory or a CUDA context that is corrupted or exhausted.
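To check whether GPU memory really is exhausted, and which processes hold it, the nvidia-smi query interface can be read from a script. The following is a minimal sketch, not part of the original entry: the helper names parse_compute_apps and list_gpu_processes are hypothetical, and it assumes nvidia-smi is on the PATH and reports memory in MiB when nounits is requested.

```python
import subprocess

# Query fields follow `nvidia-smi --help-query-compute-apps`.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, used_mib' rows, largest memory user first."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # not a data row
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        pid, name, used = pid.strip(), name.strip(), used.strip()
        if not (pid.isdigit() and used.isdigit()):
            continue  # skip rows such as "[N/A]"
        rows.append({"pid": int(pid), "name": name, "used_mib": int(used)})
    return sorted(rows, key=lambda row: row["used_mib"], reverse=True)


def list_gpu_processes():
    """Return parsed rows, or an empty list if nvidia-smi is missing or fails."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for proc in list_gpu_processes():
        print(f"{proc['pid']:>8}  {proc['used_mib']:>6} MiB  {proc['name']}")
```

If the largest entry belongs to a stale or unrelated process, that process is the first candidate to stop before retrying.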

generic

Official Documentation

https://docs.nvidia.com/cuda/cublas/index.html#cublas-status-t

Workarounds

  1. 80% success Reduce GPU memory usage by decreasing the batch size or using gradient accumulation. For example, set batch_size=8 and accumulate gradients over 4 steps: call loss.backward() on every step, and only when (step+1) % 4 == 0 call optimizer.step() followed by optimizer.zero_grad(). Calling optimizer.zero_grad() before every backward pass would discard the accumulated gradients. Calling torch.cuda.empty_cache() after each epoch returns PyTorch's unused cached blocks to the driver, which can leave room for cuBLAS's own allocation; it does not free memory held by live tensors.
  2. 75% success Restart the Python process and ensure no other processes are using the GPU. Run nvidia-smi to check memory usage, stop competing processes with kill <PID> (falling back to kill -9 <PID> only if a process does not exit), then re-run the code.
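The gradient accumulation described in workaround 1 can be sketched as follows. This is an illustrative example under stated assumptions, not the entry's own code: the function name train_with_accumulation, the toy nn.Linear model, the MSE loss, and the random data are all hypothetical placeholders, and the loss is divided by the number of accumulation steps so the summed gradient approximates the average over the larger effective batch.

```python
import torch
from torch import nn


def train_with_accumulation(model, optimizer, batches, accum_steps=4):
    """One pass over `batches`, stepping the optimizer every `accum_steps` micro-batches."""
    loss_fn = nn.MSELoss()
    optimizer_steps = 0
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(batches):
        loss = loss_fn(model(inputs), targets)
        # Gradients add up across backward() calls until zero_grad() is called.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()  # reset only after the optimizer has stepped
            optimizer_steps += 1
    return optimizer_steps


# Toy setup: 8 micro-batches of size 8 give an effective batch size of 32.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [
    (torch.randn(8, 16, device=device), torch.randn(8, 1, device=device))
    for _ in range(8)
]
steps_taken = train_with_accumulation(model, optimizer, batches, accum_steps=4)

if torch.cuda.is_available():
    # Hand unused cached blocks back to the driver, e.g. at the end of an epoch.
    torch.cuda.empty_cache()
```

Peak activation memory scales with the micro-batch size of 8 rather than the effective batch size of 32, which is where the saving comes from.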


Dead Ends

Common approaches that don't work:

  1. 90% fail

    Increasing the batch size makes the problem worse: it consumes more GPU memory and leaves even less for cuBLAS.

  2. 70% fail

    Setting torch.backends.cudnn.enabled = False disables cuDNN but cuBLAS is still used internally; this doesn't free the memory needed by cuBLAS.