CUBLAS_STATUS_ALLOC_FAILED cuda resource_error ai_generated true

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate_v2

ID: cuda/cublas-alloc-failed-internal


Fix Rate: 78%

Confidence: 82%

Evidence: 1

First Seen: 2023-05-20

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
CUDA 11.8	active	—	—	—
CUDA 12.0	active	—	—	—
CUDA 12.1	active	—	—	—

Root Cause

The cuBLAS library failed to allocate its internal memory, typically because GPU memory is insufficient or because the CUDA context is corrupted or exhausted.

generic


Official Documentation

https://docs.nvidia.com/cuda/cublas/index.html#cublas-status-t

Workarounds

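80% success: Reduce GPU memory usage by decreasing the batch size or by using gradient accumulation. For example, set batch_size=8 and accumulate gradients over 4 steps: call loss.backward() on every batch, and call optimizer.step() followed by optimizer.zero_grad() only when (step+1) % 4 == 0. Calling zero_grad() before every backward pass would discard the accumulated gradients. Also call torch.cuda.empty_cache() after each epoch to release unused cached memory.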
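The gradient accumulation workaround above can be sketched roughly as follows. This is a minimal illustration, not a definitive training loop: the function name `train_one_epoch` and its parameters are hypothetical, and it assumes PyTorch-style objects (a callable model, an optimizer with `step()` and `zero_grad()`, and a loss value that supports division and `backward()`).

```python
def train_one_epoch(model, loader, optimizer, loss_fn, accum_steps=4):
    """Run one epoch, updating weights once every `accum_steps` micro-batches.

    Hypothetical helper: all arguments are assumed to behave like their
    PyTorch counterparts. Returns the number of optimizer updates performed.
    """
    optimizer.zero_grad()  # start from clean gradients
    num_updates = 0
    num_batches = 0
    for step, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets)
        # Scale the loss so the summed gradients approximate the average
        # over the larger effective batch (batch_size * accum_steps).
        (loss / accum_steps).backward()
        num_batches += 1
        if (step + 1) % accum_steps == 0:
            optimizer.step()       # apply the accumulated gradients
            optimizer.zero_grad()  # reset only after the update
            num_updates += 1
    # Apply any leftover micro-batches that did not fill a full group.
    if num_batches % accum_steps != 0:
        optimizer.step()
        optimizer.zero_grad()
        num_updates += 1
    return num_updates
```

With real PyTorch objects, a call might look like `train_one_epoch(model, train_loader, optimizer, torch.nn.functional.cross_entropy, accum_steps=4)`, where `model` and `train_loader` are your own objects, followed by `torch.cuda.empty_cache()` between epochs.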
75% success: Restart the Python process and make sure no other process is using the GPU. Run nvidia-smi to check memory usage, kill competing processes with kill -9 <PID>, then re-run the code.
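The nvidia-smi check above can also be scripted. The sketch below lists the processes currently holding GPU memory by calling `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`. The helper names `parse_compute_apps` and `list_gpu_processes` are hypothetical, and the code assumes memory is reported in MiB, which is what `nounits` output normally uses.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]

def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into dicts; skip rows that do not fit."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = line.split(",")
        if len(parts) < 3:
            continue
        pid = parts[0].strip()
        mem = parts[-1].strip()
        # Rejoin the middle fields in case the process name contains a comma.
        name = ",".join(parts[1:-1]).strip()
        if not pid.isdigit():
            continue
        rows.append({
            "pid": int(pid),
            "name": name,
            # Memory can be reported as "[N/A]" on some setups.
            "used_mib": int(mem) if mem.isdigit() else None,
        })
    return rows

def list_gpu_processes():
    """Return GPU compute processes, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return []
    return parse_compute_apps(result.stdout)

if __name__ == "__main__":
    for proc in list_gpu_processes():
        print(proc["pid"], proc["name"], proc["used_mib"])
```

Review the listed PIDs before terminating anything. Only kill processes you own and recognize.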


Dead Ends

Common approaches that don't work:

90% fail
Increasing batch size in the model makes the problem worse by consuming more GPU memory, not less.
70% fail
Setting torch.backends.cudnn.enabled = False disables cuDNN, but cuBLAS is still used internally, so this does not free the memory cuBLAS needs.