CUBLAS_STATUS_ALLOC_FAILED cuda resource_error ai_generated true

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasLtMatmulAlgoGetHeuristic

ID: cuda/cublas-alloc-failed-cublaslt

Fix Rate: 78%
Confidence: 86%
Evidence: 1
First Seen: 2024-01-15

Version Compatibility

Version        Status   Introduced   Deprecated   Notes
CUDA 12.0      active
CUDA 12.3      active
cuBLASLt 0.8   active
PyTorch 2.2    active

Root Cause

The cuBLASLt heuristic search for a matrix multiplication algorithm (cublasLtMatmulAlgoGetHeuristic) fails because the GPU does not have enough free memory. This is often caused by memory fragmentation or by large workspace requirements.
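
To check whether free memory is really the limiting factor, it can help to compare what the driver reports as free with what PyTorch's caching allocator is holding. The sketch below is only an illustration: memory_report and cuda_memory_report are hypothetical helper names, and the second function assumes PyTorch is installed and a CUDA device is visible.

```python
def memory_report(free_bytes, total_bytes, allocated_bytes, reserved_bytes):
    """Convert raw byte counts into a small MiB summary."""
    mib = 1024 ** 2
    return {
        "free_mib": free_bytes / mib,
        "total_mib": total_bytes / mib,
        "allocated_mib": allocated_bytes / mib,
        # Reserved by PyTorch's caching allocator but not backing a live tensor.
        "cached_unused_mib": (reserved_bytes - allocated_bytes) / mib,
    }


def cuda_memory_report(device=0):
    """Gather the figures from PyTorch (assumes torch and a CUDA device)."""
    import torch

    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return memory_report(
        free_bytes,
        total_bytes,
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
    )
```

A small free_mib together with a large cached_unused_mib suggests that most of the device is held in PyTorch's cache rather than by live tensors, which is the case where releasing the cache is most likely to help.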

generic

Official Documentation

https://docs.nvidia.com/cuda/cublaslt/index.html#cublasltmatmulalgogetheuristic

Workarounds

  1. 80% success Reduce memory usage by lowering the batch size or using gradient checkpointing. For example, in PyTorch: output = torch.utils.checkpoint.checkpoint(model, *inputs, use_reentrant=False). The call returns the module's output, not a new model, and recomputes activations during the backward pass instead of storing them, which leaves more free memory for cuBLASLt.
  2. 70% success Release cached GPU memory before the operation: torch.cuda.empty_cache(). This returns unused blocks held by PyTorch's caching allocator to the driver, so allocations made outside that allocator (such as those made internally by cuBLAS) can succeed. It does not free memory held by live tensors.
  3. 75% success Set the environment variable CUBLASLT_HEURISTIC_MODE=1 before the process starts. It is reported to restrict the number of algorithms searched and so reduce the workspace allocated during the heuristic search. Check that your cuBLASLt version recognizes this variable before relying on it.
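
The first two workarounds can be combined into a retry loop. The following is a minimal sketch, not a definitive implementation: run_with_backoff is a hypothetical helper, step is assumed to be a caller-supplied function that runs one forward/backward pass for a given batch size, and on_retry is an optional hook where torch.cuda.empty_cache could be passed. It assumes the process is still usable after the error, which may not hold in every setup.

```python
def run_with_backoff(step, batch_size, min_batch_size=1, on_retry=None):
    """Call step(batch_size), halving batch_size after a cuBLAS allocation failure.

    step and on_retry are hypothetical, caller-supplied callables.
    """
    while True:
        try:
            return step(batch_size)
        except RuntimeError as err:
            # Only handle the allocation failure; re-raise anything else.
            if "CUBLAS_STATUS_ALLOC_FAILED" not in str(err):
                raise
            # Nothing left to shrink: surface the original error.
            if batch_size <= min_batch_size:
                raise
            if on_retry is not None:
                on_retry()  # e.g. release cached GPU memory
            batch_size = max(min_batch_size, batch_size // 2)
```

With PyTorch, a call might look like run_with_backoff(train_step, 64, on_retry=torch.cuda.empty_cache), where train_step is your own function. Halving is an arbitrary choice; a fixed decrement works just as well if the batch size must stay near a target.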

Dead Ends

Common approaches that don't work:

  1. 95% fail Increasing the batch size.

    Larger batch sizes increase memory usage and make the allocation failure worse. The error comes from insufficient memory for the workspace, not from GPU underutilization.

  2. 80% fail Changing the cuBLAS workspace configuration.

    The workspace configuration controls the size of the internal buffer, but it does not directly fix allocation failures during the heuristic search and may even increase memory pressure.

  3. 50% fail Disabling cuBLASLt.

    cuBLASLt is often the default backend for certain operations. Disabling it may fall back to cuBLAS, but this can degrade performance or surface different errors.