RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasLtMatmulAlgoGetHeuristic
ID: cuda/cublas-alloc-failed-cublaslt
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 12.0 | active | — | — | — |
| CUDA 12.3 | active | — | — | — |
| cuBLASLt 0.8 | active | — | — | — |
| PyTorch 2.2 | active | — | — | — |
Root Cause
cuBLASLt heuristic search for matrix multiplication algorithms fails due to insufficient GPU memory, often caused by memory fragmentation or large workspace requirements.
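To see whether memory is the cause, compare what the driver reports as free with what PyTorch's caching allocator is holding. The sketch below assumes the failing allocation is made by cuBLAS directly through the driver, so it can only use memory the driver reports as free. `summarize_gpu_memory` and `report_cuda_memory` are hypothetical helper names; the `torch.cuda` calls are standard PyTorch APIs.

```python
def summarize_gpu_memory(free, total, allocated, reserved):
    """Break down GPU memory usage; all arguments and results are in bytes."""
    # Blocks PyTorch's caching allocator holds but that no live tensor uses.
    cached_unused = reserved - allocated
    return {
        "free_for_cublas": free,                # visible to direct driver allocations
        "cached_unused": cached_unused,         # released by torch.cuda.empty_cache()
        "reclaimable": free + cached_unused,    # free after emptying the cache
        "free_fraction": free / total,
    }


def report_cuda_memory(device=0):
    """Collect the numbers from PyTorch (requires a CUDA-capable device)."""
    import torch  # imported lazily so the pure helper works without PyTorch

    free, total = torch.cuda.mem_get_info(device)
    return summarize_gpu_memory(
        free,
        total,
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
    )
```

If `cached_unused` is large while `free_for_cublas` is small, releasing the cache is likely to help; if both are small, memory usage itself has to come down.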
Official Documentation
https://docs.nvidia.com/cuda/cublaslt/index.html#cublasltmatmulalgogetheuristic
Workarounds
- 80% success: Reduce memory usage by lowering the batch size or using gradient checkpointing. In PyTorch, wrap the forward call: output = torch.utils.checkpoint.checkpoint(model, *inputs, use_reentrant=False). checkpoint returns the module's output, not a new model. It discards intermediate activations and recomputes them during the backward pass, which leaves more memory free for the heuristic allocation.
- 70% success: Clear the GPU cache before the operation: torch.cuda.empty_cache(). This returns PyTorch's unused cached blocks to the CUDA driver so that cuBLASLt can allocate its workspace from them. It does not free memory held by live tensors and does not compact memory that is still in use.
- 75% success: Restrict the number of algorithms searched by setting the environment variable CUBLASLT_HEURISTIC_MODE=1 before the process starts. This is reported to reduce the workspace allocated during the heuristic search. Check the cuBLAS documentation for your CUDA version to confirm the variable is supported, since unrecognized environment variables are silently ignored.
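As a minimal sketch of the gradient checkpointing workaround, the snippet below runs a small module under `torch.utils.checkpoint.checkpoint`. The layer sizes and the names `block` and `x` are arbitrary placeholders, and `use_reentrant=False` is the variant the PyTorch documentation recommends. It runs on CPU, so it only demonstrates the call pattern, not the memory savings.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Placeholder module standing in for an expensive part of a real model.
block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
x = torch.randn(4, 16, requires_grad=True)

# checkpoint() returns the module's output. Activations inside `block`
# are not stored; they are recomputed when backward() needs them.
out = checkpoint(block, x, use_reentrant=False)

# Gradients flow as usual, at the cost of one extra forward pass.
out.sum().backward()
```

Checkpointing trades compute for memory, so it is usually applied to the largest blocks of a model rather than to the whole network in one call.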
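The first two workarounds can be combined into a retry loop: on this specific error, release cached memory and halve the batch size before trying again. This is a sketch under assumptions, not a definitive implementation: `run_with_backoff` and `step` are hypothetical names, `step` is any callable that takes a batch size, and the loop assumes the CUDA context is still usable after the failure.

```python
import gc


def _release_cached_gpu_memory():
    """Drop Python garbage, then return PyTorch's unused cached blocks to the driver."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return  # nothing to release without PyTorch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_with_backoff(step, batch_size, min_batch_size=1,
                     release=_release_cached_gpu_memory):
    """Call step(batch_size), halving the batch size on cuBLAS allocation failures.

    Returns (result, batch_size_that_worked). Other errors are re-raised unchanged.
    """
    while True:
        try:
            return step(batch_size), batch_size
        except RuntimeError as err:
            is_alloc_failure = "CUBLAS_STATUS_ALLOC_FAILED" in str(err)
            if not is_alloc_failure or batch_size <= min_batch_size:
                raise
        # Cleanup happens outside the except block so the exception's
        # traceback no longer keeps tensors from the failed step alive.
        release()
        batch_size = max(min_batch_size, batch_size // 2)
```

For example, `run_with_backoff(train_one_step, 64)` would try batch sizes 64, 32, 16, and so on, where `train_one_step` is a placeholder for your own function.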
Dead Ends
Common approaches that don't work:
- 95% fail: Increasing the batch size. Larger batches increase memory usage and make the allocation failure worse. The error comes from insufficient memory for the workspace, not from underutilization.
- 80% fail: Changing the workspace config. It controls the internal buffer size but does not directly fix allocation failures during the heuristic search, and it may even increase memory pressure.
- 50% fail: Disabling cuBLASLt. cuBLASLt is often the default for certain operations; disabling it may fall back to cuBLAS, but that can degrade performance or surface different errors.