CUBLAS_STATUS_ALLOC_FAILED cuda resource_error ai_generated true

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasLtMatmulAlgoGetHeuristic

ID: cuda/cublas-alloc-failed-cublaslt

Fix rate: 78%

Confidence: 86%

Evidence count: 1

First seen: 2024-01-15

Version compatibility

Version	Status	Introduced	Deprecated	Notes
CUDA 12.0	active	—	—	—
CUDA 12.3	active	—	—	—
cuBLASLt 0.8	active	—	—	—
PyTorch 2.2	active	—	—	—

Root cause analysis

The cuBLASLt heuristic search for matrix multiplication algorithms (cublasLtMatmulAlgoGetHeuristic) fails because the GPU does not have enough memory available, often as a result of memory fragmentation or a large workspace requirement.
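To check whether memory pressure is a plausible cause, it can help to compare the memory the driver reports as free with the memory PyTorch has reserved but is not using. The following is a minimal sketch, not an official diagnostic: the helper names describe_gpu_memory and snapshot_from_torch are hypothetical, and the interpretation (a large cached-but-unused figure next to little free memory hints at fragmentation) is a rule of thumb.

```python
from typing import Dict, Optional

MIB = 1024 ** 2


def describe_gpu_memory(free_bytes: int, total_bytes: int,
                        reserved_bytes: int, allocated_bytes: int) -> Dict[str, float]:
    """Summarize GPU memory figures in MiB (hypothetical helper)."""
    return {
        "free_mib": free_bytes / MIB,            # free according to the driver
        "total_mib": total_bytes / MIB,          # total device memory
        "reserved_mib": reserved_bytes / MIB,    # held by PyTorch's caching allocator
        "allocated_mib": allocated_bytes / MIB,  # occupied by live tensors
        # Cached by PyTorch but not backing any live tensor.
        "cached_unused_mib": (reserved_bytes - allocated_bytes) / MIB,
    }


def snapshot_from_torch() -> Optional[Dict[str, float]]:
    """Read the figures from PyTorch; returns None without PyTorch or a CUDA device."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return describe_gpu_memory(
        free_bytes,
        total_bytes,
        torch.cuda.memory_reserved(),
        torch.cuda.memory_allocated(),
    )
```

Calling snapshot_from_torch() just before the failing operation gives the most useful picture.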

generic

Official documentation

https://docs.nvidia.com/cuda/cublaslt/index.html#cublasltmatmulalgogetheuristic

Solutions

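Reduce memory usage by lowering the batch size or by using gradient checkpointing. For example, in PyTorch: outputs = torch.utils.checkpoint.checkpoint(model, *inputs, use_reentrant=False). Checkpointing returns the module's outputs (not a new model) and recomputes activations during the backward pass instead of storing them, which leaves more memory available for the heuristic allocation.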
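Lowering the batch size can be automated with a small retry loop. The sketch below assumes a user-supplied callable, here called step, that runs one forward/backward pass for a given batch size and raises RuntimeError on failure, as PyTorch does for this error. The helper name run_with_shrinking_batch and the list of message markers are hypothetical.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

# Message fragments treated as memory-related failures (an assumption).
MEMORY_ERROR_MARKERS = ("CUBLAS_STATUS_ALLOC_FAILED", "out of memory")


def run_with_shrinking_batch(step: Callable[[int], T],
                             batch_size: int,
                             min_batch_size: int = 1) -> Tuple[T, int]:
    """Call step(batch_size), halving the batch size after each memory-related error.

    Returns the result of step together with the batch size that succeeded.
    """
    while True:
        try:
            return step(batch_size), batch_size
        except RuntimeError as err:
            is_memory_error = any(marker in str(err) for marker in MEMORY_ERROR_MARKERS)
            if not is_memory_error or batch_size <= min_batch_size:
                raise  # unrelated error, or nothing left to shrink
            batch_size = max(min_batch_size, batch_size // 2)
```

Halving keeps the number of retries logarithmic in the starting batch size. If training hyperparameters depend on the batch size, they may need adjusting once a smaller value is chosen.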

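Clear the GPU cache before the operation: torch.cuda.empty_cache(). This returns unused blocks held by PyTorch's caching allocator to the driver, which can reduce the effect of fragmentation and free memory for the cuBLASLt workspace. It does not free memory that live tensors still occupy.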
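A minimal sketch of the cache-clearing step, assuming it is called right before the operation that fails. The helper name release_cached_gpu_memory is hypothetical. It runs Python's garbage collector first so that tensors kept alive only by reference cycles are freed before the cache is emptied.

```python
import gc


def release_cached_gpu_memory() -> bool:
    """Free unreachable tensors, then return PyTorch's unused cached blocks to the driver.

    Returns True if the CUDA cache was emptied, False if PyTorch or CUDA is unavailable.
    """
    gc.collect()  # drop tensors that are only kept alive by reference cycles
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True
```

Calling this on every iteration can slow training, because the allocator then has to request memory from the driver again, so it is usually reserved for the point where the failure occurs.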

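Limit the number of algorithms searched by setting the environment variable CUBLASLT_HEURISTIC_MODE=1. According to this entry, doing so reduces the size of workspace allocations during the heuristic search; confirm that your cuBLASLt version recognizes the variable before relying on it.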

Ineffective attempts

Common but ineffective approaches:

95% failure rate
Larger batch sizes increase memory usage, exacerbating the allocation failure. The error occurs due to insufficient memory for workspace, not underutilization.
80% failure rate
The workspace config controls the internal buffer size but doesn't directly fix allocation failures during heuristic search; it may even increase memory pressure.
50% failure rate
cuBLASLt is often the default for certain operations; disabling it may fall back to cuBLAS but can cause performance degradation or different errors.