CUBLAS_STATUS_NOT_SUPPORTED cuda runtime_error ai_generated true

Runtime error: cublasGemmEx returns CUBLAS_STATUS_NOT_SUPPORTED

RuntimeError: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &alpha, a, atype, lda, b, btype, ldb, &beta, c, ctype, ldc, compute_type, algo)

ID: cuda/cublas-gemm-params-unsupported-combination

82% fix rate

88% confidence

1 evidence item

First seen: 2023-05-15

Version compatibility

Version	Status	Introduced	Deprecated	Notes
CUDA 11.8	active	—	—	—
CUDA 12.1	active	—	—	—
cuBLAS 11.11	active	—	—	—
cuBLAS 12.0	active	—	—	—

Root cause analysis

The cuBLAS library does not support the requested combination of matrix data types (atype, btype, ctype) and compute type on the current GPU architecture.
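As a rough illustration of how the usable dtype depends on the GPU architecture, the sketch below maps a compute capability to a dtype name. The function name and the thresholds are assumptions made for this example (bfloat16 from sm_80, float16 from sm_70, float32 otherwise); it is not a table taken from cuBLAS.

```python
def pick_dtype_name(capability):
    """Map a (major, minor) compute capability to a dtype name.

    The thresholds are illustrative assumptions: bfloat16 from sm_80,
    float16 from sm_70, and float32 for anything older.
    """
    major, _minor = capability
    if major >= 8:
        return "bfloat16"
    if major >= 7:
        return "float16"
    return "float32"


# Hypothetical capabilities, used only to show the mapping.
examples = {cap: pick_dtype_name(cap) for cap in [(6, 1), (7, 0), (8, 0), (9, 0)]}
for cap, name in examples.items():
    print(f"sm_{cap[0]}{cap[1]} -> {name}")
```

In PyTorch, the capability tuple for the active device can be read with torch.cuda.get_device_capability().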

generic

Official documentation

https://docs.nvidia.com/cuda/cublas/index.html#cublas-status-not-supported

Solutions

Check bfloat16 support with torch.cuda.is_bf16_supported() before using that type, and fall back to float16 otherwise. For example: model = model.to(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16)
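A slightly fuller version of that check might look like the sketch below. The helper name pick_reduced_precision_dtype and the small torch.nn.Linear layer standing in for a real model are assumptions for illustration; the guarded import only lets the snippet load on machines without PyTorch.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def pick_reduced_precision_dtype():
    """Return torch.bfloat16 when the GPU supports it, else torch.float16.

    Returns None when PyTorch or a CUDA device is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    if torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16


dtype = pick_reduced_precision_dtype()
if dtype is not None:
    # torch.nn.Linear stands in for a real model here (hypothetical example).
    model = torch.nn.Linear(16, 4).to(device="cuda", dtype=dtype)
    x = torch.randn(2, 16, device="cuda", dtype=dtype)
    y = model(x)  # weights and input share one dtype, so the GEMM types match
```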

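Explicitly set the compute type to match the input types. In PyTorch, call torch.set_default_dtype(torch.float32), which affects only tensors created afterwards, or cast the tensors to a supported type combination, such as float16 on Ampere or newer GPUs.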
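One way to read "cast the tensors to a supported type combination" is to convert both operands to a single dtype before the multiplication. The sketch below assumes a hypothetical helper named matmul_in_dtype and runs on the CPU so it can be tried without a GPU; on a CUDA device the same casts would apply.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def matmul_in_dtype(a, b, dtype):
    """Cast both operands to `dtype`, then multiply them."""
    return torch.matmul(a.to(dtype), b.to(dtype))


if torch is not None:
    # Operands deliberately start with different dtypes.
    a = torch.randn(4, 8, dtype=torch.float64)
    b = torch.randn(8, 2, dtype=torch.float32)
    out = matmul_in_dtype(a, b, torch.float32)
    out_shape, out_dtype = tuple(out.shape), str(out.dtype)
else:
    out_shape, out_dtype = None, None
```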

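Set the environment variable CUBLAS_WORKSPACE_CONFIG=:4096:8 before the process starts using CUDA. This does not disable cuBLAS or switch to custom kernels; it changes the workspace configuration that cuBLAS uses, which may lead it to select a different code path that supports the type combination.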
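The variable can also be set from Python, as in the sketch below. Setting it before any CUDA-using library is imported is a precaution assumed here, not a documented requirement of this entry. The helper parse_workspace_config is hypothetical and only makes the :SIZE:COUNT layout visible; reading SIZE as KiB follows the cuBLAS documentation as understood here and is worth verifying.

```python
import os


def parse_workspace_config(value):
    """Split ':SIZE:COUNT[:SIZE:COUNT...]' into (size, count) pairs."""
    fields = [field for field in value.split(":") if field]
    if len(fields) % 2:
        raise ValueError(f"expected SIZE:COUNT pairs, got {value!r}")
    return [(int(fields[i]), int(fields[i + 1])) for i in range(0, len(fields), 2)]


# Set the variable first, then import libraries that initialize CUDA.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
pairs = parse_workspace_config(os.environ["CUBLAS_WORKSPACE_CONFIG"])

# import torch  # import CUDA-using libraries only after the variable is set
```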

Ineffective attempts

Common approaches that do not work:

65% failure rate
CUDA version alone doesn't guarantee support; the GPU's compute capability (e.g., sm_70 vs sm_80) determines which type combinations are valid.
90% failure rate
The algorithm parameter doesn't change data type compatibility; it only affects performance and precision for supported type combinations.
40% failure rate
While float32 is widely supported, falling back to it may cause out-of-memory errors for large models or reduce performance if the original types were chosen for speed (e.g., half precision).
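The memory cost of a float32 fallback can be estimated with simple arithmetic: weight memory is the parameter count times the bytes per element. The sketch below uses a hypothetical 7-billion-parameter model and counts weights only, ignoring activations, gradients, and optimizer state.

```python
BYTES_PER_ELEMENT = {"float16": 2, "bfloat16": 2, "float32": 4}


def weight_bytes(num_params, dtype_name):
    """Return the bytes needed to store `num_params` values of the given dtype."""
    return num_params * BYTES_PER_ELEMENT[dtype_name]


num_params = 7_000_000_000  # hypothetical model size, for illustration only
half_gib = weight_bytes(num_params, "float16") / 2**30
full_gib = weight_bytes(num_params, "float32") / 2**30
print(f"float16: {half_gib:.1f} GiB, float32: {full_gib:.1f} GiB")
```

Weight storage doubles when moving from float16 to float32, which is why the fallback can exhaust GPU memory on a model that previously fit.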