Runtime error: cublasGemmEx returns CUBLAS_STATUS_NOT_SUPPORTED
RuntimeError: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &alpha, a, atype, lda, b, btype, ldb, &beta, c, ctype, ldc, compute_type, algo)
ID: cuda/cublas-gemm-params-unsupported-combination
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 11.8 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
| cuBLAS 11.11 | active | — | — | — |
| cuBLAS 12.0 | active | — | — | — |
Root cause analysis
The combination of input matrix data types (atype, btype, ctype) and compute type is not supported by the cuBLAS library on the current GPU architecture.
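Because support depends on the GPU architecture, it can help to look at the device's compute capability before choosing a dtype. The sketch below is a rough heuristic, not a statement of what cuBLAS accepts: the helper names and the 8.0 threshold for native bfloat16 GEMM are assumptions introduced here for illustration, and PyTorch is imported lazily so the pure check works without it.

```python
from typing import Optional, Tuple

# Assumption: native bfloat16 GEMM needs compute capability 8.0 (Ampere) or newer.
BF16_MIN_CAPABILITY = (8, 0)


def native_bf16_gemm_expected(capability: Tuple[int, int]) -> bool:
    """Return True if (major, minor) is at or above the assumed bf16 threshold."""
    return tuple(capability) >= BF16_MIN_CAPABILITY


def current_capability() -> Optional[Tuple[int, int]]:
    """Return the active GPU's (major, minor), or None without PyTorch or a CUDA device."""
    try:
        import torch  # imported lazily so the pure helper above has no dependency
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return tuple(torch.cuda.get_device_capability())


if __name__ == "__main__":
    cap = current_capability()
    if cap is None:
        print("No CUDA device visible; cannot check type support.")
    else:
        expected = native_bf16_gemm_expected(cap)
        print(f"sm_{cap[0]}{cap[1]}: native bf16 GEMM expected = {expected}")
```

A capability below the threshold, such as sm_70, suggests falling back to float16 or float32 rather than retrying the same bfloat16 call.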
Official documentation
https://docs.nvidia.com/cuda/cublas/index.html#cublas-status-not-supported
Solutions
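- Check bfloat16 support with torch.cuda.is_bf16_supported() before using it. For example: if torch.cuda.is_bf16_supported(): model = model.to(torch.bfloat16) else: model = model.to(torch.float16)
- Explicitly set the compute type to match the input types. In PyTorch, use torch.set_default_dtype(torch.float32) or cast the tensors to a supported type combination, such as float16 on Ampere+ GPUs.
- Set the environment variable CUBLAS_WORKSPACE_CONFIG=:4096:8. This variable configures the cuBLAS workspace and does not disable cuBLAS; it may cause cuBLAS to take a different code path that supports the type combination.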
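The bfloat16 check from the first item can be written out as a small helper. This is a minimal sketch: the function names are hypothetical, and it assumes a PyTorch model object exposing the usual `.to(dtype)` method. The dtype decision is kept separate from the PyTorch call so it can be exercised on a machine without a GPU.

```python
def pick_half_dtype_name(bf16_supported: bool) -> str:
    """Choose the reduced-precision dtype name: bfloat16 if supported, else float16."""
    return "bfloat16" if bf16_supported else "float16"


def cast_model_to_supported_half(model):
    """Cast a model to bfloat16 when the GPU reports support, otherwise to float16."""
    import torch  # imported here so the pure helper above works without PyTorch

    # Guard with is_available() so the check is only made when a CUDA device exists.
    use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    dtype = getattr(torch, pick_half_dtype_name(use_bf16))
    return model.to(dtype)
```

Calling `cast_model_to_supported_half(model)` once, before the first forward pass, keeps every later matrix multiply on a single dtype instead of mixing types per call.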
Ineffective attempts
Common but ineffective approaches:
- 65% failure rate. The CUDA version alone doesn't guarantee support; the GPU's compute capability (e.g., sm_70 vs sm_80) determines which type combinations are valid.
- 90% failure rate. The algorithm parameter doesn't change data type compatibility; it only affects performance and precision for supported type combinations.
- 40% failure rate. While float32 is widely supported, this workaround may cause out-of-memory errors for large models or reduce performance if the original types were optimized (e.g., half precision).
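The out-of-memory risk of the float32 workaround follows from simple arithmetic: float32 uses 4 bytes per element where float16 and bfloat16 use 2, so parameter memory doubles. The sketch below estimates parameter storage only; the function name and the 7-billion-parameter figure are hypothetical, and activations, gradients, and optimizer state are not counted.

```python
# Bytes per element for the dtypes discussed above.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def parameter_memory_gib(num_parameters: int, dtype_name: str) -> float:
    """Estimate memory for model parameters alone, in GiB."""
    return num_parameters * BYTES_PER_ELEMENT[dtype_name] / 2**30


if __name__ == "__main__":
    n = 7_000_000_000  # hypothetical model size, for illustration only
    for name in ("float16", "float32"):
        print(f"{name}: {parameter_memory_gib(n, name):.1f} GiB")
```

If the float32 estimate already approaches the GPU's memory, casting the whole model up is unlikely to be a workable fix.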