RuntimeError: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &alpha, a, atype, lda, b, btype, ldb, &beta, c, ctype, ldc, compute_type, algo)
ID: cuda/cublas-gemm-params-unsupported-combination
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 11.8 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
| cuBLAS 11.11 | active | — | — | — |
| cuBLAS 12.0 | active | — | — | — |
Root Cause
The combination of input matrix data types (atype, btype, ctype) and compute type is not supported by the cuBLAS library on the current GPU architecture.
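The dependence on GPU architecture can be sketched as a lookup keyed by compute capability. This is a minimal illustration, not cuBLAS's actual support matrix: the `MIN_CAPABILITY` table and the `supported_dtypes` helper are hypothetical names, and the thresholds are assumptions that should be checked against the cuBLAS documentation for your version.

```python
# Assumed minimum compute capability (major, minor) for each dtype.
# Illustrative thresholds only; confirm against the cuBLAS documentation.
MIN_CAPABILITY = {
    "bfloat16": (8, 0),  # Ampere (sm_80) and newer
    "float16": (5, 3),
    "float32": (0, 0),   # treated as always available in this sketch
}


def supported_dtypes(capability):
    """Return the dtype names whose assumed minimum capability is met."""
    return [
        name
        for name, minimum in MIN_CAPABILITY.items()
        if tuple(capability) >= minimum
    ]


print(supported_dtypes((7, 0)))  # sm_70, e.g. Volta
print(supported_dtypes((8, 0)))  # sm_80, e.g. Ampere
```

On a real machine, the `(major, minor)` tuple would typically come from `torch.cuda.get_device_capability()`.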
Official Documentation
https://docs.nvidia.com/cuda/cublas/index.html#cublas-status-not-supported
Workarounds
-
85% success Check bfloat16 support with torch.cuda.is_bf16_supported() before using it. For example, cast the model to torch.bfloat16 when the check returns True and to torch.float16 otherwise.
-
78% success Make the compute type match the input types. In PyTorch, either keep the default dtype at float32 with torch.set_default_dtype(torch.float32), or cast every tensor that takes part in the matrix multiply to one supported dtype such as float16. Note that bfloat16 additionally requires an Ampere or newer GPU.
-
70% success Set the environment variable CUBLAS_WORKSPACE_CONFIG=:4096:8 before the process initializes CUDA. This does not disable cuBLAS or switch to a custom kernel; it fixes the cuBLAS workspace configuration, which can lead cuBLAS to take a different code path that may support the type combination.
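The first two workarounds above can be combined into one short PyTorch sketch: pick a reduced-precision dtype the device supports, then cast the model and its inputs to that same dtype. The helper name `pick_reduced_precision_dtype` and the small `Linear` model are placeholders for illustration, not part of any library.

```python
import torch


def pick_reduced_precision_dtype() -> torch.dtype:
    """Prefer bfloat16 when the current GPU supports it, else use float16."""
    # Guard with is_available() so the check is safe on CPU-only machines.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16


# Stand-in model for illustration; replace with your own nn.Module.
model = torch.nn.Linear(16, 4)
dtype = pick_reduced_precision_dtype()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device=device, dtype=dtype)

# Inputs use the same dtype as the weights so the GEMM sees one consistent type.
x = torch.randn(2, 16).to(device=device, dtype=dtype)

if device == "cuda":
    # The forward pass is only run on the GPU, where the cuBLAS error would occur.
    y = model(x)
    print(y.dtype)
```

Keeping the dtype decision in one function makes it easy to reuse the same choice for every tensor that reaches the matrix multiply.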
Dead Ends
Common approaches that don't work:
-
65% fail
Relying on the CUDA version alone does not work: the GPU's compute capability (e.g., sm_70 vs. sm_80) determines which type combinations are valid.
-
90% fail
Changing the algorithm parameter (algo) does not change data type compatibility; it only affects performance and precision for type combinations that are already supported.
-
40% fail
Casting everything to float32 is widely supported, but this workaround may cause out-of-memory errors for large models or reduce performance if the original types were chosen for efficiency (e.g., half precision).
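The memory cost of the float32 fallback can be estimated with simple arithmetic. The sketch below counts weight storage only (no activations, gradients, or optimizer state), and the 7-billion-parameter model size is a hypothetical figure chosen for illustration.

```python
# Bytes used to store one element of each dtype.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_parameters: int, dtype_name: str) -> float:
    """Return the memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * BYTES_PER_ELEMENT[dtype_name] / 1e9


num_parameters = 7_000_000_000  # hypothetical model size
for name in ("float32", "float16"):
    print(f"{name}: {weight_memory_gb(num_parameters, name):.1f} GB")
```

Under these assumptions, the float32 weights need twice the memory of the half-precision ones, which is why the fallback can push a large model past the GPU's capacity.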