CUBLAS_STATUS_NOT_SUPPORTED cuda runtime_error ai_generated true

RuntimeError: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &alpha, a, atype, lda, b, btype, ldb, &beta, c, ctype, ldc, compute_type, algo)

ID: cuda/cublas-gemm-params-unsupported-combination

Fix Rate: 82%
Confidence: 88%
Evidence: 1
First Seen: 2023-05-15

Version Compatibility

Version      | Status | Introduced | Deprecated | Notes
CUDA 11.8    | active |            |            |
CUDA 12.1    | active |            |            |
cuBLAS 11.11 | active |            |            |
cuBLAS 12.0  | active |            |            |

Root Cause

The combination of input matrix data types (atype, btype, ctype) and compute type is not supported by the cuBLAS library on the current GPU architecture.
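
Because the supported combinations depend on the GPU architecture, a useful first step is to record which device and CUDA build are actually in use. The following is a minimal diagnostic sketch, assuming PyTorch is the caller; arch_tag and report are hypothetical helper names introduced here for illustration, not part of any library.

```python
from typing import Tuple


def arch_tag(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def report() -> str:
    """Describe the visible GPU, its architecture tag, and the CUDA build."""
    try:
        import torch  # assumption: PyTorch is the framework raising the error
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    capability = torch.cuda.get_device_capability()
    return (
        f"{torch.cuda.get_device_name()} "
        f"{arch_tag(capability)} "
        f"(CUDA build {torch.version.cuda})"
    )


if __name__ == "__main__":
    print(report())
```

Including this output alongside the tensor dtypes involved in the failing matmul makes it easier to compare the setup against the type tables in the cuBLAS documentation.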

Official Documentation

https://docs.nvidia.com/cuda/cublas/index.html#cublas-status-not-supported

Workarounds

  1. 85% success: Call torch.cuda.is_bf16_supported() before using bfloat16. If it returns True, cast with model.to(torch.bfloat16); otherwise fall back to model.to(torch.float16).
  2. 78% success: Use an input and compute type combination that the GPU supports. In PyTorch, cast the model and its inputs to one common dtype: float32 everywhere, float16 on GPUs with half-precision support, or bfloat16 on Ampere and newer GPUs. Note that torch.set_default_dtype(torch.float32) only changes the dtype of newly created floating-point tensors; existing tensors and model weights still need an explicit cast.
  3. 70% success: Set the environment variable CUBLAS_WORKSPACE_CONFIG=:4096:8. This does not disable cuBLAS or switch to a custom kernel; it configures the cuBLAS workspace (the same setting PyTorch documents for deterministic cuBLAS behavior). It may change which internal code path cuBLAS selects, but it is not guaranteed to make an unsupported type combination work.
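
The first two workarounds can be combined into a single dtype choice driven by the device's compute capability. The sketch below is illustrative only: pick_dtype_name and the MIN_CAPABILITY table are hypothetical names, and the thresholds are conservative assumptions (bfloat16 from sm_80, float16 from sm_70, where tensor cores first appear), not an authoritative list of what cuBLAS accepts.

```python
from typing import Tuple

# Assumed minimum compute capability per reduced-precision dtype.
# Conservative illustration values; confirm against the cuBLAS documentation.
MIN_CAPABILITY = {
    "bfloat16": (8, 0),  # Ampere and newer
    "float16": (7, 0),   # Volta and newer
}


def pick_dtype_name(capability: Tuple[int, int]) -> str:
    """Return the preferred dtype name for a (major, minor) capability."""
    for name in ("bfloat16", "float16"):
        if capability >= MIN_CAPABILITY[name]:
            return name
    return "float32"  # widely supported fallback


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability()
        dtype = getattr(torch, pick_dtype_name(capability))
        # Cast the model and its inputs to the same dtype, for example:
        #   model = model.to(dtype); inputs = inputs.to(dtype)
        print(f"capability={capability}, chosen dtype={dtype}")
    else:
        print("No CUDA device available; nothing to select.")
```

Keeping the selection in one function means the model and its inputs are always cast to the same dtype, which avoids mixed-type matmuls reaching cuBLAS.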

Dead Ends

Common approaches that don't work:

  1. 65% fail: Changing the CUDA version. The CUDA version alone doesn't guarantee support; the GPU's compute capability (e.g., sm_70 vs. sm_80) determines which type combinations are valid.
  2. 90% fail: Changing the algorithm parameter (algo). It doesn't change data type compatibility; it only affects performance and precision for supported type combinations.
  3. 40% fail: Casting everything to float32. While float32 is widely supported, it may cause out-of-memory errors for large models or reduce performance if the original types were chosen for efficiency (e.g., half precision).
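
The memory cost of the float32 fallback can be estimated before trying it. This is a rough sketch that counts weights only (no activations, gradients, or optimizer state); weight_gib is a hypothetical helper and the 7-billion-parameter figure is an arbitrary example, not a value from this entry.

```python
# Bytes used per element for each dtype.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_gib(n_params: int, dtype_name: str) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * BYTES_PER_ELEMENT[dtype_name] / 2**30


if __name__ == "__main__":
    n_params = 7_000_000_000  # hypothetical model size
    for name in ("float16", "float32"):
        print(f"{name}: {weight_gib(n_params, name):.1f} GiB")
```

Since float32 uses twice the bytes per element of float16 or bfloat16, the weight footprint doubles, which is why the fallback can fail on a GPU that held the half-precision model comfortably.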