# RuntimeError: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &alpha, a, atype, lda, b, btype, ldb, &beta, c, ctype, ldc, compute_type, algo)

- **ID:** `cuda/cublas-gemm-params-unsupported-combination`
- **Domain:** cuda
- **Category:** runtime_error
- **Error Code:** `CUBLAS_STATUS_NOT_SUPPORTED`
- **Verification:** ai_generated
- **Fix Rate:** 82%

## Root Cause

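The combination of matrix data types (`atype`, `btype` for the inputs, `ctype` for the output) and `compute_type` passed to `cublasGemmEx` is not supported by the cuBLAS library on the current GPU architecture. Support depends on the GPU's compute capability as well as on the cuBLAS version.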
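To make "unsupported combination" concrete, the sketch below checks a call's type arguments against a small lookup table. The table is a partial, illustrative subset of the combinations believed to be listed in the cuBLAS documentation for `cublasGemmEx`; it is not exhaustive, it ignores per-architecture restrictions, and the names `KNOWN_COMBINATIONS` and `check_combination` are hypothetical. Verify any result against the support table for your cuBLAS version.

```python
# Partial subset of (A/B type, C type, compute type) combinations.
# Assumption: these entries match the cuBLAS docs; the list is NOT exhaustive.
KNOWN_COMBINATIONS = {
    ("CUDA_R_16F", "CUDA_R_16F", "CUBLAS_COMPUTE_16F"),
    ("CUDA_R_16F", "CUDA_R_16F", "CUBLAS_COMPUTE_32F"),
    ("CUDA_R_16F", "CUDA_R_32F", "CUBLAS_COMPUTE_32F"),
    ("CUDA_R_16BF", "CUDA_R_16BF", "CUBLAS_COMPUTE_32F"),
    ("CUDA_R_16BF", "CUDA_R_32F", "CUBLAS_COMPUTE_32F"),
    ("CUDA_R_32F", "CUDA_R_32F", "CUBLAS_COMPUTE_32F"),
    ("CUDA_R_64F", "CUDA_R_64F", "CUBLAS_COMPUTE_64F"),
}


def check_combination(atype, btype, ctype, compute_type):
    """Classify a cublasGemmEx type combination against the partial table."""
    if atype != btype:
        # This sketch only models calls where both inputs share one type.
        return "mismatch: atype and btype differ"
    if (atype, ctype, compute_type) in KNOWN_COMBINATIONS:
        return "listed"
    # Absence from a partial list is a hint to check the docs, not proof.
    return "not in this partial list"


# bfloat16 inputs accumulate in float32 in the listed combinations.
print(check_combination("CUDA_R_16BF", "CUDA_R_16BF", "CUDA_R_16BF", "CUBLAS_COMPUTE_32F"))
# The same inputs with a float16 compute type are not in the table.
print(check_combination("CUDA_R_16BF", "CUDA_R_16BF", "CUDA_R_16BF", "CUBLAS_COMPUTE_16F"))
```

A combination that is listed can still fail on a GPU whose compute capability lacks the required hardware support, so this check narrows the search rather than replacing a test on the target device.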

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| CUDA 11.8 | active | — | — |
| CUDA 12.1 | active | — | — |
| cuBLAS 11.11 | active | — | — |
| cuBLAS 12.0 | active | — | — |

## Workarounds

1. **Check bfloat16 support with `torch.cuda.is_bf16_supported()` before using it, and fall back to float16 when it is unavailable.** (85% success)
   ```
   import torch

   # Assumes `model` is an existing torch.nn.Module on a CUDA device.
   if torch.cuda.is_bf16_supported():
       model = model.to(torch.bfloat16)
   else:
       model = model.to(torch.float16)
   ```
2. **Make the compute type match the input types. In PyTorch, cast both operands to one supported dtype: float32 is the safest choice, float16 is broadly supported, and bfloat16 generally requires Ampere (sm_80) or newer. Note that `torch.set_default_dtype(torch.float32)` only affects newly created tensors, not existing ones.** (78% success)
   ```
   import torch

   # Assumes `a` and `b` are existing CUDA tensors whose dtypes triggered the error.
   a = a.to(torch.float16)
   b = b.to(torch.float16)
   out = a @ b
   ```
3. **Set the environment variable `CUBLAS_WORKSPACE_CONFIG=:4096:8` before the process starts. This does not disable cuBLAS or switch to a custom kernel; it fixes the workspace buffers cuBLAS allocates, which may change the internal code path it selects. It does not change which type combinations are supported, so treat it as a last resort.** (70% success)
   ```
   export CUBLAS_WORKSPACE_CONFIG=:4096:8
   ```
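
The first two workarounds can be folded into one dtype-selection step. The helper below is a minimal sketch: `pick_matmul_dtype` is a hypothetical name, and it takes plain booleans so the decision logic can be exercised without a GPU. In a real PyTorch program the inputs would come from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`, as the trailing comments show.

```python
def pick_matmul_dtype(cuda_available, bf16_supported):
    """Return the name of the dtype to cast both matmul operands to."""
    if not cuda_available:
        return "float32"   # no GPU: stay in full precision
    if bf16_supported:
        return "bfloat16"  # preferred when the device reports support
    return "float16"       # fallback from workaround 1


# Hypothetical wiring with PyTorch installed (not executed here):
#   import torch
#   cuda = torch.cuda.is_available()
#   bf16 = cuda and torch.cuda.is_bf16_supported()
#   dtype = getattr(torch, pick_matmul_dtype(cuda, bf16))
#   a, b = a.to(dtype), b.to(dtype)   # same dtype on both operands
print(pick_matmul_dtype(True, False))
```

Checking CUDA availability first keeps the bfloat16 query from running on a machine without a usable device.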

## Dead Ends

- **Upgrading the CUDA version alone**: The CUDA version by itself doesn't guarantee support; the GPU's compute capability (e.g., sm_70 vs sm_80) determines which type combinations are valid. (65% fail)
- **Changing the `algo` argument**: The algorithm parameter doesn't change data type compatibility; it only affects performance and precision for type combinations that are already supported. (90% fail)
- **Casting everything to float32**: float32 is widely supported, but it doubles memory use relative to half precision, so it may cause out-of-memory errors for large models or reduce performance if the original types were chosen for speed. (40% fail)
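
The memory cost of the float32 fallback can be estimated with simple arithmetic: bytes per element times the number of parameters. The sketch below uses a hypothetical 7-billion-parameter model and counts weights only; activations, gradients, and optimizer state add to this and are not modeled.

```python
# Storage size per element for the dtypes discussed in this entry.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_bytes(num_params, dtype):
    """Bytes needed to store num_params weights in the given dtype."""
    return num_params * BYTES_PER_ELEMENT[dtype]


params = 7_000_000_000  # hypothetical model size, for illustration only
for dtype in ("float16", "float32"):
    gib = weight_bytes(params, dtype) / 2**30
    print(f"{dtype}: {gib:.1f} GiB")
```

For this example the weights grow from about 13 GiB in float16 to about 26 GiB in float32, which is enough to exceed the memory of many single GPUs.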
