# RuntimeError: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, batchCount )

- **ID:** `cuda/cublas-gemm-broadcast-dimension-mismatch`
- **Domain:** cuda
- **Category:** runtime_error
- **Error Code:** `CUBLAS_STATUS_INVALID_VALUE`
- **Verification:** ai_generated
- **Fix Rate:** 80%

## Root Cause

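The GEMM dimensions (m, n, k) derived from the tensor shapes are incompatible or invalid. This often happens when a batch broadcast produces a zero-sized dimension, or when a leading dimension (lda/ldb/ldc) is smaller than the row count of the matrix it describes; cuBLAS requires each leading dimension to be at least `max(1, rows)`.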
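
The argument checks behind this status code can be sketched in a few lines. The function below is a hypothetical helper (the name `gemm_args_valid` and its signature are not part of cuBLAS); it only mirrors the constraints described in the cuBLAS GEMM documentation for column-major matrices and is not the library's actual validation code.

```python
def gemm_args_valid(trans_a, trans_b, m, n, k, lda, ldb, ldc, batch_count=1):
    """Return True if the arguments satisfy the documented GEMM constraints.

    Illustrative sketch only: assumes column-major storage, where a leading
    dimension must cover the number of rows of the matrix as stored.
    """
    # Sizes and batch count must not be negative.
    if min(m, n, k, batch_count) < 0:
        return False
    # Rows of A and B as stored in memory depend on the transpose flags.
    rows_a = k if trans_a else m
    rows_b = n if trans_b else k
    # Each leading dimension must be at least max(1, rows).
    return (
        lda >= max(1, rows_a)
        and ldb >= max(1, rows_b)
        and ldc >= max(1, m)
    )
```

Under these assumptions, a zero-sized `m` paired with `lda = 0` fails the check even though `m = 0` alone does not, which is one way an empty tensor can surface as `CUBLAS_STATUS_INVALID_VALUE`.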

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| CUDA 11.8 | active | — | — |
| CUDA 12.1 | active | — | — |
| cuBLAS 11.11 | active | — | — |
| PyTorch 2.0.1 | active | — | — |

## Workarounds

1. **Print and verify the shapes of all tensors passed to the GEMM operation before calling the matmul. For `torch.matmul(A, B)`, the inner dimensions must agree (`A.shape[-1] == B.shape[-2]`) and no dimension may be zero.** (85% success)
   ```
   # A and B are the tensors about to be multiplied
   print(A.shape, B.shape)
   assert A.shape[-1] == B.shape[-2]
   assert all(d > 0 for d in A.shape + B.shape)
   ```
2. **If using batched operations with broadcasting, explicitly expand the smaller tensor to the full batch shape with `torch.broadcast_to` or `unsqueeze` + `expand` before the matmul, so that inconsistent batch dimensions fail in PyTorch with a readable message rather than inside cuBLAS.** (75% success)
   ```
   # A and B are the tensors about to be multiplied (at least 2-D each)
   batch = torch.broadcast_shapes(A.shape[:-2], B.shape[:-2])  # raises on mismatch
   A = A.expand(*batch, *A.shape[-2:])
   B = B.expand(*batch, *B.shape[-2:])
   C = torch.matmul(A, B)
   ```
3. **Enable cuBLAS logging by setting the environment variables `CUBLAS_LOGINFO_DBG=1` and `CUBLAS_LOGDEST_DBG` (`stdout`, `stderr`, or a file name) before the process starts. Capture the exact GEMM parameters (m, n, k, lda, etc.) being passed and cross-check them against the tensor shapes.** (70% success)
   ```
   export CUBLAS_LOGINFO_DBG=1
   # The log file name is an arbitrary example
   export CUBLAS_LOGDEST_DBG=cublas.log
   ```
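
The shape checks from the first two workarounds can be combined into one pre-flight function. The sketch below is a hypothetical helper (`check_matmul_shapes` is not a PyTorch API); it works on plain shape tuples, assumes both operands have at least two dimensions, and deliberately rejects zero-sized dimensions even though PyTorch itself allows them.

```python
from itertools import zip_longest


def check_matmul_shapes(a_shape, b_shape):
    """Return the output shape of a batched matmul, or raise ValueError.

    Hypothetical helper: expects shapes like (..., m, k) and (..., k, n).
    """
    a_shape, b_shape = tuple(a_shape), tuple(b_shape)
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError(f"expected at least 2 dims: {a_shape}, {b_shape}")
    # Reject zero-sized (or negative) dimensions up front.
    if any(d <= 0 for d in a_shape + b_shape):
        raise ValueError(f"non-positive dimension: {a_shape}, {b_shape}")
    # Inner dimensions must agree: A is (..., m, k), B is (..., k, n).
    if a_shape[-1] != b_shape[-2]:
        raise ValueError(f"inner dims differ: {a_shape[-1]} vs {b_shape[-2]}")
    # Broadcast batch dimensions right-aligned; missing dims count as 1.
    batch = []
    pairs = zip_longest(
        reversed(a_shape[:-2]), reversed(b_shape[:-2]), fillvalue=1
    )
    for da, db in pairs:
        if da != db and 1 not in (da, db):
            raise ValueError(f"batch dims not broadcastable: {da} vs {db}")
        batch.append(max(da, db))
    return tuple(reversed(batch)) + (a_shape[-2], b_shape[-1])
```

It could be called as `check_matmul_shapes(tuple(A.shape), tuple(B.shape))` immediately before the matmul, so that a bad shape raises a `ValueError` naming the offending dimensions instead of reaching cuBLAS.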

## Dead Ends

- **Restarting the kernel or clearing CUDA cache** — The error is a dimension validation failure, not a memory or state issue; restarting does not fix the invalid tensor shapes. (95% fail)
- **Increasing batch size to avoid zero-sized batches** — The error is not about batch size being zero per se, but about a mismatch in m/n/k derived from batched tensor broadcasting; arbitrary batch size changes can mask the real shape bug. (80% fail)
- **Downgrading cuBLAS to an older version** — The dimension validation is consistent across cuBLAS versions; older versions may have the same check or even stricter checks. (90% fail)
