# RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatchedEx with invalid broadcast dimensions

- **ID:** `cuda/cublas-gemm-broadcast-invalid-config`
- **Domain:** cuda
- **Category:** runtime_error
- **Error Code:** `CUBLAS_STATUS_INVALID_VALUE`
- **Verification:** ai_generated
- **Fix Rate:** 82%

## Root Cause

cuBLAS batched GEMM requires strictly matching batch dimensions for strided inputs; implicit broadcasting is not supported, causing invalid value error when batch counts differ between A and B.

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| CUDA 11.8 | active | — | — |
| CUDA 12.1 | active | — | — |
| cuBLAS 11.11.3.6 | active | — | — |
| PyTorch 2.1.0 | active | — | — |

## Workarounds

1. **Explicitly expand the smaller batch dimension to match using torch.broadcast_to or .expand before calling the batched GEMM. Example: if A has shape (batch_a, m, k) and B has shape (batch_b, k, n) with batch_a != batch_b, expand A to (max(batch_a,batch_b), m, k) using A = A.expand(max_batch, -1, -1).** (88% success)
   ```
   Explicitly expand the smaller batch dimension to match using torch.broadcast_to or .expand before calling the batched GEMM. Example: if A has shape (batch_a, m, k) and B has shape (batch_b, k, n) with batch_a != batch_b, expand A to (max(batch_a,batch_b), m, k) using A = A.expand(max_batch, -1, -1).
   ```
2. **Use torch.bmm instead of torch.baddbmm with explicit broadcasting via unsqueeze: C = torch.bmm(A.unsqueeze(1).expand(-1, B.size(0), -1, -1).reshape(-1, m, k), B.unsqueeze(0).expand(A.size(0), -1, -1, -1).reshape(-1, k, n)).** (80% success)
   ```
   Use torch.bmm instead of torch.baddbmm with explicit broadcasting via unsqueeze: C = torch.bmm(A.unsqueeze(1).expand(-1, B.size(0), -1, -1).reshape(-1, m, k), B.unsqueeze(0).expand(A.size(0), -1, -1, -1).reshape(-1, k, n)).
   ```

## Dead Ends

- **** — The error is not about precision but about batch dimension mismatch; casting does not change the shape. (95% fail)
- **** — Out-of-memory is not the root cause; the kernel fails validation before execution. (98% fail)