# Runtime error: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatchedEx, caused by invalid broadcast dimensions

- **ID:** `cuda/cublas-gemm-broadcast-invalid-config`
- **Domain:** cuda
- **Category:** runtime_error
- **Error code:** `CUBLAS_STATUS_INVALID_VALUE`
- **Verification level:** ai_generated
- **Fix rate:** 82%

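## Root Cause

cuBLAS batched GEMM requires the batch dimensions of its strided inputs to match exactly. Implicit broadcasting is not supported, so the call fails with an invalid-value error when A and B have different batch counts.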
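The rule above can be sketched as a pre-flight shape check that runs before any batched GEMM is launched. This is a minimal illustration in plain Python, not part of cuBLAS or PyTorch; the helper name `check_batched_gemm_shapes` is hypothetical, and it assumes A is laid out as `(batch, m, k)` and B as `(batch, k, n)`.

```python
def check_batched_gemm_shapes(a_shape, b_shape):
    """Return the output shape of a strided batched GEMM, or raise ValueError.

    Hypothetical helper: mirrors the strict rule described above, where
    A is (batch, m, k) and B is (batch, k, n) with identical batch counts.
    """
    if len(a_shape) != 3 or len(b_shape) != 3:
        raise ValueError(f"expected 3-D shapes, got {a_shape} and {b_shape}")
    batch_a, m, k_a = a_shape
    batch_b, k_b, n = b_shape
    if k_a != k_b:
        raise ValueError(f"inner dimensions differ: {k_a} vs {k_b}")
    if batch_a != batch_b:
        # No implicit broadcasting: even a batch count of 1 is rejected here.
        raise ValueError(f"batch counts differ: {batch_a} vs {batch_b}")
    return (batch_a, m, n)
```

Calling a check like this on `tuple(A.shape)` and `tuple(B.shape)` just before the GEMM turns an opaque cuBLAS status code into a message that names the mismatched dimension.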

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| CUDA 11.8 | active | — | — |
| CUDA 12.1 | active | — | — |
| cuBLAS 11.11.3.6 | active | — | — |
| PyTorch 2.1.0 | active | — | — |

## Solutions

1. Explicitly expand the smaller batch dimension to match, using `torch.broadcast_to` or `.expand`, before calling the batched GEMM. Example: if A has shape `(batch_a, m, k)` and B has shape `(batch_b, k, n)` with `batch_a != batch_b`, expand A to `(max_batch, m, k)`, where `max_batch = max(batch_a, batch_b)`, using `A = A.expand(max_batch, -1, -1)`. Note that `.expand` can only enlarge a dimension of size 1, so this applies when the smaller batch count is 1.
2. Use `torch.bmm` instead of `torch.baddbmm`, with explicit broadcasting via `unsqueeze`: `C = torch.bmm(A.unsqueeze(1).expand(-1, B.size(0), -1, -1).reshape(-1, m, k), B.unsqueeze(0).expand(A.size(0), -1, -1, -1).reshape(-1, k, n))`. This multiplies every `A[i]` with every `B[j]`, so the result has `A.size(0) * B.size(0)` batch entries.
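The two approaches above can be sketched as a short runnable example. The sizes (`batch`, `m`, `k`, `n`) and tensor names are assumptions chosen for illustration, and the snippet runs on CPU so the shapes can be checked without a GPU.

```python
import torch

# Assumed example sizes; any positive values work.
batch, m, k, n = 4, 2, 3, 5

# Approach 1: A has a single batch entry that is reused for every B[i].
A = torch.randn(1, m, k)
B = torch.randn(batch, k, n)
A_expanded = A.expand(batch, -1, -1)  # a view, no copy; needs A.size(0) == 1
C = torch.bmm(A_expanded, B)          # shape (batch, m, n)

# Approach 2: pair every A2[i] with every B2[j] (an all-pairs product).
A2 = torch.randn(3, m, k)
B2 = torch.randn(batch, k, n)
lhs = A2.unsqueeze(1).expand(-1, B2.size(0), -1, -1).reshape(-1, m, k)
rhs = B2.unsqueeze(0).expand(A2.size(0), -1, -1, -1).reshape(-1, k, n)
# Regroup the flat batch so that pairs[i, j] == A2[i] @ B2[j].
pairs = torch.bmm(lhs, rhs).view(A2.size(0), B2.size(0), m, n)
```

The first approach keeps memory flat because `.expand` returns a view, while the `reshape` calls in the second approach materialize copies of size `A2.size(0) * B2.size(0)`, which is worth keeping in mind for large batches.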

## Failed Attempts

- Casting the tensors does not help: the error is not about precision but about a batch dimension mismatch, and casting does not change the shape. (95% failure rate)
- Treating the failure as out-of-memory does not help: out-of-memory is not the root cause, and the kernel fails validation before execution. (98% failure rate)
