# Runtime error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatched

- **ID:** `cuda/cublas-gemm-broadcast-dimension-mismatch`
- **Domain:** cuda
- **Category:** runtime_error
- **Error code:** `CUBLAS_STATUS_INVALID_VALUE`
- **Verification level:** ai_generated
- **Fix rate:** 80%

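## Root cause

The GEMM dimensions (m, n, k) derived from the tensor shapes are incompatible or non-positive, usually because batch broadcasting produced a zero-sized dimension or conflicting leading dimensions (lda/ldb/ldc).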
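
To make the relationship between tensor shapes and GEMM dimensions concrete, here is a minimal sketch that derives the batch shape and (m, n, k) from two shape tuples and raises on the conditions named above. `gemm_dims` is a hypothetical helper, not a PyTorch or cuBLAS function. It assumes both operands have at least two dimensions, follows NumPy/PyTorch-style broadcasting for the batch dimensions, and ignores transposition and leading-dimension (lda/ldb/ldc) details.

```python
from itertools import zip_longest


def gemm_dims(a_shape, b_shape):
    """Hypothetical helper: derive (batch_shape, m, n, k) for A @ B from shape tuples."""
    a_shape, b_shape = tuple(a_shape), tuple(b_shape)
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("this sketch only handles operands with >= 2 dimensions")

    m, k = a_shape[-2], a_shape[-1]
    k_b, n = b_shape[-2], b_shape[-1]
    if k != k_b:
        raise ValueError(f"inner dimensions differ: A has k={k}, B has k={k_b}")

    # Broadcast the leading (batch) dimensions right-to-left.
    batch = []
    for da, db in zip_longest(reversed(a_shape[:-2]), reversed(b_shape[:-2]), fillvalue=1):
        if da != db and 1 not in (da, db):
            raise ValueError(f"batch dimensions {da} and {db} cannot be broadcast")
        batch.append(db if da == 1 else da)
    batch_shape = tuple(reversed(batch))

    # Zero-sized dimensions are the case the root cause singles out.
    if any(d <= 0 for d in (m, n, k, *batch_shape)):
        raise ValueError(
            f"non-positive GEMM dimension: batch={batch_shape}, m={m}, n={n}, k={k}"
        )
    return batch_shape, m, n, k
```

With PyTorch tensors this could be called as `gemm_dims(A.shape, B.shape)` just before the multiplication, so that a bad shape surfaces as a readable `ValueError` rather than a cuBLAS status code.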

## Version compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| CUDA 11.8 | active | — | — |
| CUDA 12.1 | active | — | — |
| cuBLAS 11.11 | active | — | — |
| PyTorch 2.0.1 | active | — | — |

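## Solutions

1. Print and validate the shapes of all tensors passed to the GEMM operation before calling the matrix multiplication. Make sure the contracted dimensions of the two inputs match (for `torch.matmul(A, B)`, `A.shape[-1] == B.shape[-2]`) and that no dimension is zero. Example: `print(A.shape, B.shape); assert A.shape[-1] == B.shape[-2] and all(d > 0 for d in A.shape + B.shape)`.
2. If the batched operation relies on broadcasting, explicitly expand the smaller tensor to the matching batch dimensions with `torch.broadcast_to` or `unsqueeze` + `expand` before the multiplication, so that all batch dimensions agree.
3. Enable cuBLAS logging by setting the environment variables `CUBLAS_LOGINFO_DBG=1` and `CUBLAS_LOGDEST_DBG=stdout`, capture the exact GEMM parameters that were passed (m, n, k, lda, and so on), and cross-check them against the tensor shapes.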
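
One way to apply the logging step from inside a Python script is sketched below. It assumes the variable names given in the cuBLAS documentation, `CUBLAS_LOGINFO_DBG` and `CUBLAS_LOGDEST_DBG`, and assumes they must be set before cuBLAS is first initialized in the process. `enable_cublas_logging` is a hypothetical helper name; check the variable names against the documentation for your cuBLAS version.

```python
import os


def enable_cublas_logging(dest="stdout"):
    """Hypothetical helper: switch on cuBLAS API logging for this process."""
    # Variable names as listed in the cuBLAS documentation; verify for your version.
    os.environ["CUBLAS_LOGINFO_DBG"] = "1"
    # Destination may be "stdout", "stderr", or a file path.
    os.environ["CUBLAS_LOGDEST_DBG"] = dest


# Assumption: the safest place for this call is the top of the script,
# ahead of `import torch`, so the variables exist before any cuBLAS call.
enable_cublas_logging()
```

Exporting the same two variables in the shell before launching the process is an equivalent alternative and avoids any dependence on import order.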

## Failed attempts

- **Restarting the kernel or clearing the CUDA cache**: The error is a dimension validation failure, not a memory or state issue; restarting does not fix the invalid tensor shapes. (95% failure rate)
- **Increasing the batch size to avoid zero-sized batches**: The error is not about the batch size being zero as such, but about a mismatch in m/n/k derived from batched tensor broadcasting; arbitrary batch size changes can mask the real shape bug. (80% failure rate)
- **Downgrading cuBLAS to an older version**: The dimension validation is consistent across cuBLAS versions; older versions may apply the same check or even stricter ones. (90% failure rate)