# Runtime error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmStridedBatchedEx with batch_count > 0 but incompatible A/B/C matrix dimensions

- **ID:** `cuda/cublas-gemm-batched-wrong-rank`
- **Domain:** cuda
- **Category:** runtime_error
- **Error code:** `CUBLAS_STATUS_INVALID_VALUE`
- **Verification level:** ai_generated
- **Fix rate:** 80%

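## Root cause

cuBLAS batched GEMM requires the leading dimensions (`lda`, `ldb`, `ldc`) and the strides of matrices A, B, and C to be consistent with the matrix dimensions and the batch count; a size mismatch makes the call return `CUBLAS_STATUS_INVALID_VALUE`.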
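As a rough illustration of the leading-dimension rule, the sketch below computes the smallest acceptable `lda`, `ldb`, and `ldc` for a column-major GEMM and checks caller-supplied values against them. The `Op` enum, `MinLeadingDims`, `min_leading_dims`, and `leading_dims_ok` are hypothetical names introduced here so the example builds without CUDA headers; `Op` only stands in for `cublasOperation_t` and is not part of cuBLAS.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for cublasOperation_t (assumption: only N and T are modeled).
enum class Op { N, T };

struct MinLeadingDims {
    std::int64_t lda;
    std::int64_t ldb;
    std::int64_t ldc;
};

// Column-major storage: a leading dimension must be at least the number of
// rows of the matrix as it is stored in memory, before op() is applied.
inline MinLeadingDims min_leading_dims(Op transa, Op transb,
                                       std::int64_t m, std::int64_t n, std::int64_t k) {
    const std::int64_t one = 1;
    MinLeadingDims d;
    d.lda = std::max(one, transa == Op::N ? m : k);  // A stored as m x k, or k x m if transposed
    d.ldb = std::max(one, transb == Op::N ? k : n);  // B stored as k x n, or n x k if transposed
    d.ldc = std::max(one, m);                        // C is always stored as m x n
    return d;
}

// Returns true when the supplied leading dimensions meet the minimums above.
inline bool leading_dims_ok(Op transa, Op transb,
                            std::int64_t m, std::int64_t n, std::int64_t k,
                            std::int64_t lda, std::int64_t ldb, std::int64_t ldc) {
    const MinLeadingDims d = min_leading_dims(transa, transb, m, n, k);
    return lda >= d.lda && ldb >= d.ldb && ldc >= d.ldc;
}
```

Running a check like this on the host before the cuBLAS call can help narrow down which argument is at fault, since the status code alone does not say.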

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| CUDA 11.7 | active | — | — |
| CUDA 12.0 | active | — | — |
| cuBLAS 11.10 | active | — | — |
| cuBLAS 12.0 | active | — | — |

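## Solutions

1. Check the leading dimensions against the column-major layout: with no transposition, `lda >= m`, `ldb >= k`, and `ldc >= m`. If A or B is transposed, the bound is the row count of the matrix as stored, i.e. `lda >= k` or `ldb >= n`. Then check that the strides, counted in elements, cover one full matrix so that consecutive batch entries do not overlap: `strideA >= lda*k`, `strideB >= ldb*n`, and `strideC >= ldc*n`, which reduce to `m*k`, `k*n`, and `m*n` for tightly packed matrices. Size the matrix allocations accordingly.
2. Use PyTorch's `torch.bmm` or `torch.matmul` with batched tensors instead of raw cuBLAS calls, as these handle dimension validation internally.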
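The stride and allocation check from the first solution can be sketched as follows. This is a minimal host-side example that assumes column-major storage, no transposition, and batch entries laid out back to back; `BatchLayout` and `packed_layout` are hypothetical helper names, not cuBLAS API.

```cpp
#include <cstdint>

struct BatchLayout {
    std::int64_t strideA, strideB, strideC;  // elements between consecutive matrices
    std::int64_t elemsA, elemsB, elemsC;     // total elements to allocate per operand
};

// Assumption: op(A) = A and op(B) = B, so A is m x k, B is k x n, C is m x n.
// A column-major matrix with leading dimension ld and c columns spans ld * c elements.
inline BatchLayout packed_layout(std::int64_t n, std::int64_t k,
                                 std::int64_t lda, std::int64_t ldb, std::int64_t ldc,
                                 std::int64_t batch_count) {
    BatchLayout layout;
    layout.strideA = lda * k;  // A has k columns
    layout.strideB = ldb * n;  // B has n columns
    layout.strideC = ldc * n;  // C has n columns
    // Back-to-back batches: each buffer holds batch_count full matrices.
    layout.elemsA = layout.strideA * batch_count;
    layout.elemsB = layout.strideB * batch_count;
    layout.elemsC = layout.strideC * batch_count;
    return layout;
}
```

For example, with `m = 4`, `n = 5`, `k = 6`, tightly packed leading dimensions (`lda = 4`, `ldb = 6`, `ldc = 4`), and three batches, the strides come out as 24, 30, and 20 elements. The row count `m` does not appear as a parameter because it enters only through `lda` and `ldc`, which must already satisfy `lda >= m` and `ldc >= m`.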

## Failed attempts

- Transposition changes the memory layout and may cause silent data corruption; the correct fix is to compute proper strides and leading dimensions. (90% failure rate)
- Bypassing the batched call avoids the error but loses the performance benefit of batching; the underlying dimension issue remains for actual batched use. (70% failure rate)
