CUBLAS_STATUS_INVALID_VALUE cuda runtime_error ai_generated true

RuntimeError: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, batchCount )

ID: cuda/cublas-gemm-broadcast-dimension-mismatch


Fix Rate: 80%

Confidence: 85%

Evidence: 1

First Seen: 2023-08-15

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
CUDA 11.8	active	—	—	—
CUDA 12.1	active	—	—	—
cuBLAS 11.11	active	—	—	—
PyTorch 2.0.1	active	—	—	—

Root Cause

The GEMM dimensions (m, n, k) derived from the tensor shapes are incompatible or non-positive. This is often caused by a batch broadcast that produces a zero-sized dimension, or by a leading dimension (lda/ldb/ldc) that is smaller than the number of rows of the matrix it describes.
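The constraints behind this error can be sketched as a small pure-Python check. This is a minimal illustration that assumes the usual column-major BLAS rules (non-negative sizes, each leading dimension at least `max(1, rows)`); the helper name `gemm_arg_errors` is hypothetical and is not part of cuBLAS or PyTorch, and the exact rules for a given cuBLAS version should be confirmed in its reference.

```python
def gemm_arg_errors(opa, opb, m, n, k, lda, ldb, ldc, batch_count=1):
    """Return a list of violated constraints; an empty list means none were found.

    Hypothetical helper: assumes the usual BLAS conventions, where opa/opb are
    "N" (no transpose) or "T"/"C" (transpose) and storage is column-major.
    """
    errors = []
    for name, value in (("m", m), ("n", n), ("k", k), ("batchCount", batch_count)):
        if value < 0:
            errors.append(f"{name}={value} is negative")
    # The leading dimension must cover the rows of the matrix as it is stored.
    rows_a = m if opa == "N" else k
    rows_b = k if opb == "N" else n
    for name, ld, rows in (("lda", lda, rows_a), ("ldb", ldb, rows_b), ("ldc", ldc, m)):
        if ld < max(1, rows):
            errors.append(f"{name}={ld} is smaller than max(1, {rows})")
    return errors
```

Passing the values from the failing call (for example, as read from a cuBLAS log) may show which argument is out of range.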


Official Documentation

https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-status

Workarounds

85% success: Print and verify the shapes of every tensor passed to the GEMM operation before calling the matmul. The inner dimensions of the two operands must match (for `torch.matmul(A, B)`, `A.shape[-1] == B.shape[-2]`), and no dimension should be zero. Example: `print(A.shape, B.shape); assert A.shape[-1] == B.shape[-2] and all(d > 0 for d in A.shape + B.shape)`.
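The shape check above can be wrapped in a reusable helper. The sketch below works on plain shape tuples, so it needs no GPU; `check_matmul_shapes` is a hypothetical name, and the sketch assumes both operands are at least 2-D.

```python
def check_matmul_shapes(a_shape, b_shape):
    """Validate operand shapes for A @ B and return (m, n, k).

    Hypothetical helper; assumes A is (..., m, k) and B is (..., k, n).
    """
    a_shape, b_shape = tuple(a_shape), tuple(b_shape)
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError(f"expected at least 2-D operands, got {a_shape} and {b_shape}")
    # A zero-sized dimension usually points to an upstream slicing or filtering bug.
    for name, shape in (("A", a_shape), ("B", b_shape)):
        if any(d <= 0 for d in shape):
            raise ValueError(f"{name} has a non-positive dimension: {shape}")
    # The inner dimensions must agree.
    if a_shape[-1] != b_shape[-2]:
        raise ValueError(
            f"inner dimensions differ: A{a_shape}[-1]={a_shape[-1]}, "
            f"B{b_shape}[-2]={b_shape[-2]}"
        )
    return a_shape[-2], b_shape[-1], a_shape[-1]
```

With PyTorch tensors, calling `check_matmul_shapes(A.shape, B.shape)` just before the matmul should work, since `torch.Size` behaves like a tuple.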
75% success: If the batched operation relies on broadcasting, explicitly expand the smaller tensor to the full batch shape with `torch.broadcast_to` or `unsqueeze` + `expand` before the matmul, so that all batch dimensions are consistent.
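One way to make the broadcast explicit is to compute the target shapes first and expand both operands to them. The sketch below is pure Python and only computes shapes; `broadcast_batch_targets` is a hypothetical helper that assumes standard right-aligned broadcasting over the batch dimensions and operands that are at least 2-D.

```python
def broadcast_batch_targets(a_shape, b_shape):
    """Return the fully expanded shapes for A and B with a shared batch shape.

    Hypothetical helper; the last two dimensions are the matrix dimensions and
    everything before them is treated as batch dimensions.
    """
    a_shape, b_shape = tuple(a_shape), tuple(b_shape)
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("both operands must be at least 2-D")
    a_batch, b_batch = a_shape[:-2], b_shape[:-2]
    rank = max(len(a_batch), len(b_batch))
    # Left-pad the shorter batch shape with 1s, as broadcasting does.
    a_pad = (1,) * (rank - len(a_batch)) + a_batch
    b_pad = (1,) * (rank - len(b_batch)) + b_batch
    batch = []
    for x, y in zip(a_pad, b_pad):
        if x != y and 1 not in (x, y):
            raise ValueError(f"batch dimensions {a_batch} and {b_batch} do not broadcast")
        batch.append(y if x == 1 else x)
    batch = tuple(batch)
    return batch + a_shape[-2:], batch + b_shape[-2:]
```

With PyTorch, the returned shapes could then be applied as `A.expand(a_target)` and `B.expand(b_target)`. Expanded tensors are views, so adding `.contiguous()` may be worth trying if the error persists.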
70% success: Set the environment variable `CUBLAS_LOGINFO_DBG=1` (and `CUBLAS_LOGDEST_DBG` to `stdout`, `stderr`, or a file name) to enable cuBLAS logging and capture the exact GEMM parameters (m, n, k, lda, etc.) being passed; cross-check these against the tensor shapes.
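Logging can also be switched on from inside a Python script, as long as the variables are set before the CUDA libraries are loaded. The sketch below uses the variable names given in the cuBLAS documentation, `CUBLAS_LOGINFO_DBG` and `CUBLAS_LOGDEST_DBG`; the helper name and the log file name are hypothetical.

```python
import os


def enable_cublas_logging(dest="stdout"):
    """Set the cuBLAS logging variables for this process.

    Hypothetical helper; `dest` may be "stdout", "stderr", or a file name.
    Call it before the first CUDA call so the library sees the settings.
    """
    os.environ["CUBLAS_LOGINFO_DBG"] = "1"
    os.environ["CUBLAS_LOGDEST_DBG"] = dest
    return {k: os.environ[k] for k in ("CUBLAS_LOGINFO_DBG", "CUBLAS_LOGDEST_DBG")}


enable_cublas_logging("cublas.log")  # hypothetical log file name
# Import the CUDA framework (for example, torch) only after this point.
```

The logged m, n, k and leading-dimension values can then be compared with the tensor shapes printed just before the failing matmul.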


Dead Ends

Common approaches that don't work:

Restarting the kernel or clearing the CUDA cache (95% fail)
The error is a dimension validation failure, not a memory or state issue; restarting does not fix the invalid tensor shapes.

Increasing batch size to avoid zero-sized batches (80% fail)
The error is not about the batch size being zero per se, but about a mismatch in m/n/k derived from batched tensor broadcasting; arbitrary batch size changes can mask the real shape bug.

Downgrading cuBLAS to an older version (90% fail)
The dimension validation is consistent across cuBLAS versions; older versions may have the same check or even stricter ones.