CUBLAS_STATUS_INVALID_VALUE
cuda
runtime_error
ai_generated
true
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatchedEx with invalid broadcast dimensions
ID: cuda/cublas-gemm-broadcast-invalid-config
82% Fix Rate
85% Confidence
1 Evidence
2024-06-15 First Seen
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 11.8 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
| cuBLAS 11.11.3.6 | active | — | — | — |
| PyTorch 2.1.0 | active | — | — | — |
Root Cause
cuBLAS strided batched GEMM takes a single batch count that applies to both A and B, so the batch dimensions of the two inputs must match exactly. Implicit broadcasting is not supported: when the batch counts of A and B differ, the call fails with CUBLAS_STATUS_INVALID_VALUE.
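The shape rule described above can be checked before the GEMM is launched, which turns the opaque cuBLAS status into a readable message. The following is a minimal sketch in plain Python; `check_batched_gemm_shapes` is a hypothetical helper introduced here for illustration and is not part of cuBLAS or PyTorch.

```python
def check_batched_gemm_shapes(shape_a, shape_b):
    """Validate shapes for a batched GEMM C[i] = A[i] @ B[i].

    Hypothetical pre-flight check (not a cuBLAS or PyTorch API).
    shape_a is expected to be (batch, m, k), shape_b to be (batch, k, n).
    Returns (batch, m, k, n) when the shapes are compatible.
    """
    if len(shape_a) != 3 or len(shape_b) != 3:
        raise ValueError(f"expected 3-D shapes, got {shape_a} and {shape_b}")

    batch_a, m, k_a = shape_a
    batch_b, k_b, n = shape_b

    # The inner (contraction) dimensions must agree for a matrix product.
    if k_a != k_b:
        raise ValueError(f"inner dimension mismatch: {k_a} vs {k_b}")

    # No implicit broadcasting: both operands need the same batch count.
    if batch_a != batch_b:
        raise ValueError(
            f"batch mismatch: A has {batch_a}, B has {batch_b}; "
            "expand the smaller operand explicitly before the call"
        )

    return batch_a, m, k_a, n
```

Calling it with `tuple(A.shape)` and `tuple(B.shape)` right before the batched matmul is one way to surface the mismatch early.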
Official Documentation
https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-batched
Workarounds
- 88% success: Explicitly expand the smaller batch dimension to match, using torch.broadcast_to or .expand, before calling the batched GEMM. Example: if A has shape (batch_a, m, k) and B has shape (batch_b, k, n) with batch_a != batch_b, expand A to (max_batch, m, k), where max_batch = max(batch_a, batch_b), using A = A.expand(max_batch, -1, -1). Note that .expand can only grow a dimension of size 1, so this applies when the smaller batch dimension is 1.
- 80% success: Use torch.bmm instead of torch.baddbmm, with explicit broadcasting via unsqueeze: C = torch.bmm(A.unsqueeze(1).expand(-1, B.size(0), -1, -1).reshape(-1, m, k), B.unsqueeze(0).expand(A.size(0), -1, -1, -1).reshape(-1, k, n)). This pairs every matrix in A with every matrix in B, so C has shape (A.size(0) * B.size(0), m, n).
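The two workarounds can be sketched as small PyTorch helpers. This is an illustrative sketch under stated assumptions, not a definitive fix: the function names `bmm_with_expanded_batch` and `pairwise_bmm` are hypothetical, the inputs are assumed to be 3-D tensors on the same device with the same dtype, and the `.contiguous()` call is an assumption that a dense layout is the safest input for the batched kernel.

```python
import torch


def bmm_with_expanded_batch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Workaround 1 (hypothetical helper): equalize batch sizes, then bmm.

    a: (batch_a, m, k), b: (batch_b, k, n). Only a batch of size 1 can be
    expanded, because Tensor.expand cannot grow a dimension larger than 1.
    """
    batch_a, batch_b = a.size(0), b.size(0)
    if batch_a != batch_b:
        if batch_a == 1:
            a = a.expand(batch_b, -1, -1)
        elif batch_b == 1:
            b = b.expand(batch_a, -1, -1)
        else:
            raise ValueError(
                f"cannot expand batch sizes {batch_a} and {batch_b} to match"
            )
    # Assumption: materialize the expanded (stride-0) view as a dense tensor
    # before handing it to the batched matmul. This copies the data.
    return torch.bmm(a.contiguous(), b.contiguous())


def pairwise_bmm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Workaround 2 (hypothetical helper): multiply every a[i] with every b[j].

    Returns a tensor of shape (batch_a, batch_b, m, n).
    """
    batch_a, m, k = a.shape
    batch_b, _, n = b.shape
    # Repeat A along a new second axis and B along a new first axis, then
    # flatten both to (batch_a * batch_b, ...) so the batch counts agree.
    a_rep = a.unsqueeze(1).expand(-1, batch_b, -1, -1).reshape(-1, m, k)
    b_rep = b.unsqueeze(0).expand(batch_a, -1, -1, -1).reshape(-1, k, n)
    return torch.bmm(a_rep, b_rep).reshape(batch_a, batch_b, m, n)
```

The first helper fits the common case where one operand is shared across the batch. The second computes all pairs, so its output and intermediate copies grow with batch_a * batch_b, which may be more work than intended if a one-to-one pairing was the goal.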
Dead Ends
Common approaches that don't work:
- 95% fail: The error is not about precision but about batch dimension mismatch; casting does not change the shape.
- 98% fail: Out-of-memory is not the root cause; the kernel fails validation before execution.