CUSOLVER_STATUS_INTERNAL_ERROR cuda runtime_error ai_generated true

RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR when computing SVD of a singular matrix

ID: cuda/cusolver-internal-error-on-svd

76% Fix Rate
84% Confidence
1 Evidence
2025-03-12 First Seen

Version Compatibility

Version          Status   Introduced   Deprecated   Notes
CUDA 12.4        active
cuSolver 11.5.1  active
PyTorch 2.3.0    active

Root Cause

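cuSolver's SVD routines (gesvd or gesvdj) can fail internally when the input matrix is exactly singular or contains NaN/inf values. The failure is attributed to a buffer overflow or division by zero in the iterative solver, and PyTorch surfaces it as a RuntimeError carrying CUSOLVER_STATUS_INTERNAL_ERROR.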
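
Since NaN/inf inputs are one of the named triggers, a cheap check before the SVD call can turn the opaque cuSolver error into a readable one. The following is a minimal sketch, not part of PyTorch: the helper name assert_finite is hypothetical, and the code falls back to the CPU when no GPU is present.

```python
import torch


def assert_finite(A: torch.Tensor, name: str = "A") -> None:
    """Raise ValueError if A contains NaN or +/-inf (hypothetical helper)."""
    bad = ~torch.isfinite(A)
    if bad.any():
        raise ValueError(f"{name} contains {int(bad.sum())} non-finite value(s)")


# Use the GPU when available; the check itself is device-agnostic.
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(4, 4, device=device)

assert_finite(A)  # passes silently for finite input
U, S, Vh = torch.linalg.svd(A)
```

The check only covers the NaN/inf case; it does not detect a matrix that is finite but singular.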



Official Documentation

https://docs.nvidia.com/cuda/cusolver/index.html

Workarounds

  1. 85% success Regularize the matrix to remove exact singularities: add a small multiple of the identity before calling torch.linalg.svd. Example: A_reg = A + 1e-8 * torch.eye(A.size(0), device=A.device, dtype=A.dtype); U, S, Vh = torch.linalg.svd(A_reg). This form assumes A is square. Note that torch.linalg.svd returns Vh (the conjugate transpose of V), not V, and that in float32 a shift of 1e-8 is below machine epsilon (about 1.2e-7) relative to diagonal entries of order 1, so a larger value may be needed there.
  2. 78% success Use torch.linalg.lstsq instead of an explicit SVD when the goal is to solve a least-squares problem. On CPU, its drivers gelsy (the default), gelsd, and gelss handle rank-deficient matrices. Per the PyTorch documentation, the only driver available for CUDA inputs is gels, which assumes A is full rank, so a rank-deficient problem may need to be solved on the CPU.
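
The two workarounds above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive fix: regularized_svd and lstsq_cpu are hypothetical helper names, eps = 1e-6 is an arbitrary choice large enough to register in float32, and the least-squares solve is moved to the CPU because the rank-revealing gelsd driver is not available for CUDA inputs.

```python
import torch


def regularized_svd(A: torch.Tensor, eps: float = 1e-6):
    """SVD of A + eps * I (hypothetical helper).

    Works for rectangular A by using a rectangular identity. The shift
    moves each singular value by at most eps.
    """
    eye = torch.eye(A.shape[-2], A.shape[-1], dtype=A.dtype, device=A.device)
    return torch.linalg.svd(A + eps * eye, full_matrices=False)


def lstsq_cpu(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Minimum-norm least-squares solution via the CPU-only gelsd driver."""
    solution = torch.linalg.lstsq(A.cpu(), B.cpu(), driver="gelsd").solution
    return solution.to(A.device)


device = "cuda" if torch.cuda.is_available() else "cpu"

# A rank-1 (exactly singular) 3x3 matrix.
A = torch.zeros(3, 3, device=device)
A[0, 0] = 1.0

U, S, Vh = regularized_svd(A)  # note: Vh, not V

B = torch.tensor([[2.0], [0.0], [0.0]], device=device)
x = lstsq_cpu(A, B)  # solves min ||A x - B|| without forming the SVD explicitly
```

The regularized result is the SVD of a slightly different matrix, so singular values near eps should be read as zero rather than as meaningful quantities.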


Dead Ends

Common approaches that don't work:

  1. 60% fail

    This works but defeats the purpose of GPU acceleration; also, the error may still occur on CPU if the matrix is singular.

  2. 85% fail

    Singular matrices remain singular regardless of precision; the error is algorithmic, not numerical.