cudaErrorMpsConnectionFailed (805) cuda runtime_error ai_generated true

CUDA error: MPS client failed to connect (cudaErrorMpsConnectionFailed)

ID: cuda/cuda-error-mps-client-failed

Fix Rate: 82%
Confidence: 88%
Evidence: 1
First Seen: 2024-06-10

Version Compatibility

Version | Status | Introduced | Deprecated | Notes
CUDA 12.0 | active | - | - | -
CUDA 12.1 | active | - | - | -
CUDA 12.3 | active | - | - | -
NVIDIA Driver 535.129.03 | active | - | - | -
NVIDIA Driver 545.23.06 | active | - | - | -

Root Cause

The CUDA Multi-Process Service (MPS) control daemon is not running or has crashed, preventing a new MPS client from connecting to the shared GPU context.
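Before changing anything, it can help to confirm that this root cause applies. The following is a minimal sketch, not an NVIDIA tool: `mps_diagnose` is a hypothetical helper name, and it assumes a Linux host with `pgrep` available and the documented default pipe directory `/tmp/nvidia-mps`. It only reports whether a control daemon process is visible and whether the pipe directory the client would use exists.

```shell
#!/usr/bin/env bash
# Sketch only: mps_diagnose is a hypothetical helper, not an NVIDIA tool.
# It reports the two things a client needs: a running control daemon and
# a pipe directory that exists where the client will look for it.

mps_diagnose() {
    # Assumption: /tmp/nvidia-mps is the default when the variable is unset.
    local pipe_dir="${CUDA_MPS_PIPE_DIRECTORY:-/tmp/nvidia-mps}"
    local status=0

    echo "pipe directory: ${pipe_dir}"
    if [ -d "${pipe_dir}" ]; then
        echo "pipe directory exists: yes"
    else
        echo "pipe directory exists: no"
        status=1
    fi

    # -f matches against the full command line; the short process name is
    # truncated by the kernel, so matching the full name exactly would miss it.
    if pgrep -f nvidia-cuda-mps-control >/dev/null 2>&1; then
        echo "control daemon running: yes"
    else
        echo "control daemon running: no"
        status=1
    fi

    return "${status}"
}
```

Calling `mps_diagnose` in the same shell that launches the application shows the values the client will actually see. A non-zero return status means at least one of the two checks failed.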


Official Documentation

https://docs.nvidia.com/deploy/mps/index.html

Workarounds

  1. (90% success) Start the MPS control daemon before launching the application, and launch the application with the same pipe directory: `export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps; nvidia-cuda-mps-control -d`
  2. (95% success) Run without MPS by unsetting CUDA_MPS_PIPE_DIRECTORY and restarting the process: `unset CUDA_MPS_PIPE_DIRECTORY`. The client then falls back to the default pipe directory, so this bypasses MPS only if no daemon is listening there.
  3. (85% success) Check whether the MPS daemon is running and restart it: `ps aux | grep nvidia-cuda-mps-control; killall nvidia-cuda-mps-control; nvidia-cuda-mps-control -d`
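The workarounds above can be wrapped in small shell helpers. This is a sketch under stated assumptions rather than a recommended setup: `ensure_mps_daemon` and `run_without_mps` are hypothetical names, `pgrep` and `env -u` are assumed to be available, and `/tmp/nvidia-mps` is assumed as the pipe directory when none is set.

```shell
#!/usr/bin/env bash
# Sketch only: ensure_mps_daemon and run_without_mps are hypothetical helpers.

# Start the control daemon unless one is already running.
ensure_mps_daemon() {
    # Daemon and clients must agree on the pipe directory.
    export CUDA_MPS_PIPE_DIRECTORY="${CUDA_MPS_PIPE_DIRECTORY:-/tmp/nvidia-mps}"

    # -f matches the full command line of running processes.
    if pgrep -f nvidia-cuda-mps-control >/dev/null 2>&1; then
        echo "MPS control daemon already running"
        return 0
    fi

    if ! command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
        echo "nvidia-cuda-mps-control not found in PATH" >&2
        return 1
    fi

    # -d starts the control daemon in the background.
    if nvidia-cuda-mps-control -d; then
        echo "MPS control daemon started (pipe directory: ${CUDA_MPS_PIPE_DIRECTORY})"
    else
        echo "failed to start MPS control daemon" >&2
        return 1
    fi
}

# Run a single command with CUDA_MPS_PIPE_DIRECTORY removed from its environment.
run_without_mps() {
    env -u CUDA_MPS_PIPE_DIRECTORY "$@"
}
```

A launcher script might call `ensure_mps_daemon && ./my_app`, where `./my_app` stands in for the real application. `run_without_mps ./my_app` leaves the calling shell's environment untouched, which makes it easier to compare behavior with and without the variable.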

Dead Ends

Common approaches that don't work:

  1. (80% fail) The error is not caused by a missing or corrupt CUDA installation, but by a missing or unresponsive MPS daemon process.

  2. (70% fail) The CUDA_MPS_PIPE_DIRECTORY environment variable only changes the socket path; if no daemon is running at that path, the connection still fails.