cudaErrorMpsMaxPartitionSizeExceeded cuda resource_error ai_generated true

CUDA error: MPS server: maximum partition size exceeded (cudaErrorMpsMaxPartitionSizeExceeded)

ID: cuda/mps-max-partition-size-exceeded

Fix Rate: 75%
Confidence: 85%
Evidence: 1
First Seen: 2024-03-15

Version Compatibility

Version     Status   Introduced   Deprecated   Notes
CUDA 11.8   active
CUDA 12.1   active
CUDA 12.3   active

Root Cause

The CUDA Multi-Process Service (MPS) server has reached its configured maximum partition size, preventing new client connections or memory allocations.

Official Documentation

https://docs.nvidia.com/deploy/mps/index.html#topic_5_3

Workarounds

  1. 80% success: Restart the MPS daemon with a larger partition size (e.g., 40 GB) using nvidia-cuda-mps-control. As root, run:
       echo 'set_default_active_thread_percentage 100' | nvidia-cuda-mps-control
       echo 'set_default_partition_size 40000MB' | nvidia-cuda-mps-control
     Then restart the client processes.
  2. 75% success: Increase the partition size through an environment variable before starting the MPS server: export CUDA_MPS_PARTITION_SIZE=40000 (value in MB), then restart the MPS daemon with 'nvidia-cuda-mps-control -d'.
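
The restart in workaround 2 can be sketched as a small shell function. This is a minimal sketch, not a verified procedure: the function name restart_mps and the DRY_RUN switch are hypothetical helpers introduced here. The variable CUDA_MPS_PARTITION_SIZE is copied from this entry and has not been confirmed against NVIDIA's MPS documentation. Only the 'quit' control command and the '-d' flag are standard nvidia-cuda-mps-control usage.

```shell
#!/usr/bin/env bash
# Sketch only: restart the MPS control daemon so a changed setting is picked up.
# DRY_RUN=1 prints each step instead of executing it (hypothetical helper switch).

restart_mps() {
    local size_mb="${1:-40000}"   # example value from workaround 2, in MB

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ echo quit | nvidia-cuda-mps-control"
        echo "+ export CUDA_MPS_PARTITION_SIZE=${size_mb}"
        echo "+ nvidia-cuda-mps-control -d"
        return 0
    fi

    # Ask the running control daemon to exit; ignore failure if none is running.
    echo quit | nvidia-cuda-mps-control || true

    # Variable name as given in workaround 2; not confirmed against NVIDIA's docs.
    export CUDA_MPS_PARTITION_SIZE="${size_mb}"

    # Start a fresh control daemon in background (daemon) mode.
    nvidia-cuda-mps-control -d
}
```

Running `DRY_RUN=1 restart_mps 40000` prints the three steps without touching a live daemon. After a real restart, client processes still have to be restarted, as workaround 1 notes.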

Dead Ends

Common approaches that don't work:

  1. 70% fail: Rebooting the node resets MPS but loses all running jobs and doesn't fix the underlying configuration issue.

  2. 90% fail: Setting CUDA_MPS_PIPE_DIRECTORY to a temp path without restarting the MPS daemon has no effect on the partition size limit.