cudaErrorMpsMaxPartitionSizeExceeded cuda resource_error ai_generated true

CUDA error: MPS server: maximum partition size exceeded (cudaErrorMpsMaxPartitionSizeExceeded)

ID: cuda/mps-max-partition-size-exceeded

Fix Rate: 75%

Confidence: 85%

Evidence: 1

First Seen: 2024-03-15

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
CUDA 11.8	active	—	—	—
CUDA 12.1	active	—	—	—
CUDA 12.3	active	—	—	—

Root Cause

The CUDA Multi-Process Service (MPS) server has reached its configured maximum partition size, preventing new client connections or memory allocations.

Official Documentation

https://docs.nvidia.com/deploy/mps/index.html#topic_5_3

Workarounds

80% success: Restart the MPS daemon with a larger partition size (e.g., 40 GB) using nvidia-cuda-mps-control. As root, run the commands below, then restart the client processes.
```
echo 'set_default_active_thread_percentage 100' | nvidia-cuda-mps-control
echo 'set_default_partition_size 40000MB' | nvidia-cuda-mps-control
```
75% success: Increase the partition size through an environment variable (value in MB) before starting the MPS server, then restart the MPS daemon.
```
export CUDA_MPS_PARTITION_SIZE=40000
nvidia-cuda-mps-control -d
```
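
Both workarounds depend on the MPS control daemon being restarted, so a minimal sketch of that restart step may help. It uses only the `quit` command and the `-d` flag of nvidia-cuda-mps-control. The names `restart_mps`, `MPS_CTL`, and `DRY_RUN` are hypothetical helpers introduced here. The partition-size setting names quoted in the workarounds are taken from this entry as-is and are not exercised by the sketch; check them against the linked MPS documentation before relying on them.

```shell
#!/bin/sh
# Sketch only: restart the MPS control daemon so that settings exported
# beforehand are picked up. MPS_CTL, DRY_RUN and restart_mps are
# hypothetical names introduced for this example.

MPS_CTL="${MPS_CTL:-nvidia-cuda-mps-control}"

restart_mps() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: print the commands instead of touching the daemon.
        echo "echo quit | $MPS_CTL"
        echo "$MPS_CTL -d"
        return 0
    fi
    # Ask the running control daemon to exit; ignore failure if none is running.
    echo quit | "$MPS_CTL" || true
    # Start a fresh control daemon in background (daemon) mode.
    "$MPS_CTL" -d
}
```

Running `DRY_RUN=1 restart_mps` prints the two commands without touching the daemon. Running `restart_mps` as root performs the restart. Client processes still have to be restarted afterwards, as the workarounds note.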

Dead Ends

Common approaches that don't work:

70% fail
Rebooting the node resets MPS but loses all running jobs and doesn't fix the underlying configuration issue.
90% fail
Setting CUDA_MPS_PIPE_DIRECTORY to a temp path without restarting the MPS daemon has no effect on the partition size limit.