cudaErrorMpsMaxPartitionSizeExceeded
cuda
resource_error
ai_generated
true
CUDA error: MPS server: maximum partition size exceeded (cudaErrorMpsMaxPartitionSizeExceeded)
ID: cuda/mps-max-partition-size-exceeded
Fix Rate: 75%
Confidence: 85%
Evidence: 1
First Seen: 2024-03-15
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 11.8 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
| CUDA 12.3 | active | — | — | — |
Root Cause
The CUDA Multi-Process Service (MPS) server has reached its configured maximum partition size, preventing new client connections or memory allocations.
Official Documentation
https://docs.nvidia.com/deploy/mps/index.html#topic_5_3
Workarounds
-
80% success
Configure a larger partition size (e.g., 40 GB) on the MPS control daemon using nvidia-cuda-mps-control. As root, run: echo 'set_default_active_thread_percentage 100' | nvidia-cuda-mps-control; echo 'set_default_partition_size 40000MB' | nvidia-cuda-mps-control; then restart the client processes so they connect under the new defaults.
-
75% success
Increase the partition size via an environment variable before starting the MPS server: export CUDA_MPS_PARTITION_SIZE=40000 (value in MB), then start the MPS daemon again with 'nvidia-cuda-mps-control -d'.
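Both workarounds only take effect once the control daemon and its clients are running under the new settings. A minimal shell sketch of a clean stop-and-start is shown here. The helper names run and restart_mps and the variables DRY_RUN and MPS_THREAD_PCT are hypothetical, introduced only for illustration. The sketch uses only the quit command, the -d flag, and set_default_active_thread_percentage. The partition-size command and environment variable quoted in the workarounds are taken from this entry as-is and are left out; if they apply to your MPS version, they would be sent or exported in the same way.

```shell
#!/bin/sh
# Sketch: restart the MPS control daemon so that new defaults apply.
# Set DRY_RUN=1 to print the commands instead of executing them.

run() {
    # Hypothetical helper: print the command in dry-run mode, otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

restart_mps() {
    # Ask the running control daemon to shut down.
    run "echo quit | nvidia-cuda-mps-control"
    # Start a fresh control daemon in background (daemon) mode.
    run "nvidia-cuda-mps-control -d"
    # Default active thread percentage for servers started afterwards;
    # 100 is the value used in the workaround, overridable via MPS_THREAD_PCT.
    run "echo 'set_default_active_thread_percentage ${MPS_THREAD_PCT:-100}' | nvidia-cuda-mps-control"
}
```

To preview the commands without touching a running daemon, set DRY_RUN=1 and call restart_mps; run it as root (or as the user that owns the daemon) without DRY_RUN to apply them, then restart the client processes.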
Dead Ends
Common approaches that don't work:
-
70% fail
Rebooting the node resets MPS but loses all running jobs and doesn't fix the underlying configuration issue.
-
90% fail
Setting CUDA_MPS_PIPE_DIRECTORY to a temp path without restarting the MPS daemon has no effect on the partition size limit.