cudaErrorMpsMaxPartitionSizeExceeded
cuda
resource_error
ai_generated
true
CUDA error: MPS server: maximum partition size exceeded (cudaErrorMpsMaxPartitionSizeExceeded)
ID: cuda/mps-max-partition-size-exceeded
Fix rate: 75%
Confidence: 85%
Evidence count: 1
First seen: 2024-03-15
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 11.8 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
| CUDA 12.3 | active | — | — | — |
Root cause analysis
The CUDA Multi-Process Service (MPS) server has reached its configured maximum partition size, so it rejects new client connections or memory allocations.
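Before changing any limits, it can help to confirm that an MPS control daemon is actually running on the node. The following is a minimal sketch, not an official diagnostic: `mps_status` is a hypothetical helper name, and the check assumes `pgrep` is installed.

```shell
#!/bin/sh
# Report whether an MPS control daemon appears to be running on this node.
# mps_status is a hypothetical helper introduced for this example.
mps_status() {
    # pgrep -f matches against the full command line; the binary name is
    # longer than the 15 characters used for plain process-name matching.
    if command -v pgrep >/dev/null 2>&1 &&
        pgrep -f nvidia-cuda-mps-control >/dev/null 2>&1; then
        echo "MPS control daemon: running"
        return 0
    fi
    echo "MPS control daemon: not found"
    return 1
}
```

Call `mps_status` from a shell on the affected node. If the daemon is running, `echo get_server_list | nvidia-cuda-mps-control` prints the PIDs of the active MPS server instances.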
Official documentation
https://docs.nvidia.com/deploy/mps/index.html#topic_5_3
Solutions
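- Raise the partition size (e.g., to 40 GB) through nvidia-cuda-mps-control. As root, run: echo 'set_default_active_thread_percentage 100' | nvidia-cuda-mps-control; echo 'set_default_partition_size 40000MB' | nvidia-cuda-mps-control; then restart the client processes so they pick up the new limit. The set_default_partition_size command is reproduced as reported; check it against the control command list in the MPS documentation for your driver version before relying on it.
- Set the partition size through the environment before starting the MPS server: export CUDA_MPS_PARTITION_SIZE=40000 (value in MB), stop any running daemon, then start it again with 'nvidia-cuda-mps-control -d'. As with the command above, confirm that your MPS version recognizes this variable.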
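Both solutions depend on the MPS daemon being restarted cleanly. A minimal sketch of that restart step is shown here, assuming it runs as the user that owns the daemon. `restart_mps`, `run`, and the `MPS_DRY_RUN` switch are hypothetical names introduced for this example. The sketch leaves out any partition-size setting, since that name should first be confirmed against the MPS documentation.

```shell
#!/bin/sh
# Sketch: stop and restart the MPS control daemon.
# run, restart_mps, and MPS_DRY_RUN are hypothetical names for this example.

# Execute a command, or only print it when MPS_DRY_RUN=1.
run() {
    if [ "${MPS_DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_mps() {
    # Ask the running control daemon to shut down.
    run sh -c 'echo quit | nvidia-cuda-mps-control'
    # Start a fresh control daemon in daemon (background) mode.
    run nvidia-cuda-mps-control -d
}
```

Setting `MPS_DRY_RUN=1` before calling `restart_mps` prints the two commands without executing them, which is a low-risk way to review the sequence on a shared node. Client processes still need to be restarted afterwards.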
Ineffective attempts
Common approaches that do not work:
- (70% failure rate) Rebooting the node resets MPS, but it kills all running jobs and does not fix the underlying configuration issue.
- (90% failure rate) Setting CUDA_MPS_PIPE_DIRECTORY to a temporary path without restarting the MPS daemon has no effect on the partition size limit.