cudaErrorMpsMaxPartitionSizeExceeded cuda resource_error ai_generated true

CUDA error: MPS server: maximum partition size exceeded (cudaErrorMpsMaxPartitionSizeExceeded)

ID: cuda/mps-max-partition-size-exceeded

Fix rate: 75%

Confidence: 85%

Evidence count: 1

First seen: 2024-03-15

Version compatibility

Version	Status	Introduced	Deprecated	Notes
CUDA 11.8	active	—	—	—
CUDA 12.1	active	—	—	—
CUDA 12.3	active	—	—	—

Root cause analysis

The CUDA Multi-Process Service (MPS) server has reached its configured maximum partition size, so new client connections or memory allocations are rejected.

generic
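
Before changing any limit, it can help to see what the running control daemon reports. The sketch below is a minimal example, assuming the daemon is reachable through nvidia-cuda-mps-control on the current host. The helper names mps_query and mps_report are hypothetical, introduced only for illustration; get_server_list and get_default_active_thread_percentage are control commands described in the MPS documentation.

```shell
#!/usr/bin/env bash
# Sketch: print the state of a running MPS control daemon.
# mps_query and mps_report are illustrative helper names, not NVIDIA tools.

mps_query() {
    # Send a single command to the control daemon through its stdin interface.
    echo "$1" | nvidia-cuda-mps-control
}

mps_report() {
    # Bail out early when the control binary is not on PATH.
    if ! command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
        echo "nvidia-cuda-mps-control not found" >&2
        return 1
    fi
    echo "servers: $(mps_query get_server_list)"
    echo "default active thread percentage: $(mps_query get_default_active_thread_percentage)"
}
```

Calling mps_report as the user that owns the daemon prints the server PIDs and the current default thread percentage, which gives a baseline to compare against after reconfiguring.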

Official documentation

https://docs.nvidia.com/deploy/mps/index.html#topic_5_3

Solutions

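Reconfigure the MPS daemon with a larger partition size (for example, 40 GB) using nvidia-cuda-mps-control. As root, run:

    echo 'set_default_active_thread_percentage 100' | nvidia-cuda-mps-control
    echo 'set_default_partition_size 40000MB' | nvidia-cuda-mps-control

Then restart the client processes so they pick up the new defaults. set_default_active_thread_percentage is a documented control command; confirm that your MPS version also accepts set_default_partition_size, since it is not listed among the control commands in the MPS documentation linked above.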
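
The sequence can be sketched as a small script. This is a hedged illustration, not a verified procedure: run and restart_mps are hypothetical helper names, DRY_RUN is an illustrative switch that defaults to printing the commands instead of executing them, and set_default_partition_size is copied from this entry without having been verified against a real MPS installation. The quit command and the -d flag are the documented ways to stop and start the control daemon.

```shell
#!/usr/bin/env bash
# Sketch of the reconfiguration sequence. With DRY_RUN=1 (the default here)
# the commands are only printed; set DRY_RUN=0 on a real MPS host to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command line, or execute it when DRY_RUN=0.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        eval "$*"
    fi
}

restart_mps() {
    size_mb="$1"
    # Stop the control daemon together with its servers.
    run "echo quit | nvidia-cuda-mps-control"
    # Start a fresh control daemon in the background.
    run "nvidia-cuda-mps-control -d"
    run "echo 'set_default_active_thread_percentage 100' | nvidia-cuda-mps-control"
    # ASSUMPTION: command name and MB suffix are copied from this entry, unverified.
    run "echo 'set_default_partition_size ${size_mb}MB' | nvidia-cuda-mps-control"
}
```

Running restart_mps 40000 in dry-run mode shows the four commands in order, which makes it easy to review them before executing anything as root.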

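Alternatively, set the partition size through an environment variable before the MPS server starts: export CUDA_MPS_PARTITION_SIZE=40000 (value in MB), then start the daemon with 'nvidia-cuda-mps-control -d'. The daemon reads its environment only at startup, so stop a running daemon first with 'echo quit | nvidia-cuda-mps-control'. Confirm that your MPS version recognizes CUDA_MPS_PARTITION_SIZE, since it is not listed among the environment variables in the MPS documentation linked above.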
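
A minimal sketch of the environment-variable route, assuming decimal units as in the 40 GB to 40000 MB example above. gb_to_mb and mps_env_lines are hypothetical helper names, and CUDA_MPS_PARTITION_SIZE is taken from this entry without verification. The function only prints the lines so they can be inspected before use.

```shell
#!/usr/bin/env bash
# Sketch: derive the megabyte value and print the environment setup lines.

gb_to_mb() {
    # Decimal conversion, matching the 40 GB -> 40000 MB example in this entry.
    echo $(( $1 * 1000 ))
}

mps_env_lines() {
    size_mb="$(gb_to_mb "$1")"
    # ASSUMPTION: variable name copied from this entry, unverified.
    echo "export CUDA_MPS_PARTITION_SIZE=${size_mb}"
    # Start the control daemon in the background with that environment.
    echo "nvidia-cuda-mps-control -d"
}
```

On an MPS host, the output of mps_env_lines 40 could be pasted into the shell that launches the daemon, after the previous daemon has been stopped.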

Ineffective attempts

Common but ineffective approaches:

70% failure rate
Rebooting the node resets MPS, but it kills all running jobs and does not fix the underlying configuration issue.
90% failure rate
Setting CUDA_MPS_PIPE_DIRECTORY to a temporary path without restarting the MPS daemon has no effect on the partition size limit.
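
One way to see why exporting a variable in a new shell does not reach an already running daemon is to read the environment the daemon was started with. The sketch below assumes a Linux host, where /proc/<pid>/environ holds a NUL-separated snapshot taken at process start; env_of is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: read one variable from a process environment snapshot.

env_of() {
    # $1: path to a NUL-separated environ file, e.g. /proc/<pid>/environ
    # $2: name of the variable to look up
    # Prints the value, or nothing when the variable is absent.
    tr '\0' '\n' < "$1" | sed -n "s/^$2=//p"
}
```

For example, env_of /proc/<pid>/environ CUDA_MPS_PIPE_DIRECTORY, with <pid> replaced by the PID of the control daemon, prints the pipe directory the daemon actually uses. Reading another user's environ file typically requires root.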