# CUDA error: MPS server: maximum partition size exceeded (cudaErrorMpsMaxPartitionSizeExceeded)

- **ID:** `cuda/mps-max-partition-size-exceeded`
- **Domain:** cuda
- **Category:** resource_error
- **Error code:** `cudaErrorMpsMaxPartitionSizeExceeded`
- **Verification level:** ai_generated
- **Fix rate:** 75%

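## Root cause

The CUDA Multi-Process Service (MPS) server has reached its configured maximum partition size, which blocks new client connections or memory allocations.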

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| CUDA 11.8 | active | — | — |
| CUDA 12.1 | active | — | — |
| CUDA 12.3 | active | — | — |

## Solutions

1. Set a larger partition size (for example, 40 GB) on the MPS daemon through `nvidia-cuda-mps-control`. As root, run the commands below, then restart the client processes. If the control utility rejects a command, check the list of supported commands in `man nvidia-cuda-mps-control` for your driver version.

   ```
   echo 'set_default_active_thread_percentage 100' | nvidia-cuda-mps-control
   echo 'set_default_partition_size 40000MB' | nvidia-cuda-mps-control
   ```
2. Alternatively, raise the partition size through an environment variable before the MPS server starts (the value is in MB), then restart the MPS daemon.

   ```
   export CUDA_MPS_PARTITION_SIZE=40000
   nvidia-cuda-mps-control -d
   ```
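
Both fixes depend on restarting the MPS control daemon, which means stopping the running instance first. A minimal sketch of that stop-and-start sequence follows, assuming `nvidia-cuda-mps-control` is on `PATH` and the caller has the privileges to manage the daemon. The `run` and `restart_mps` helpers and the `DRY_RUN` switch are hypothetical names introduced for illustration; with `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch only: stop and restart the MPS control daemon.
# DRY_RUN, run and restart_mps are illustrative names, not part of MPS.
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_mps() {
    # 'quit' asks the running control daemon to shut down.
    run sh -c 'echo quit | nvidia-cuda-mps-control'
    # '-d' starts a new control daemon in the background.
    run nvidia-cuda-mps-control -d
}
```

Calling `restart_mps` previews the sequence; rerunning with `DRY_RUN=0` executes it. Any environment variables the daemon should pick up need to be exported before the `-d` step.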

## Ineffective attempts

- Rebooting the node resets MPS, but it kills all running jobs and leaves the underlying configuration issue unfixed. (70% failure rate)
- Pointing `CUDA_MPS_PIPE_DIRECTORY` at a temporary path without restarting the MPS daemon has no effect on the partition size limit. (90% failure rate)
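
The pipe-directory attempt fails because `CUDA_MPS_PIPE_DIRECTORY` only selects the directory a client uses to reach a control daemon; it does not alter the configuration of a daemon that is already running. A small sketch for checking which directory the current shell would use, assuming the documented default of `/tmp/nvidia-mps` when the variable is unset. `mps_pipe_dir` and `mps_pipe_dir_status` are hypothetical helper names.

```shell
#!/bin/sh
# Illustrative helpers, not part of MPS.

# Print the pipe directory a client started from this shell would use.
mps_pipe_dir() {
    echo "${CUDA_MPS_PIPE_DIRECTORY:-/tmp/nvidia-mps}"
}

# Report whether that directory exists. A missing directory suggests that
# no daemon was started with the same setting.
mps_pipe_dir_status() {
    dir="$(mps_pipe_dir)"
    if [ -d "$dir" ]; then
        echo "present: $dir"
    else
        echo "missing: $dir"
    fi
}
```

Running `mps_pipe_dir_status` in both the daemon's shell and a client's shell is a quick way to see whether the two agree on the directory.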
