CUDA error: MPS heap memory limit exceeded (cudaErrorMpsHeapMemoryLimitExceeded)
ID: cuda/mps-heap-limit-exceeded
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 11.8 | active | — | — | — |
| CUDA 12.2 | active | — | — | — |
| MPS 1.0 | active | — | — | — |
| NVIDIA Driver 535.54 | active | — | — | — |
Root Cause
This error occurs under NVIDIA Multi-Process Service (MPS) when the current process exhausts the per-client heap memory limit set by the MPS server (via `CUDA_MPS_HEAP_SIZE`). Typical causes are allocating too many small tensors or failing to free memory in a long-running training loop.
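One way to check whether a training loop is holding on to memory is to sample the allocated byte count once per iteration and look for a steady climb. The following is a minimal sketch, not a definitive diagnostic: `grows_steadily` is a hypothetical helper introduced here for illustration, and the readings are assumed to come from something like `torch.cuda.memory_allocated()` called at the end of each step.

```python
from typing import Sequence


def grows_steadily(samples: Sequence[int], min_increase: int = 0) -> bool:
    """Return True if every reading exceeds the previous one by more than min_increase bytes.

    `samples` is assumed to hold one memory reading per training iteration.
    Fewer than two readings cannot show a trend, so the result is False.
    """
    if len(samples) < 2:
        return False
    # Compare each reading with the one before it.
    return all(later - earlier > min_increase
               for earlier, later in zip(samples, samples[1:]))


# Illustrative readings (bytes); these numbers are made up for the example.
leaky_run = [100, 180, 260, 340]    # climbs on every step
stable_run = [100, 180, 175, 180]   # plateaus after warm-up

print(grows_steadily(leaky_run))    # True
print(grows_steadily(stable_run))   # False
```

A run that climbs on every step suggests tensors are being retained across iterations (for example, by appending loss tensors to a list), whereas a plateau after warm-up points more toward allocation size or fragmentation.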
Official Documentation
https://docs.nvidia.com/deploy/mps/index.html#topic_3_4_2
Workarounds
- 90% success: Increase the MPS heap size. Set the environment variable before starting the MPS daemon, for example `export CUDA_MPS_HEAP_SIZE=4G` (or a larger value such as `8G`), then restart MPS: stop the running daemon with `echo quit | nvidia-cuda-mps-control` and start it again with `nvidia-cuda-mps-control -d`. This raises the heap memory available to each client.
- 75% success: Reduce memory fragmentation. Call `torch.cuda.empty_cache()` periodically in your training loop, and preallocate buffers once with `torch.zeros` or `torch.empty` outside the loop so that each iteration reuses them in place instead of creating new tensors.
- 95% success: Disable MPS and run a single process per GPU. Stop the MPS daemon with `echo quit | nvidia-cuda-mps-control`. This removes the heap limit entirely, but processes sharing the GPU can no longer run their kernels concurrently through MPS.
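The buffer-reuse idea in the second workaround can be sketched as follows. This is an illustrative toy loop under stated assumptions, not a drop-in training loop: `train_steps` is a hypothetical function, the in-place `normal_()` call stands in for copying a fresh batch into the buffer, and the code falls back to the CPU when no CUDA device is available.

```python
import torch


def train_steps(num_steps: int, batch_shape: tuple, device: torch.device) -> set:
    """Run a toy loop that reuses one preallocated buffer.

    Returns the set of distinct storage addresses the buffer had during the loop.
    """
    # Allocate the scratch buffer once, before the loop starts.
    scratch = torch.empty(batch_shape, device=device)
    seen_addresses = set()
    for _ in range(num_steps):
        # In-place ops (trailing underscore) write into the existing storage
        # rather than allocating a new tensor.
        scratch.normal_()   # stand-in for loading a new batch into the buffer
        scratch.mul_(0.5)   # stand-in for per-step computation
        seen_addresses.add(scratch.data_ptr())
    return seen_addresses


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
addresses = train_steps(100, (64, 128), device)
print(len(addresses))  # 1: the same storage was used on every step
```

By contrast, writing `scratch = torch.randn(batch_shape, device=device)` inside the loop would request a new allocation on each iteration, which is the pattern the workaround tries to avoid.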
Dead Ends
Common approaches that don't work:
- Raising the PyTorch memory cap with `torch.cuda.set_per_process_memory_fraction` (90% fail): The MPS heap limit is independent of the per-process memory fraction, so changing PyTorch's own limit does not affect the MPS server's heap allocation. Note that `torch.cuda.max_memory_allocated` only reports peak usage and cannot be used to raise any limit.
- Restarting only the CUDA process without restarting the MPS server (80% fail): The MPS server's heap limit persists across client restarts, so it stays in effect until the server itself is restarted.
- Setting `CUDA_MPS_HEAP_SIZE=0` to disable the limit (75% fail): A heap size of 0 may cause undefined behavior or fall back to a very small limit. Set the variable to a positive value, or leave it unset to use the default, which is usually larger.