torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 79.15 GiB of which 2.00 GiB is free. Including non-blocking allocations, current allocated: 77.15 GiB.
ID: llm/vllm-cuda-oom-batch
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| vLLM 0.4.0 | active | — | — | — |
| PyTorch 2.2.0 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
| A100 80GB | active | — | — | — |
| H100 80GB | active | — | — | — |
Root Cause Analysis
vLLM's dynamic batching allocates KV cache blocks per request. Under high concurrency or long sequences, the cumulative allocation can exceed GPU memory even though the model weights themselves fit.
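To put rough numbers on this, the sketch below estimates KV cache size from the model shape. It is a back-of-envelope calculation, not vLLM's internal accounting. The function name `kv_bytes_per_token` is hypothetical, and the shape values assume Llama-2-7B in fp16 (32 layers, 32 KV heads, head dimension 128). The 4096-token and 64-sequence figures are illustrative worst-case inputs.

```python
# Back-of-envelope KV cache sizing (an estimate, not vLLM's internal accounting).

GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache needed to store one token across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed shape: Llama-2-7B in fp16 (32 layers, 32 KV heads, head_dim 128).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Illustrative worst case: every sequence grows to 4096 tokens.
per_seq_gib = per_token * 4096 / GIB
worst_case_gib = per_seq_gib * 64  # 64 concurrent sequences

print(f"{per_token // 1024} KiB per token")
print(f"{per_seq_gib:.1f} GiB per 4096-token sequence")
print(f"{worst_case_gib:.1f} GiB for 64 such sequences")
```

Under these assumptions a single full-length sequence needs about 2 GiB of KV cache, so 64 of them would need far more than an 80 GB GPU can provide, regardless of how small the weights are.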
Official Documentation
https://vllm.readthedocs.io/en/latest/performance/optimization.html#memory-management
Solutions
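- Reduce `max_num_seqs` in the vLLM configuration (for example, from 256 to 64) to cap the number of concurrent requests. Example: `LLM(model='meta-llama/Llama-2-7b-hf', max_num_seqs=64)`.
- Reduce `max_model_len` to cap sequence length, for example `LLM(model='...', max_model_len=4096)`. This shrinks the maximum KV cache footprint of each sequence.
- Enable `enable_prefix_caching=True` in vLLM to reuse KV cache blocks for shared prefixes, which reduces memory use for prompts that repeat.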
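The three settings can be combined in one constructor call. The following is a minimal sketch, assuming a vLLM version in which `LLM` accepts `max_num_seqs`, `max_model_len`, and `enable_prefix_caching` as keyword arguments. The helper names `build_engine_kwargs` and `make_llm` are hypothetical, and the default values are the ones suggested above rather than tuned recommendations.

```python
from typing import Any, Dict


def build_engine_kwargs(model: str,
                        max_num_seqs: int = 64,
                        max_model_len: int = 4096,
                        enable_prefix_caching: bool = True) -> Dict[str, Any]:
    """Collect the memory-related settings into kwargs for vllm.LLM."""
    if max_num_seqs <= 0 or max_model_len <= 0:
        raise ValueError("max_num_seqs and max_model_len must be positive")
    return {
        "model": model,
        "max_num_seqs": max_num_seqs,          # cap on concurrent sequences
        "max_model_len": max_model_len,        # cap on tokens per sequence
        "enable_prefix_caching": enable_prefix_caching,  # reuse shared-prefix blocks
    }


def make_llm(model: str, **overrides: Any):
    """Construct the engine; vllm is imported lazily so the helper above
    can be inspected or tested on a machine without a GPU."""
    from vllm import LLM
    return LLM(**build_engine_kwargs(model, **overrides))
```

Calling `make_llm('meta-llama/Llama-2-7b-hf')` would then start the engine with all three limits applied. Keeping the kwargs in a plain dictionary makes it easy to log the effective settings when tracking down an OOM.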
Ineffective Attempts
Common but ineffective approaches:
- 50% failure rate: The OOM comes from KV cache allocation, not model weights. Even a 7B model can OOM with very long sequences or high concurrency.
- 90% failure rate: vLLM manages its own memory pool and does not release KV cache blocks to PyTorch's cache, so `torch.cuda.empty_cache()` has no effect on vLLM allocations.
- 70% failure rate: vLLM uses tensor parallelism across GPUs, but the KV cache is still allocated per GPU. Adding GPUs without adjusting `max_num_seqs` or `max_model_len` may still OOM on each GPU.