llm resource_error ai_generated true

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 79.15 GiB of which 2.00 GiB is free. Including non-blocking allocations, current allocated: 77.15 GiB.

ID: llm/vllm-cuda-oom-batch

Fix rate: 85%

Confidence: 88%

Evidence count: 1

First seen: 2024-06-20

Version compatibility

Component	Status	Introduced	Deprecated	Notes
vLLM 0.4.0	active	—	—	—
PyTorch 2.2.0	active	—	—	—
CUDA 12.1	active	—	—	—
A100 80GB	active	—	—	—
H100 80GB	active	—	—	—

Root cause analysis

vLLM's dynamic batching allocates KV cache blocks per request. Under high concurrency or long sequences, the cumulative KV cache allocation can exceed available GPU memory even when the model weights themselves fit.
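To make the scale of the problem concrete, the following sketch estimates a worst-case KV cache footprint. It is a back-of-the-envelope upper bound and not a description of how vLLM actually reserves memory. The function names are hypothetical, and the model shape (32 layers, 32 KV heads, head dimension 128, fp16) is assumed to match Llama-2-7B.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers.

    The factor of 2 accounts for one key tensor and one value tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def worst_case_kv_cache_gib(max_num_seqs, max_model_len, bytes_per_token):
    """Upper bound in GiB if every sequence grows to max_model_len tokens."""
    return max_num_seqs * max_model_len * bytes_per_token / 2**30


# Assumed shape for Llama-2-7B in fp16 (illustrative values).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

for seqs in (64, 256):
    gib = worst_case_kv_cache_gib(seqs, 4096, per_token)
    print(f"max_num_seqs={seqs}, max_model_len=4096: up to {gib:.0f} GiB of KV cache")
```

Under these assumptions a single 4096-token sequence needs about 2 GiB of KV cache, so even 64 full-length sequences would exceed an 80 GB card in the worst case, regardless of how small the weights are.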

generic

Official documentation

https://vllm.readthedocs.io/en/latest/performance/optimization.html#memory-management

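Solutions

Reduce `max_num_seqs` in the vLLM configuration (for example, from 256 to 64) to limit the number of concurrent requests. Example: `LLM(model='meta-llama/Llama-2-7b-hf', max_num_seqs=64)`.

Reduce `max_model_len` to cap the sequence length, for example `LLM(model='...', max_model_len=4096)`. This shrinks the KV cache required per sequence.

Set `enable_prefix_caching=True` in vLLM to reuse KV cache blocks for common prefixes, which reduces memory use for repeated prompts.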
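The three settings above can be combined in a single engine configuration. The sketch below is one possible way to do that; the helper names `build_engine_kwargs` and `create_llm` are hypothetical, and the default values are the illustrative ones from this entry rather than tuned recommendations. The `vllm` import is deferred so the keyword arguments can be inspected on a machine without a GPU.

```python
def build_engine_kwargs(max_num_seqs=64, max_model_len=4096, enable_prefix_caching=True):
    """Collect the memory-related LLM() arguments discussed above (hypothetical helper)."""
    if max_num_seqs < 1 or max_model_len < 1:
        raise ValueError("max_num_seqs and max_model_len must be positive")
    return {
        "max_num_seqs": max_num_seqs,                    # cap on concurrently scheduled sequences
        "max_model_len": max_model_len,                  # cap on tokens per sequence
        "enable_prefix_caching": enable_prefix_caching,  # reuse KV blocks for shared prefixes
    }


def create_llm(model, **overrides):
    """Build a vLLM engine with the reduced-memory settings (hypothetical helper)."""
    from vllm import LLM  # imported lazily; requires vLLM and a CUDA device

    return LLM(model=model, **build_engine_kwargs(**overrides))


# Example call, left commented out because it loads the model onto the GPU:
# llm = create_llm("meta-llama/Llama-2-7b-hf", max_num_seqs=32)
```

If the error persists, lowering `max_num_seqs` further is usually the least disruptive change, since it does not truncate any individual request.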

Ineffective attempts

Common but ineffective approaches:

50% failure rate
The OOM is from KV cache allocation, not model weights. Even a 7B model can OOM with very long sequences or high concurrency.
90% failure rate
vLLM manages its own memory pool and does not release KV cache blocks to PyTorch's cache; empty_cache has no effect on vLLM allocations.
70% failure rate
vLLM uses tensor parallelism across GPUs, but KV cache is still per-GPU; adding GPUs without adjusting max_num_seqs or max_model_len may still OOM on each GPU.
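A rough calculation shows why adding GPUs alone may not be enough. The sketch below assumes that tensor parallelism splits the KV heads evenly across GPUs, so each GPU holds `1 / tensor_parallel_size` of the per-token KV cache. The function name is hypothetical, the 0.5 MiB per token figure assumes a Llama-2-7B-shaped model in fp16, and the result is a worst-case bound rather than what vLLM actually reserves.

```python
def per_gpu_worst_case_kv_gib(max_num_seqs, max_model_len, bytes_per_token,
                              tensor_parallel_size):
    """Worst-case KV cache per GPU in GiB, assuming KV heads shard evenly."""
    total_bytes = max_num_seqs * max_model_len * bytes_per_token
    return total_bytes / tensor_parallel_size / 2**30


BYTES_PER_TOKEN = 524288  # 0.5 MiB per token (assumed Llama-2-7B shape, fp16)

# Four GPUs with the concurrency limit left at 256: still above 80 GB per GPU.
unchanged = per_gpu_worst_case_kv_gib(256, 4096, BYTES_PER_TOKEN, tensor_parallel_size=4)

# Four GPUs with max_num_seqs lowered to 64: fits with room for weights.
reduced = per_gpu_worst_case_kv_gib(64, 4096, BYTES_PER_TOKEN, tensor_parallel_size=4)

print(f"tp=4, max_num_seqs=256: {unchanged:.0f} GiB per GPU")
print(f"tp=4, max_num_seqs=64:  {reduced:.0f} GiB per GPU")
```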