Tags: llm, resource_error; ai_generated: true

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 79.15 GiB of which 2.00 GiB is free. Including non-blocking allocations, current allocated: 77.15 GiB.

ID: llm/vllm-cuda-oom-batch

Fix Rate: 85%

Confidence: 88%

Evidence: 1

First Seen: 2024-06-20

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
vLLM 0.4.0	active	—	—	—
PyTorch 2.2.0	active	—	—	—
CUDA 12.1	active	—	—	—
A100 80GB	active	—	—	—
H100 80GB	active	—	—	—

Root Cause

vLLM's dynamic batching allocates KV cache blocks per request. Under high concurrency or with long sequences, the cumulative KV cache allocation can exceed GPU memory even though the model weights themselves fit.
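A rough back-of-the-envelope estimate shows why the KV cache, rather than the weights, tends to dominate. The sketch below assumes Llama-2-7B's published shape (32 layers, 32 KV heads, head dimension 128) and an fp16 cache; the function names are illustrative and not part of vLLM.

```python
# Rough KV cache sizing; the default shape assumes Llama-2-7B with an fp16 cache.
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2):
    # Factor 2: one key vector and one value vector per layer for every token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_cache_gib(num_seqs, seq_len, **shape):
    """Worst-case KV cache size if every sequence reaches seq_len tokens."""
    return num_seqs * seq_len * kv_bytes_per_token(**shape) / GIB

print(kv_bytes_per_token())       # 524288 bytes, i.e. 0.5 MiB per token
print(kv_cache_gib(256, 4096))    # 512.0
print(kv_cache_gib(64, 4096))     # 128.0
```

Under these assumptions, 256 sequences of 4096 tokens would need 512 GiB of cache in the worst case, far beyond an 80 GB card, while the fp16 weights of a 7B model occupy only about 13 GiB.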

Official Documentation

https://vllm.readthedocs.io/en/latest/performance/optimization.html#memory-management

Workarounds

90% success: Reduce `max_num_seqs` in the vLLM config (e.g., from 256 to 64) to limit the number of concurrent requests.
```
from vllm import LLM
llm = LLM(model='meta-llama/Llama-2-7b-hf', max_num_seqs=64)
```
85% success: Decrease `max_model_len` to limit sequence length. This reduces the maximum KV cache size per sequence.
```
from vllm import LLM
llm = LLM(model='...', max_model_len=4096)
```
80% success: Set `enable_prefix_caching=True` in vLLM to reuse KV cache blocks for common prefixes, reducing memory usage for repeated prompts.
```
from vllm import LLM
llm = LLM(model='...', enable_prefix_caching=True)
```
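The first two workarounds trade off against each other: for a fixed KV cache budget, a lower `max_model_len` leaves room for a higher `max_num_seqs`. Here is a minimal sketch of that trade-off, assuming a hypothetical 60 GiB budget for the KV cache and 0.5 MiB of cache per token (an fp16 Llama-2-7B estimate). `suggest_limits` is an illustrative helper, not a vLLM API, and it sizes for the worst case in which every sequence reaches `max_model_len`.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def suggest_limits(kv_budget_gib, max_model_len, bytes_per_token=MIB // 2, cap=256):
    """Return LLM keyword arguments whose worst-case KV cache fits the budget.

    Hypothetical helper: the budget and per-token size are assumptions
    supplied by the caller, not values read from vLLM.
    """
    bytes_per_seq = max_model_len * bytes_per_token
    fit = int(kv_budget_gib * GIB) // bytes_per_seq
    if fit < 1:
        raise ValueError("max_model_len alone exceeds the KV cache budget")
    # Never suggest more than the cap (256 is the starting value used above).
    return {"max_num_seqs": min(cap, fit), "max_model_len": max_model_len}

print(suggest_limits(60, 4096))   # {'max_num_seqs': 30, 'max_model_len': 4096}
print(suggest_limits(60, 1024))   # {'max_num_seqs': 120, 'max_model_len': 1024}
```

The returned dictionary could be passed on as `LLM(model='...', **suggest_limits(60, 4096))`. The estimate is deliberately conservative: real workloads rarely have every sequence at full length, so a higher `max_num_seqs` may work in practice.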

Dead Ends

Common approaches that don't work:

50% fail: Switching to a smaller model. The OOM comes from KV cache allocation, not model weights; even a 7B model can OOM with very long sequences or high concurrency.
90% fail: Calling `torch.cuda.empty_cache()`. vLLM manages its own memory pool and does not release KV cache blocks to PyTorch's cache, so `empty_cache` has no effect on vLLM allocations.
70% fail: Adding more GPUs. vLLM uses tensor parallelism across GPUs, but each GPU still holds its own share of the KV cache; adding GPUs without adjusting `max_num_seqs` or `max_model_len` may still OOM on each GPU.
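The multi-GPU case can be checked with simple arithmetic. The sketch below assumes the KV cache is split evenly across tensor-parallel ranks and uses 0.5 MiB of cache per token (an fp16 Llama-2-7B estimate). Both are assumptions, and `per_gpu_kv_gib` is an illustrative name rather than a vLLM function.

```python
GIB = 1024 ** 3

def per_gpu_kv_gib(num_seqs, seq_len, tp_size, bytes_per_token=512 * 1024):
    # Assumption: each tensor-parallel rank stores 1/tp_size of every token's KV.
    return num_seqs * seq_len * bytes_per_token / tp_size / GIB

for tp_size in (1, 2, 4, 8):
    print(tp_size, per_gpu_kv_gib(256, 4096, tp_size))
# 1 512.0
# 2 256.0
# 4 128.0
# 8 64.0
```

Under these assumptions, 256 sequences of 4096 tokens still demand 128 GiB per GPU at a tensor-parallel size of 4 in the worst case, which is more than an 80 GB card provides, so the limits have to come down as well.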