# torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 79.15 GiB of which 2.00 GiB is free. Including non-blocking allocations, current allocated: 77.15 GiB.

- **ID:** `llm/vllm-cuda-oom-batch`
- **Domain:** llm
- **Category:** resource_error
- **Verification:** ai_generated
- **Fix Rate:** 85%

## Root Cause

vLLM's dynamic batching allocates KV cache blocks for every in-flight request. Under high concurrency or long sequences, the combined KV cache demand exceeds GPU memory even though the model weights themselves fit.
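
To see why the KV cache rather than the weights dominates, the rough estimate below computes KV cache size from model shape. It is a back-of-the-envelope sketch, not vLLM's actual accounting: it assumes the usual "keys plus values, per layer, per head" formula, uses the Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128) in fp16, and ignores block-size rounding, activations, and other runtime overhead. The helper name `kv_bytes_per_token` is hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Approximate bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed shape: Llama-2-7B (32 layers, 32 KV heads, head_dim 128), fp16 = 2 bytes.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Worst case: every sequence grows to the full 4096-token context.
per_seq_4k = per_token * 4096

print(per_token / 1024 ** 2, "MiB per token")            # 0.5
print(per_seq_4k / GIB, "GiB per 4096-token sequence")   # 2.0
print(64 * per_seq_4k / GIB, "GiB for 64 such sequences")  # 128.0
```

Under these assumptions, 64 full-length sequences would need about 128 GiB of KV cache, well beyond an 80 GB card, while the fp16 weights of a 7B model take only around 13 GiB.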

## Version Compatibility

| Component | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| vLLM 0.4.0 | active | — | — |
| PyTorch 2.2.0 | active | — | — |
| CUDA 12.1 | active | — | — |
| A100 80GB | active | — | — |
| H100 80GB | active | — | — |

## Workarounds

1. **Reduce `max_num_seqs` in the vLLM config (e.g., from 256 to 64) to limit the number of concurrent requests.** (90% success)
   ```python
   from vllm import LLM

   # Cap the number of sequences batched together at 64.
   llm = LLM(model='meta-llama/Llama-2-7b-hf', max_num_seqs=64)
   ```
2. **Decrease `max_model_len` to limit sequence length. This reduces the maximum KV cache size per sequence.** (85% success)
   ```python
   from vllm import LLM

   # Cap the context length at 4096 tokens (model name elided in the original).
   llm = LLM(model='...', max_model_len=4096)
   ```
3. **Set `enable_prefix_caching=True` in vLLM to reuse KV cache blocks for common prefixes, reducing memory usage for repeated prompts.** (80% success)
   ```python
   from vllm import LLM

   # Reuse KV cache blocks across requests that share a prompt prefix.
   llm = LLM(model='...', enable_prefix_caching=True)
   ```
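
The first two workarounds trade off against each other, and a rough calculation can suggest a starting value for `max_num_seqs` at a given `max_model_len`. The sketch below is illustrative only: the helper `max_full_length_seqs` and the 56 GiB KV budget are hypothetical, the per-token figure assumes Llama-2-7B in fp16 (2 × 32 layers × 32 heads × 128 dims × 2 bytes), and it takes the worst case where every sequence reaches the full context length.

```python
GIB = 1024 ** 3


def max_full_length_seqs(kv_budget_bytes, max_model_len, bytes_per_token):
    """How many sequences fit if each one grows to max_model_len tokens."""
    return kv_budget_bytes // (max_model_len * bytes_per_token)


# Assumption: Llama-2-7B in fp16, i.e. 2 * 32 * 32 * 128 * 2 bytes per token.
BYTES_PER_TOKEN = 524_288

# Hypothetical budget: memory left for KV cache after weights and overhead.
KV_BUDGET = 56 * GIB

seqs_at_4k = max_full_length_seqs(KV_BUDGET, 4096, BYTES_PER_TOKEN)
seqs_at_2k = max_full_length_seqs(KV_BUDGET, 2048, BYTES_PER_TOKEN)

print(seqs_at_4k)  # 28
print(seqs_at_2k)  # 56
```

Halving `max_model_len` doubles the number of worst-case sequences that fit in the same budget, so the two settings are best tuned together. Real workloads rarely fill every sequence to the limit, so this is a conservative bound rather than a required setting.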

## Dead Ends

- **Switch to a smaller model**: The OOM comes from KV cache allocation, not model weights. Even a 7B model can OOM with very long sequences or high concurrency. (50% fail)
- **Call `torch.cuda.empty_cache()`**: vLLM manages its own memory pool and does not release KV cache blocks to PyTorch's cache, so `empty_cache` has no effect on vLLM allocations. (90% fail)
- **Add more GPUs without changing other settings**: vLLM uses tensor parallelism across GPUs, but the KV cache is still per-GPU; adding GPUs without adjusting `max_num_seqs` or `max_model_len` may still OOM on each GPU. (70% fail)
