# torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has 8.00 GiB total capacity; 7.80 GiB already allocated.

- **ID:** `llm/huggingface-model-load-oom-on-cpu`
- **Domain:** llm
- **Category:** resource_error
- **Error Code:** `CUDA-OOM-001`
- **Verification:** ai_generated
- **Fix Rate:** 88%

## Root Cause

Hugging Face model loading tries to allocate the full model on GPU, but the available VRAM is insufficient due to other processes (e.g., previous model instances, data loaders) consuming memory, or the model itself is too large for the GPU.

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| transformers==4.36.0 | active | — | — |
| torch==2.1.0 | active | — | — |
| accelerate==0.25.0 | active | — | — |

## Workarounds

1. **Load the model with device_map='auto' and offload to CPU or disk: `model = AutoModelForCausalLM.from_pretrained('model-name', device_map='auto', torch_dtype=torch.float16, offload_folder='/tmp/offload')`. This splits the model across GPU, CPU, and disk if needed.** (95% success)
   ```
   Load the model with device_map='auto' and offload to CPU or disk: `model = AutoModelForCausalLM.from_pretrained('model-name', device_map='auto', torch_dtype=torch.float16, offload_folder='/tmp/offload')`. This splits the model across GPU, CPU, and disk if needed.
   ```
2. **Use gradient checkpointing to reduce memory during training: `model.gradient_checkpointing_enable()` before training, which trades compute for memory by recomputing activations.** (85% success)
   ```
   Use gradient checkpointing to reduce memory during training: `model.gradient_checkpointing_enable()` before training, which trades compute for memory by recomputing activations.
   ```
3. **Explicitly clear GPU memory before loading: `import gc; gc.collect(); torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()` and then load the model with `low_cpu_mem_usage=True`.** (75% success)
   ```
   Explicitly clear GPU memory before loading: `import gc; gc.collect(); torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()` and then load the model with `low_cpu_mem_usage=True`.
   ```

## Dead Ends

- **** — While this frees GPU memory from the current session, it doesn't prevent the underlying memory fragmentation or model size issue. The error returns if the model is loaded again without adjustments. (60% fail)
- **** — empty_cache() only releases unused cached memory allocator blocks, not memory actively held by other tensors. It often has minimal effect when VRAM is fully consumed by model parameters. (80% fail)
- **** — The OOM occurs during model loading, not inference. Batch size doesn't affect model parameter memory allocation. (95% fail)