CUDA-OOM-001 llm resource_error ai_generated true

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has 8.00 GiB total capacity; 7.80 GiB already allocated.

ID: llm/huggingface-model-load-oom-on-cpu

88% Fix Rate

90% Confidence

1 Evidence

2023-12-01 First Seen

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
transformers==4.36.0	active	—	—	—
torch==2.1.0	active	—	—	—
accelerate==0.25.0	active	—	—	—

Root Cause

Hugging Face model loading tries to place the full model on the GPU, but the available VRAM is insufficient. Either memory is already held elsewhere (e.g., by previous model instances or data loaders), or the model itself is too large for the GPU.
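
As a rough check of the second cause, the weight footprint can be estimated from the parameter count and the dtype width. The sketch below is a back-of-the-envelope estimate only: the helper names are illustrative, the 7-billion-parameter figure is a hypothetical example, and activations, buffers, and allocator overhead are ignored. The 8.00 GiB and 7.80 GiB values come from the error message above.

```python
GIB = 1024 ** 3

# Bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_footprint_gib(num_params: int, dtype: str = "float16") -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


def fits_on_gpu(num_params: int, dtype: str,
                total_gib: float, already_allocated_gib: float) -> bool:
    """True if the weights alone fit in the memory that is still free."""
    free_gib = total_gib - already_allocated_gib
    return weight_footprint_gib(num_params, dtype) <= free_gib


# Hypothetical 7B-parameter model on the 8 GiB card from the error message.
print(weight_footprint_gib(7_000_000_000, "float16"))
print(fits_on_gpu(7_000_000_000, "float16", 8.00, 7.80))
```

If the weights alone exceed the free memory, clearing caches cannot help, and offloading or a smaller dtype is needed.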

generic

Official Documentation

https://huggingface.co/docs/transformers/en/troubleshooting#out-of-memory

Workarounds

95% success Load the model with `device_map='auto'` (requires accelerate) and an `offload_folder`, so that layers that do not fit in VRAM are placed in CPU RAM and, if needed, on disk. This splits the model across GPU, CPU, and disk as required.
```
import torch
from transformers import AutoModelForCausalLM

# Layers are placed on the GPU first, then CPU RAM, then the offload folder.
model = AutoModelForCausalLM.from_pretrained(
    'model-name',
    device_map='auto',
    torch_dtype=torch.float16,
    offload_folder='/tmp/offload',
)
```
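
When other allocations already occupy part of the GPU, it may help to cap how much of each device the automatic placement is allowed to use. The sketch below assumes the `max_memory` argument that `from_pretrained` accepts alongside `device_map`. The helper names and the budget values are illustrative, not part of the original entry, and the heavy imports are deferred so the kwargs can be inspected without a GPU.

```python
def build_load_kwargs(gpu_budget_gib: int, cpu_budget_gib: int,
                      offload_folder: str) -> dict:
    """Keyword arguments for from_pretrained that cap per-device usage."""
    return {
        "device_map": "auto",
        # Key 0 is the first GPU; "cpu" is host RAM. Values are size strings.
        "max_memory": {0: f"{gpu_budget_gib}GiB", "cpu": f"{cpu_budget_gib}GiB"},
        "offload_folder": offload_folder,
    }


def load_with_budget(model_name: str, gpu_budget_gib: int,
                     cpu_budget_gib: int, offload_folder: str):
    """Load a causal LM in float16 under the given memory budgets."""
    import torch  # imported lazily so the helper above works without torch
    from transformers import AutoModelForCausalLM

    kwargs = build_load_kwargs(gpu_budget_gib, cpu_budget_gib, offload_folder)
    return AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, **kwargs
    )


# Example call (not executed here because it downloads and loads a model):
# model = load_with_budget("model-name", 6, 16, "/tmp/offload")
```

Setting the GPU budget somewhat below the free VRAM leaves headroom for activations during generation.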
85% success Use gradient checkpointing to reduce memory during training: call `model.gradient_checkpointing_enable()` before training. This trades compute for memory by recomputing activations in the backward pass, so it helps only when the OOM occurs during training rather than at load time.
```
# `model` is an already loaded transformers model.
# Activations are recomputed during the backward pass instead of being stored.
model.gradient_checkpointing_enable()
```
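
The compute-for-memory trade can be illustrated with a deliberately simplified counting model. This is not how transformers implements checkpointing internally; it only counts how many layer outputs are held at once, under the assumption that one input is saved per segment and a single segment is recomputed at a time. The function name and the `segment` parameter are illustrative.

```python
import math
from typing import Optional


def peak_stored_activations(num_layers: int, segment: Optional[int] = None) -> int:
    """Approximate count of layer outputs held at peak during the backward pass."""
    if segment is None:
        # Plain backpropagation keeps every layer's output until it is used.
        return num_layers
    # With checkpointing, keep one saved input per segment ...
    checkpoints = math.ceil(num_layers / segment)
    # ... plus the outputs of the one segment currently being recomputed.
    return checkpoints + segment


print(peak_stored_activations(32))      # no checkpointing
print(peak_stored_activations(32, 4))   # segments of 4 layers
```

In this model, each segment's forward pass runs a second time during backward, which is the extra compute being paid.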
75% success Explicitly clear GPU memory before loading, then load the model with `low_cpu_mem_usage=True`. Note that `torch.cuda.reset_peak_memory_stats()` only resets statistics and frees nothing by itself.
```
import gc
import torch
from transformers import AutoModelForCausalLM

gc.collect()                          # free unreachable Python objects
torch.cuda.empty_cache()              # release unused cached blocks
torch.cuda.reset_peak_memory_stats()  # reset statistics only
model = AutoModelForCausalLM.from_pretrained('model-name', low_cpu_mem_usage=True)
```
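
Clearing memory only has an effect once nothing references the previous model instance any more. The sketch below shows this with a stand-in object in pure Python; `FakeModel` is hypothetical and holds no GPU memory. With a real model the same order applies: drop the references, call `gc.collect()`, then `torch.cuda.empty_cache()`.

```python
import gc
import weakref


class FakeModel:
    """Stand-in for a loaded model; a real one would own GPU tensors."""

    def __init__(self) -> None:
        self.weights = [0.0] * 1000


def alive_before_and_after_del() -> tuple:
    model = FakeModel()
    ref = weakref.ref(model)  # observes the object without keeping it alive

    gc.collect()
    alive_before = ref() is not None  # still referenced, so nothing is freed

    del model  # drop the last reference
    gc.collect()
    alive_after = ref() is not None

    return alive_before, alive_after


print(alive_before_and_after_del())
```

In notebooks, references can also linger in output history or in exception tracebacks, which is one reason a cleanup call may appear to do nothing.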

Dead Ends

Common approaches that don't work:

60% fail
While this frees GPU memory held by the current session, it does not address the underlying memory fragmentation or model size problem. The error returns if the model is loaded again without adjustments.
80% fail
Calling `torch.cuda.empty_cache()` on its own: it only releases unused blocks cached by the allocator, not memory held by live tensors, so it often has minimal effect when VRAM is fully consumed by model parameters.
95% fail
Reducing the batch size: the OOM occurs during model loading, not inference, and batch size does not affect the memory allocated for model parameters.
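
The `empty_cache()` limitation can be checked before relying on it. As a minimal sketch, the difference between reserved and allocated memory gives an upper bound on what the call can return to the driver. The helper names are illustrative; `torch.cuda.memory_allocated` and `torch.cuda.memory_reserved` are the PyTorch calls assumed here, and torch is imported lazily so the arithmetic can be tested without a GPU.

```python
def reclaimable_by_empty_cache(allocated_bytes: int, reserved_bytes: int) -> int:
    """Upper bound on bytes empty_cache() can release.

    Only cached blocks that back no live tensor can be released, so the
    bound is the reserved total minus what live tensors occupy.
    """
    return max(reserved_bytes - allocated_bytes, 0)


def reclaimable_on_device(device: int = 0) -> int:
    """Query the current process's allocator state for one GPU."""
    import torch  # requires a CUDA-enabled PyTorch install

    allocated = torch.cuda.memory_allocated(device)  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator
    return reclaimable_by_empty_cache(allocated, reserved)


# Example with made-up numbers: 7.8 GiB in live tensors, 7.9 GiB reserved.
GIB = 1024 ** 3
print(reclaimable_by_empty_cache(int(7.8 * GIB), int(7.9 * GIB)) / GIB)
```

When this value is far smaller than the failed allocation, the memory is held by live tensors and `empty_cache()` will not resolve the error.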