huggingface config_error ai_generated true

RuntimeError: device_map='auto' is not supported when using Trainer with a model that has been loaded with device_map='auto'. Please set device_map=None or load the model on a single device.

ID: huggingface/device-map-auto-conflict-with-trainer


Fix Rate: 90%

Confidence: 85%

Evidence: 1

First Seen: 2024-02-20

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
transformers 4.42.0	active	—	—	—
accelerate 0.28.0	active	—	—	—
torch 2.2.0	active	—	—	—

Root Cause

Trainer manages device placement itself. A model loaded with `device_map='auto'` has already been split across devices by Accelerate (model parallelism), so the two placement schemes conflict and Trainer raises the RuntimeError shown above.
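To confirm that a model carries a device map before handing it to Trainer, a small check along these lines may help. This is a minimal sketch: `devices_in_map` and `is_sharded` are hypothetical helper names, and it assumes the loaded model exposes its map as the dict attribute `hf_device_map`.

```python
from typing import Any, Set


def devices_in_map(model: Any) -> Set[Any]:
    """Return the distinct devices listed in the model's device map, if any.

    Assumption: a model loaded with a device map exposes it as the dict
    `model.hf_device_map` (module name -> device). A model without that
    attribute is treated as having no map.
    """
    device_map = getattr(model, "hf_device_map", None) or {}
    return set(device_map.values())


def is_sharded(model: Any) -> bool:
    """True when the device map places modules on more than one device."""
    return len(devices_in_map(model)) > 1
```

If `is_sharded(model)` returns True, the model was likely dispatched across devices at load time and is a candidate for the conflict described here.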

Official Documentation

https://huggingface.co/docs/transformers/en/troubleshooting#device-map-issues

Workarounds

90% success: Load the model without a device map, e.g. `model = AutoModelForCausalLM.from_pretrained('model-name', device_map=None)`, then pass it to Trainer.

85% success: For multi-GPU training, start the script with `accelerate launch` and a config file, and keep `device_map=None` in code.
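The two workarounds can be combined roughly as follows. This is a sketch under stated assumptions, not a definitive setup: `trainer_safe_kwargs` and `build_trainer` are hypothetical helper names, `train.py` is a placeholder script name, and the dataset is whatever your training code already provides.

```python
from typing import Any, Dict


def trainer_safe_kwargs(load_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Copy from_pretrained kwargs and force device_map=None.

    Hypothetical helper: leaves the caller's dict untouched so the same
    kwargs can still be used with device_map='auto' for inference.
    """
    safe = dict(load_kwargs)
    safe["device_map"] = None
    return safe


def build_trainer(model_name: str, train_dataset: Any, output_dir: str = "out"):
    """Load the model without a device map and wrap it in a Trainer."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

    model = AutoModelForCausalLM.from_pretrained(
        model_name, **trainer_safe_kwargs({"device_map": "auto"})
    )
    args = TrainingArguments(output_dir=output_dir)
    return Trainer(model=model, args=args, train_dataset=train_dataset)


# For multi-GPU runs, let Accelerate start one process per device instead of
# sharding the model at load time (train.py is a placeholder name):
#   accelerate config
#   accelerate launch train.py
```

Keeping the override in one helper means an inference path that still wants `device_map='auto'` can share the rest of its loading arguments with the training path.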

Dead Ends

Common approaches that don't work:

100% fail: Trainer does not accept a `device_map` parameter; it relies on the model's existing device map, so the same conflict occurs.

80% fail: DataParallel is incompatible with Trainer's internal gradient accumulation and loss scaling, leading to a silent accuracy drop or a hang.