huggingface config_error ai_generated true

RuntimeError: device_map='auto' is not supported when using Trainer with a model that has been loaded with device_map='auto'. Please set device_map=None or load the model on a single device.

ID: huggingface/device-map-auto-conflict-with-trainer

Fix Rate: 90%
Confidence: 85%
Evidence: 1
First Seen: 2024-02-20

Version Compatibility

Version              Status  Introduced  Deprecated  Notes
transformers 4.42.0  active
accelerate 0.28.0    active
torch 2.2.0          active

Root Cause

Trainer manages device placement itself. When the model has already been split across devices by Accelerate via `device_map='auto'`, Trainer's placement conflicts with that model parallelism and the RuntimeError above is raised.
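
A quick way to see whether a model already carries a device map before it reaches Trainer is to inspect `hf_device_map`, the attribute that `from_pretrained` attaches when `device_map` is used. The following is a minimal diagnostic sketch, not part of any library: the helper name `devices_used_by` is hypothetical, and the stand-in objects only mimic a loaded model so the snippet runs without downloading weights.

```python
from types import SimpleNamespace


def devices_used_by(model):
    """Return the set of devices listed in the model's hf_device_map.

    An empty set means the model carries no device map.
    """
    device_map = getattr(model, "hf_device_map", None) or {}
    return set(device_map.values())


# Stand-in objects (assumptions for illustration, not real models).
sharded = SimpleNamespace(
    hf_device_map={"model.embed_tokens": 0, "model.layers.0": 1, "lm_head": "cpu"}
)
plain = SimpleNamespace()

print(devices_used_by(sharded))  # more than one device: the model was split
print(devices_used_by(plain))    # empty: no device map attached
```

If the returned set contains more than one device, the model was loaded with a device map and is a candidate for the conflict described here.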

Official Documentation

https://huggingface.co/docs/transformers/en/troubleshooting#device-map-issues

Workarounds

  1. 90% success: Load the model without a device map, `model = AutoModelForCausalLM.from_pretrained('model-name', device_map=None)`, and then pass it to Trainer.
  2. 85% success: Launch the script with `accelerate launch` and a config file to manage multi-GPU, and set `device_map=None` in code.
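
The first workaround can be sketched as a small loading helper. This is an illustration under stated assumptions rather than a definitive fix: `without_device_map` and `load_model_for_trainer` are hypothetical names, and `'model-name'` remains a placeholder exactly as in the steps above.

```python
def without_device_map(load_kwargs):
    """Copy from_pretrained keyword arguments, forcing device_map to None."""
    cleaned = dict(load_kwargs)
    cleaned["device_map"] = None
    return cleaned


def load_model_for_trainer(model_name, **load_kwargs):
    """Load a causal LM with no device map so Trainer can place it itself."""
    # Imported lazily so without_device_map works even where transformers
    # is not installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_name, **without_device_map(load_kwargs)
    )


if __name__ == "__main__":
    # Shows the keyword arguments that would reach from_pretrained.
    print(without_device_map({"device_map": "auto", "torch_dtype": "auto"}))
    # model = load_model_for_trainer("model-name")  # placeholder model id
```

For the second workaround, a script built around this helper would be started with something like `accelerate launch --config_file <config>.yaml train.py`, where both file names are placeholders.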

Dead Ends

Common approaches that don't work:

  1. 100% fail: passing `device_map` to Trainer.

    Trainer does not accept a `device_map` parameter; it relies on the model's existing device map, so the same conflict occurs.

  2. 80% fail: wrapping the model in DataParallel.

    DataParallel is incompatible with Trainer's internal gradient accumulation and loss scaling, leading to a silent accuracy drop or a hang.