huggingface
runtime_error
ai_generated
partial
RuntimeError: failed to save/load training checkpoint. Could not write file to /path/to/checkpoint
ID: huggingface/training-checkpoint-save-failed
Fix Rate: 80%
Confidence: 81%
Evidence: 1
First Seen: 2023-12-01
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| transformers>=4.28.0 | active | — | — | — |
| torch>=1.13.0 | active | — | — | — |
| python>=3.8 | active | — | — | — |
Root Cause
Insufficient disk space, permission issues, or network file system (NFS) problems preventing checkpoint file writes during training.
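The disk-space and write-permission causes can be checked before training starts. The following is a minimal pre-flight sketch, not part of transformers: the helper name `check_checkpoint_dir` and the `min_free_gb` threshold are hypothetical and should be adjusted to the expected checkpoint size. It uses only the Python standard library. A failing probe write on an NFS mount would show up as a write error here as well.

```python
import os
import shutil
import tempfile


def check_checkpoint_dir(path, min_free_gb=5.0):
    """Return a list of problems that would block checkpoint writes to `path`.

    Hypothetical helper: `min_free_gb` is an assumed threshold, not a
    value taken from transformers.
    """
    problems = []

    # The directory must exist (or be creatable) before anything can be written.
    try:
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        return [f"cannot create {path}: {exc}"]

    # Free space on the volume that holds `path`, in GiB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GiB free, want at least {min_free_gb}")

    # Probe write: catches permission errors and read-only or broken mounts.
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"0" * 1024)
            probe.flush()
            os.fsync(probe.fileno())
    except OSError as exc:
        problems.append(f"cannot write to {path}: {exc}")

    return problems
```

An empty list means the directory passed both checks; any returned message points at the cause to fix before starting the training run.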
Official Documentation
https://huggingface.co/docs/transformers/main/en/main_classes/trainer#checkpointing
Workarounds
- 85% success: Check free disk space with df -h, then either free space by clearing caches or point output_dir at a directory on a volume with enough room. Example: TrainingArguments(output_dir='/new/path/with/space')
- 75% success: Save less often by raising save_steps, and shrink each checkpoint with save_only_model=True: TrainingArguments(save_steps=500, save_only_model=True). With save_only_model=True the optimizer, scheduler, and RNG state are not saved, so training cannot be resumed from that checkpoint. The option was added in a transformers release later than 4.28.0, so check the installed version first.
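The two workarounds can be combined in one configuration. The sketch below is illustrative only: the helper names `checkpoint_kwargs` and `make_training_arguments` are hypothetical, the path is the placeholder from the example above, and it assumes a transformers release recent enough to accept save_only_model.

```python
def checkpoint_kwargs(output_dir, save_steps=500):
    """Collect the checkpoint-related keyword arguments from the workarounds."""
    return {
        "output_dir": output_dir,    # directory on a volume with free space
        "save_strategy": "steps",    # save every `save_steps` optimizer steps
        "save_steps": save_steps,    # larger value means fewer checkpoint writes
        "save_only_model": True,     # skip optimizer, scheduler, and RNG state
    }


def make_training_arguments(output_dir, save_steps=500):
    """Build TrainingArguments from the settings above.

    transformers is imported lazily so that checkpoint_kwargs can be
    inspected without the library installed.
    """
    from transformers import TrainingArguments

    return TrainingArguments(**checkpoint_kwargs(output_dir, save_steps))
```

Calling make_training_arguments('/new/path/with/space') returns an object that can be passed to Trainer as its args parameter.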
Dead Ends
Common approaches that don't work:
- 70% fail: If the disk is full or permissions are wrong, even a single save will fail.
- 60% fail: Reducing the total number of checkpoints does not fix write failures; the error occurs during the write itself.