huggingface
runtime_error
ai_generated
partial
RuntimeError: failed to save/load training checkpoint. Could not write file to /path/to/checkpoint
ID: huggingface/training-checkpoint-save-failed
Fix rate: 80%
Confidence: 81%
Evidence count: 1
First seen: 2023-12-01
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| transformers>=4.28.0 | active | — | — | — |
| torch>=1.13.0 | active | — | — | — |
| python>=3.8 | active | — | — | — |
Root cause analysis
Insufficient disk space, incorrect permissions, or network file system (NFS) problems prevent checkpoint files from being written during training.
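The three causes above can be told apart with a quick check before training starts. The following is a minimal sketch using only the Python standard library; the function name `diagnose_checkpoint_dir` and the default 1 GiB threshold are illustrative assumptions, not part of transformers.

```python
import os
import shutil
import tempfile


def diagnose_checkpoint_dir(path, min_free_bytes=1024**3):
    """Return a list of problems found with `path` as a checkpoint target.

    Hypothetical helper: the name and the default threshold are assumptions.
    """
    problems = []

    # Creating the directory fails early on permission or read-only mounts.
    try:
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        return [f"cannot create directory: {exc}"]

    # Compare free space on the target filesystem with the wanted minimum.
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(
            f"low disk space: {free} bytes free, {min_free_bytes} wanted"
        )

    # Write and sync a small probe file; permission or NFS problems
    # typically surface as an OSError at this point.
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"checkpoint write probe")
            probe.flush()
            os.fsync(probe.fileno())
    except OSError as exc:
        problems.append(f"write test failed: {exc}")

    return problems
```

An empty list means the directory exists, has the requested free space, and accepted a test write. It does not guarantee that a much larger checkpoint will fit later in the run.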
Official documentation
https://huggingface.co/docs/transformers/main/en/main_classes/trainer#checkpointing
Solutions
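- Check disk space with df -h, then free space by clearing caches or switch to a different directory. Example: TrainingArguments(output_dir='/new/path/with/space')
- Set save_steps to a higher value and use save_only_model=True to reduce checkpoint size: TrainingArguments(save_steps=500, save_only_model=True). With save_only_model=True the optimizer, scheduler, and RNG state are not saved, so training cannot be resumed from that checkpoint. The option may also require a newer transformers release than the 4.28.0 minimum listed above.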
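Both solutions can be combined into one small setup step. The sketch below picks the first candidate directory with enough free space and builds the keyword arguments quoted above. The helper names `pick_output_dir` and `checkpoint_kwargs` are hypothetical, and the arguments are returned as a plain dictionary so the example runs without transformers installed.

```python
import shutil
from pathlib import Path


def pick_output_dir(candidates, min_free_bytes):
    """Return the first candidate whose filesystem has enough free space.

    Hypothetical helper; directories that do not exist yet are accepted.
    """
    for candidate in candidates:
        # disk_usage needs an existing path, so walk up to the nearest
        # existing ancestor of the candidate directory.
        probe = Path(candidate)
        while not probe.exists():
            probe = probe.parent
        if shutil.disk_usage(probe).free >= min_free_bytes:
            return candidate
    raise RuntimeError("no candidate directory has enough free space")


def checkpoint_kwargs(output_dir):
    """Build the TrainingArguments keyword arguments from the solutions above."""
    return {
        "output_dir": output_dir,
        "save_steps": 500,          # save less often
        "save_only_model": True,    # skip optimizer/scheduler/RNG state
    }
```

The result can then be passed on as `TrainingArguments(**checkpoint_kwargs(chosen_dir))`, assuming the installed transformers release accepts `save_only_model`.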
Ineffective attempts
Common but ineffective approaches:
- 70% failure rate. If the disk is full or permissions are wrong, even a single save will fail.
- 60% failure rate. Reducing the total number of checkpoints does not fix write failures; the error occurs during the write itself.