huggingface runtime_error ai_generated partial

RuntimeError: failed to save/load training checkpoint. Could not write file to /path/to/checkpoint

ID: huggingface/training-checkpoint-save-failed

Fix rate: 80%
Confidence: 81%
Evidence count: 1
First seen: 2023-12-01

Version compatibility

Version               Status  Introduced  Deprecated  Notes
transformers>=4.28.0  active
torch>=1.13.0         active
python>=3.8           active

Root cause analysis

Insufficient disk space, permission issues, or network file system (NFS) problems prevent checkpoint files from being written during training.
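The causes above can be told apart with a quick check of the checkpoint directory before training starts. The following is a minimal sketch using only the Python standard library; the function name `diagnose_checkpoint_dir` and the `required_bytes` argument are hypothetical and not part of transformers. It cannot identify NFS issues by name, but a failed write probe on a network mount points in that direction.

```python
import os
import shutil
import tempfile


def diagnose_checkpoint_dir(path, required_bytes):
    """Return a list of problems found for `path` (empty list if none).

    Hypothetical helper: `required_bytes` is the caller's own estimate of
    how much space one checkpoint needs.
    """
    problems = []

    # Create the directory if needed; failure here usually means a
    # permission problem or an invalid path.
    try:
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        return [f"cannot create directory: {exc}"]

    # Compare free space on the target filesystem with the estimate.
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        problems.append(
            f"low disk space: {free} bytes free, {required_bytes} needed"
        )

    # Actually write and sync a small file. os.access() alone can be
    # misleading on network filesystems, so a real write is a better probe.
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"checkpoint write probe")
            probe.flush()
            os.fsync(probe.fileno())
    except OSError as exc:
        problems.append(f"write test failed: {exc}")

    return problems
```

An empty list means the directory passed both checks at that moment; space can still run out later in a long training run.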

generic

Official documentation

https://huggingface.co/docs/transformers/main/en/main_classes/trainer#checkpointing

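Solutions

  1. Check disk space with df -h, then free up space by clearing caches or write checkpoints to a different directory. Example: TrainingArguments(output_dir='/new/path/with/space')
  2. Set save_steps to a higher value and use save_only_model=True to reduce checkpoint size: TrainingArguments(save_steps=500, save_only_model=True). Note that save_only_model skips the optimizer, scheduler, and RNG state, so training cannot be resumed from such a checkpoint.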
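To see how much save_only_model=True might save, a rough size estimate can be compared against the free space reported by df -h. The sketch below is back-of-the-envelope arithmetic, not a transformers API: both function names are hypothetical, and it assumes fp32 weights and an AdamW-style optimizer that keeps two fp32 tensors per parameter. Scheduler and RNG state are ignored because they are small.

```python
import shutil


def estimate_checkpoint_bytes(num_params, bytes_per_param=4, save_only_model=False):
    """Rough size of one checkpoint in bytes (hypothetical helper)."""
    weights = num_params * bytes_per_param
    if save_only_model:
        return weights
    # Assumption: the optimizer stores two fp32 moment tensors per parameter.
    optimizer_state = num_params * 4 * 2
    return weights + optimizer_state


def checkpoints_that_fit(path, checkpoint_bytes):
    """Number of checkpoints of the given size that fit in free space at `path`."""
    return shutil.disk_usage(path).free // checkpoint_bytes


# Example with a 110M-parameter model (an illustrative figure):
full = estimate_checkpoint_bytes(110_000_000)                        # 1_320_000_000
slim = estimate_checkpoint_bytes(110_000_000, save_only_model=True)  # 440_000_000
```

Under these assumptions a model-only checkpoint is about one third the size of a full one, so the same disk holds roughly three times as many.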

Ineffective attempts

Common but ineffective approaches:

  1. 70% failure rate

    If the disk is full or permissions are wrong, even a single save will fail.

  2. 60% failure rate

    Reducing total checkpoints does not fix write failures; the error occurs during writing itself.