huggingface runtime_error ai_generated partial

RuntimeError: failed to save/load training checkpoint. Could not write file to /path/to/checkpoint

ID: huggingface/training-checkpoint-save-failed


Fix Rate: 80%

Confidence: 81%

Evidence: 1

First Seen: 2023-12-01

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
transformers>=4.28.0	active	—	—	—
torch>=1.13.0	active	—	—	—
python>=3.8	active	—	—	—

Root cause: insufficient disk space, missing write permissions, or network file system (NFS) problems prevent checkpoint files from being written during training.
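To tell these causes apart before a long run, one option is to attempt a small real write in the checkpoint directory and inspect the error. The sketch below uses only the Python standard library; `probe_checkpoint_dir` is a hypothetical helper name, not part of transformers, and the 4 KiB probe size is an arbitrary assumption.

```python
import os
import tempfile
from typing import Optional


def probe_checkpoint_dir(path: str) -> Optional[str]:
    """Try a small real write in `path`.

    Returns None if the write succeeds, otherwise a short error description.
    (Hypothetical helper for illustration; not a transformers API.)
    """
    try:
        os.makedirs(path, exist_ok=True)
        # A real write can catch cases that a permission-bit check misses,
        # such as read-only mounts or NFS exports that remap the user.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".write_probe_") as f:
            f.write(b"0" * 4096)
            f.flush()
            os.fsync(f.fileno())  # push the data to the filesystem now
    except OSError as exc:
        # PermissionError, "No space left on device", stale NFS handles, etc.
        return f"{type(exc).__name__}: {exc}"
    return None
```

Calling `probe_checkpoint_dir('/path/to/checkpoint')` before constructing the trainer surfaces the underlying `OSError` directly, which is usually more specific than the wrapped RuntimeError.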

generic


85% success: Check disk space with df -h, then either free space by clearing caches or point output_dir at a directory on a filesystem with enough room. Example:
```
TrainingArguments(output_dir='/new/path/with/space')
```
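The same check can be done from Python before training starts. This is a minimal sketch using `shutil.disk_usage` from the standard library; `free_gib` and `pick_output_dir` are hypothetical helper names, and the candidate paths and the space threshold are values you would supply yourself.

```python
import os
import shutil
from typing import Iterable, Optional


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def pick_output_dir(candidates: Iterable[str], required_gib: float) -> Optional[str]:
    """Return the first existing candidate directory with enough free space.

    Hypothetical helper for illustration; returns None if nothing qualifies.
    """
    for path in candidates:
        if os.path.isdir(path) and free_gib(path) >= required_gib:
            return path
    return None
```

The result can then be passed as `output_dir` to TrainingArguments. If it is None, no candidate has enough room and space has to be freed first.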
75% success: Raise save_steps so checkpoints are written less often, and set save_only_model=True to shrink each one. With save_only_model=True the optimizer, scheduler, and RNG state are not saved, so training cannot be resumed from these checkpoints. Note that 500 is the library default for save_steps, so use a larger value if you have not changed it. Example:
```
TrainingArguments(output_dir='/new/path/with/space', save_steps=500, save_only_model=True)
```
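To judge how much save_only_model=True might save, a back-of-the-envelope estimate can help. The sketch below assumes weights are stored in fp32 and that the optimizer is AdamW-like, keeping two fp32 state tensors per parameter; it ignores scheduler state, RNG state, and file-format overhead. `estimate_checkpoint_gib` is a hypothetical helper, not a transformers function.

```python
def estimate_checkpoint_gib(
    num_params: int,
    bytes_per_param: int = 4,             # assumption: fp32 weights
    optimizer_states_per_param: int = 2,  # assumption: AdamW-style moments
    save_only_model: bool = False,
) -> float:
    """Rough size of one checkpoint in GiB under the stated assumptions."""
    weights = num_params * bytes_per_param
    # Optimizer state is assumed to be kept in fp32 (4 bytes per value).
    optimizer = 0 if save_only_model else num_params * optimizer_states_per_param * 4
    return (weights + optimizer) / 2**30
```

Under these assumptions a full checkpoint is about three times the size of a model-only one: for a 7-billion-parameter model, roughly 78 GiB versus roughly 26 GiB. Comparing the estimate with the free space on the target filesystem shows whether even a single save can succeed.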


Common approaches that don't work:

70% fail: If the disk is full or permissions are wrong, even a single save will fail.
60% fail: Reducing the total number of checkpoints does not fix write failures; the error occurs during the write itself.