# RuntimeError: failed to save/load training checkpoint. Could not write file to /path/to/checkpoint

- **ID:** `huggingface/training-checkpoint-save-failed`
- **Domain:** huggingface
- **Category:** runtime_error
- **Verification:** ai_generated
- **Fix Rate:** 80%

## Root Cause

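Checkpoint writes fail during training when the volume holding the output directory runs out of disk space, when the training process lacks write permission on that directory, or when a network file system (NFS) mount is unavailable or misbehaving.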
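
The first two causes can be told apart before training starts. The following is a minimal sketch, not part of transformers: `diagnose_checkpoint_dir` and its `required_bytes` parameter are hypothetical names chosen for illustration. It uses only the standard library and creates a small temporary file instead of relying on `os.access`, whose answer can be unreliable on NFS.

```python
import os
import shutil
import tempfile


def diagnose_checkpoint_dir(path, required_bytes):
    """Return a list of problems found for `path` (empty list if none).

    Hypothetical helper: `required_bytes` is the caller's own estimate of
    how much space one checkpoint needs.
    """
    problems = []

    # output_dir may not exist yet, so walk up to the nearest existing ancestor.
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:
            break
        probe = parent

    # Cause 1: not enough free space on the volume.
    free = shutil.disk_usage(probe).free
    if free < required_bytes:
        problems.append(
            f"only {free} bytes free under {probe}, need {required_bytes}"
        )

    # Cause 2: no write permission. Creating a real file is the direct test.
    try:
        with tempfile.NamedTemporaryFile(dir=probe):
            pass
    except OSError as exc:
        problems.append(f"cannot write to {probe}: {exc}")

    return problems
```

An empty list means neither check failed. It does not rule out NFS problems that only show up on large or long-running writes.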

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| transformers>=4.28.0 | active | — | — |
| torch>=1.13.0 | active | — | — |
| python>=3.8 | active | — | — |

## Workarounds

1. **Check free disk space, then free space (for example by clearing caches) or point `output_dir` at a directory on a volume with room** (85% success)
   ```
   df -h /path/to/checkpoint                             # shell: free space on the checkpoint volume
   TrainingArguments(output_dir='/new/path/with/space')  # Python: write checkpoints elsewhere
   ```
2. **Raise `save_steps` to save less often and set `save_only_model=True` to shrink each checkpoint** (75% success). With `save_only_model=True` the optimizer, scheduler, and RNG state are not saved, so training cannot be resumed from that checkpoint. The option requires a transformers release recent enough to include it.
   ```
   TrainingArguments(save_steps=500, save_only_model=True)
   ```
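
To judge how much space a checkpoint needs and how much `save_only_model=True` saves, a rough estimate can help. The sketch below is an approximation under stated assumptions, not a transformers API: `estimate_checkpoint_bytes` is a hypothetical helper, and it assumes fp32 weights and an AdamW-style optimizer that keeps two fp32 tensors per parameter. Scheduler and RNG state are ignored because they are small.

```python
def estimate_checkpoint_bytes(num_params, save_only_model=False, bytes_per_weight=4):
    """Rough size of one checkpoint in bytes (hypothetical helper).

    Assumes fp32 weights by default and an AdamW-style optimizer with two
    fp32 moment tensors per parameter. Real checkpoints differ somewhat.
    """
    weights = num_params * bytes_per_weight
    if save_only_model:
        # Only the model weights are written.
        return weights
    # Two fp32 moment tensors (4 bytes each) per parameter.
    optimizer_state = num_params * 2 * 4
    return weights + optimizer_state


# Example: a model with 110 million parameters.
full = estimate_checkpoint_bytes(110_000_000)
model_only = estimate_checkpoint_bytes(110_000_000, save_only_model=True)
print(f"full: {full / 1e9:.2f} GB, model only: {model_only / 1e9:.2f} GB")
```

Under these assumptions a full checkpoint is about three times the size of a model-only one, which is why the second workaround helps when space is tight but not when the directory is unwritable.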

## Dead Ends

- If the disk is full or permissions are wrong, even a single save will fail. (70% fail)
- Reducing the total number of checkpoints does not fix write failures; the error occurs during the write itself. (60% fail)
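
Because the failure happens during the write itself, it can be reproduced outside of training with a throwaway file. The following is a minimal sketch using only the standard library: `probe_write` and its default sizes are hypothetical choices for illustration. It writes and fsyncs a temporary file so that quota or NFS errors that only appear on flush are surfaced.

```python
import os
import tempfile


def probe_write(directory, size_bytes=64 * 1024 * 1024, chunk_bytes=1024 * 1024):
    """Write and fsync a temporary file of `size_bytes` in `directory`.

    Hypothetical helper: returns None on success, or the OSError raised.
    The file is deleted automatically when the handle closes.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            chunk = b"\0" * chunk_bytes
            written = 0
            while written < size_bytes:
                n = min(chunk_bytes, size_bytes - written)
                handle.write(chunk[:n])
                written += n
            # Force data to the underlying storage so deferred errors surface.
            handle.flush()
            os.fsync(handle.fileno())
    except OSError as exc:
        return exc
    return None
```

If this returns an error for the checkpoint directory, changing `save_steps` or the number of retained checkpoints will not help; the directory itself has to be fixed or replaced.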
