# tensorflow.python.framework.errors_impl.UnknownError: Failed to save checkpoint to /tmp/model.ckpt: IO error: No space left on device [Op:SaveV2]

- **ID:** `tensorflow/checkpoint-save-failed-io-error`
- **Domain:** tensorflow
- **Category:** resource_error
- **Error Code:** `ESAV`
- **Verification:** ai_generated
- **Fix Rate:** 85%

## Root Cause

The filesystem that holds the checkpoint directory (`/tmp` in this trace) has run out of free blocks or free inodes, so the write issued by the `SaveV2` op fails with `No space left on device`.
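Both conditions can be checked from Python before a save is attempted. The following is a minimal sketch, not part of TensorFlow: `filesystem_headroom` and `can_save` are hypothetical helper names, and `os.statvfs` is only available on POSIX systems. Some filesystems do not report inode counts, so the sketch treats a zero total as "unknown" rather than "exhausted".

```python
import os
import shutil


def filesystem_headroom(path):
    """Return free bytes and free inodes for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    stats = os.statvfs(path)         # POSIX only; not available on Windows
    # A total inode count of 0 means the filesystem does not report inodes.
    free_inodes = stats.f_favail if stats.f_files > 0 else None
    return {"free_bytes": usage.free, "free_inodes": free_inodes}


def can_save(path, required_bytes):
    """True if there is room for `required_bytes` and inodes are not exhausted."""
    room = filesystem_headroom(path)
    inodes_ok = room["free_inodes"] is None or room["free_inodes"] > 0
    return room["free_bytes"] >= required_bytes and inodes_ok
```

Calling `can_save("/tmp", required_bytes)` with an estimate of the checkpoint size before each save lets a training loop fail early with a clearer message than the `SaveV2` error.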

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| 2.12 | active | — | — |
| 2.13 | active | — | — |
| 2.14 | active | — | — |

## Workarounds

1. **Check block and inode usage with `df -h` and `df -i`, then delete unneeded files or expand the partition. Alternatively, point `tf.train.CheckpointManager` at a directory on a partition with more free space.** (85% success)
   ```
   df -h /tmp   # free blocks on the filesystem holding the checkpoint
   df -i /tmp   # free inodes on the same filesystem
   ```
2. **In a distributed job, route checkpoint I/O to the host whose local disk has free space by passing `tf.train.CheckpointOptions(experimental_io_device='/job:localhost')` to the save call.** (75% success)
   `experimental_io_device` only selects the device that performs the filesystem I/O, and `experimental_enable_async_checkpoint=True` only moves the write off the training thread. Neither option compresses the checkpoint or reduces the bytes written, so this helps only when the save was landing on the wrong, full filesystem.
   ```
   # `checkpoint` is assumed to be an existing tf.train.Checkpoint
   options = tf.train.CheckpointOptions(experimental_io_device='/job:localhost')
   checkpoint.save('/tmp/model.ckpt', options=options)
   ```
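The "different directory" option from the first workaround can be sketched as below. This is an illustration under assumptions: `pick_checkpoint_dir` and `build_manager` are hypothetical helper names, the candidate paths are supplied by the caller, and `max_to_keep` is an extra `tf.train.CheckpointManager` argument (not mentioned above) that caps how many old checkpoints stay on disk.

```python
import os
import shutil


def pick_checkpoint_dir(candidates, required_bytes):
    """Return the first candidate directory whose filesystem has enough free bytes."""
    for directory in candidates:
        os.makedirs(directory, exist_ok=True)
        if shutil.disk_usage(directory).free >= required_bytes:
            return directory
    raise OSError(f"no candidate has {required_bytes} bytes free: {candidates}")


def build_manager(checkpoint, directory, max_to_keep=3):
    """Wrap an existing tf.train.Checkpoint in a manager that prunes old saves."""
    # Imported lazily so the disk check above runs without TensorFlow installed.
    import tensorflow as tf
    return tf.train.CheckpointManager(
        checkpoint, directory=directory, max_to_keep=max_to_keep
    )
```

A training script would call `pick_checkpoint_dir` once at startup, pass the result to `build_manager`, and then call `manager.save()` in place of saving to a fixed path.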

## Dead Ends

- **Delete random files in /tmp to free space** — The checkpoint path may not be in /tmp; also deleting unrelated files can cause other failures. (60% fail)
- **Set TF_CPP_MIN_LOG_LEVEL=2 to suppress the error** — Suppressing logs does not resolve the underlying disk space issue; the checkpoint will still not be saved. (90% fail)
- **Reduce batch size to reduce checkpoint size** — Checkpoint size is determined by model parameters, not batch size; reducing batch size does not free disk space. (80% fail)
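The last point can be made concrete with a rough size estimate. The sketch below is an approximation only: it ignores checkpoint metadata and index files, the shapes describe a hypothetical model, and the `optimizer_slots=2` case assumes an Adam-style optimizer that keeps two extra values per variable. Batch size does not appear anywhere in the calculation.

```python
import math


def estimate_checkpoint_bytes(variable_shapes, bytes_per_element=4, optimizer_slots=0):
    """Rough checkpoint size: every saved variable plus optional optimizer slot copies."""
    elements = sum(math.prod(shape) for shape in variable_shapes)
    return elements * bytes_per_element * (1 + optimizer_slots)


# Hypothetical model: one 1000x1000 weight matrix and a 1000-element bias, float32.
shapes = [(1000, 1000), (1000,)]
weights_only = estimate_checkpoint_bytes(shapes)
with_slots = estimate_checkpoint_bytes(shapes, optimizer_slots=2)  # assumed Adam-style slots
print(weights_only, with_slots)  # 4004000 12012000
```

Comparing this estimate with the free space reported by `df -h` indicates how many checkpoints the partition can hold.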
