# RuntimeError: [torch.distributed] Barrier timeout after 600000 ms

- **ID:** `pytorch/distributed-barrier-timeout`
- **Domain:** pytorch
- **Category:** network_error
- **Verification:** ai_generated
- **Fix Rate:** 80%

## Root Cause

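At least one rank in a distributed training job did not reach the barrier before the timeout expired (600000 ms, i.e. 10 minutes, in this message), so the ranks that were waiting raised this error. The usual causes are a deadlock (for example, ranks issuing collectives a different number of times or in a different order), a slow rank, or a network partition.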
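
One way ranks can fall out of step is a mismatch in the number of batches each rank processes. The sketch below is plain Python with no PyTorch dependency. It assumes a naive strided sharding scheme in which rank `r` takes samples `r`, `r + world_size`, and so on. The helper name `batches_per_rank` is hypothetical and exists only for illustration.

```python
import math


def batches_per_rank(num_samples, world_size, batch_size):
    """Count the batches each rank sees under naive strided sharding."""
    counts = []
    for rank in range(world_size):
        # Rank r takes samples r, r + world_size, r + 2 * world_size, ...
        shard_size = len(range(rank, num_samples, world_size))
        # The last partial batch is kept (no drop_last), so round up.
        counts.append(math.ceil(shard_size / batch_size))
    return counts


counts = batches_per_rank(num_samples=10, world_size=4, batch_size=1)
print(counts)  # [3, 3, 2, 2]
```

With these numbers, ranks 2 and 3 run out of data one step before ranks 0 and 1. If every step contains a collective operation, ranks 0 and 1 wait in their third step for peers that have already moved on, which is one way the job can end in a barrier timeout.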

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| torch>=1.13.0 | active | — | — |
| NCCL>=2.14 | active | — | — |

## Workarounds

1. **Check for uneven data loading across ranks. Use `torch.utils.data.DistributedSampler` with `drop_last=True` so that every rank processes the same number of batches.** (80% success)
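2. **As a diagnostic step, increase the process-group timeout (the `timeout` argument of `torch.distributed.init_process_group`) and add per-rank logging to identify which rank is slow.** (70% success)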
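
The two workarounds above can be sketched together as follows. This is a minimal sketch under stated assumptions, not a definitive setup. It assumes a `torchrun`-style launch that sets `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`, and it assumes the NCCL backend. The helper names `setup_process_group`, `make_loader`, and `timed_barrier`, the logger name, and the 30-minute default are all illustrative choices. PyTorch is imported inside the functions so that the timing helper can be tried on a machine without PyTorch or a GPU.

```python
import datetime
import logging
import time

logger = logging.getLogger("barrier_debug")  # hypothetical logger name


def setup_process_group(timeout_minutes=30):
    """Initialize the default process group with an explicit timeout."""
    import torch.distributed as dist  # lazy import; requires PyTorch

    # Assumes rank and rendezvous settings come from the environment,
    # as they do when the job is launched with torchrun.
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=timeout_minutes),
    )
    return dist


def make_loader(dataset, batch_size):
    """Build a DataLoader that yields the same batch count on every rank."""
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # With no explicit num_replicas/rank, the sampler reads them from the
    # initialized process group, so call setup_process_group() first.
    # drop_last=True drops the uneven tail instead of padding with repeats.
    sampler = DistributedSampler(dataset, drop_last=True)
    return DataLoader(
        dataset, batch_size=batch_size, sampler=sampler, drop_last=True
    )


def timed_barrier(barrier_fn, rank, tag):
    """Call barrier_fn and log how long this rank waited inside it."""
    logger.info("rank %d entering barrier '%s'", rank, tag)
    start = time.monotonic()
    barrier_fn()
    waited = time.monotonic() - start
    logger.info("rank %d left barrier '%s' after %.3f s", rank, tag, waited)
    return waited
```

In a training script, a call such as `timed_barrier(dist.barrier, dist.get_rank(), "epoch_end")` would produce one pair of log lines per rank. A rank that logs a short wait arrived late and is a candidate for the slow rank, and a rank that never logs "entering" is the one that did not reach the barrier at all.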

## Dead Ends

- **Increasing the timeout alone:** a longer timeout does not fix the root cause, such as a deadlock or an imbalanced workload. (80% fail)
- **Killing and restarting all processes:** restarting without addressing the imbalance or the network issue reproduces the same error. (90% fail)
