# RuntimeError: [torch.distributed] Connection closed by peer: cannot connect to rank 2 at 10.0.0.5:29500 (timeout=30)

- **ID:** `pytorch/distributed-timeout-connection`
- **Domain:** pytorch
- **Category:** network_error
- **Verification:** ai_generated
- **Fix Rate:** 75%

## Root Cause

The process for rank 2 either failed to initialize or became unreachable within the 30-second timeout window. Typical causes are a network partition, a firewall blocking port 29500, or a crash of the rank's process.
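
The three causes above tend to look different at the TCP level, so a quick probe can help narrow them down. The following is a minimal sketch using only the Python standard library; `probe_endpoint` is a hypothetical helper name, not part of PyTorch. As a rough heuristic (an assumption, not a guarantee), `refused` suggests the host is reachable but nothing is listening, which fits a crashed or not-yet-started process, while `timeout` suggests packets are being dropped, which fits a firewall or network partition. A firewall configured to reject rather than drop traffic can also produce `refused`.

```python
import socket


def probe_endpoint(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error'."""
    try:
        # create_connection performs the full TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout; packets are probably being dropped.
        return "timeout"
    except OSError:
        # Anything else, e.g. DNS failure or no route to host.
        return "error"
```

For the endpoint in the error message, this could be called as `probe_endpoint("10.0.0.5", 29500)` from each of the other nodes.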

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| torch>=1.10.0 | active | — | — |
| torch>=2.0.0 | active | — | — |

## Workarounds

1. **Verify that all ranks are started with the same `world_size`, `master_addr`, and `master_port`** (85% success)
   ```
   torchrun --nproc_per_node=4 --master_port=29500 train.py
   ```
   For a multi-node job, also pass the same `--nnodes` and `--master_addr` values on every node, with a distinct `--node_rank` per node.
2. **Check firewall rules and confirm that port 29500 is open** (90% success)
   ```
   nc -zv 10.0.0.5 29500
   ```
3. **Set the `MASTER_ADDR` and `MASTER_PORT` environment variables to the same values on all nodes** (80% success). The values below are the address and port from the error message, used as an example.
   ```
   export MASTER_ADDR=10.0.0.5
   export MASTER_PORT=29500
   ```
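
Workarounds 1 and 3 both come down to checking that every node agrees on the rendezvous settings. The sketch below shows one way to compare them, assuming you have already collected each node's environment into a dictionary (for example by dumping `env` over SSH; that collection step is not shown). `find_mismatches` and `SHARED_KEYS` are hypothetical names introduced for illustration.

```python
from typing import Dict, List

# Settings that should be identical on every node for the rendezvous to succeed.
SHARED_KEYS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE")


def find_mismatches(node_envs: Dict[str, Dict[str, str]]) -> List[str]:
    """Return human-readable problems found across per-node environment dumps."""
    problems = []
    for key in SHARED_KEYS:
        # Map each node name to its value for this key (None if unset).
        values = {node: env.get(key) for node, env in node_envs.items()}

        missing = sorted(node for node, value in values.items() if value is None)
        if missing:
            problems.append(f"{key} is unset on: {', '.join(missing)}")

        # More than one distinct value means the nodes disagree.
        distinct = {value for value in values.values() if value is not None}
        if len(distinct) > 1:
            problems.append(f"{key} differs across nodes: {sorted(distinct)}")
    return problems
```

An empty list means the checked settings agree. It does not prove that the address is reachable, so it complements the port check in workaround 2 rather than replacing it.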

## Dead Ends

- **Increasing the timeout**: If the rank process is dead or the network is broken, a longer timeout only delays the failure. (60% fail)
- **Enabling debug logging**: Debug logs help diagnose the problem but do not fix the underlying connectivity issue. (85% fail)
