# RuntimeError: [torch.distributed] Connection closed by peer: unable to connect to rank 2 at 10.0.0.5:29500 (timeout=30)

- **ID:** `pytorch/distributed-timeout-connection`
- **Domain:** pytorch
- **Category:** network_error
- **Verification level:** ai_generated
- **Fix rate:** 75%

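## Root Cause

Rank 2 of the distributed training job failed to initialize, or became unreachable, within the 30-second timeout window. Typical causes are a network partition, a firewall blocking the port, or a crashed process.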
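
The three causes listed here tend to look different at the TCP level, which can help narrow things down before changing any configuration. The following is a minimal sketch, not part of PyTorch: `probe_endpoint` is a hypothetical helper that uses only the Python standard library, and the interpretation in the comments is a rule of thumb rather than a guarantee.

```python
import socket


def probe_endpoint(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt to host:port.

    Returns "open", "refused", "timeout", or "error: <detail>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # Something is listening and the network path works.
            return "open"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on the port:
        # the process that should own the port has likely not started or has crashed.
        return "refused"
    except socket.timeout:
        # No answer at all: often a firewall silently dropping packets
        # or a network partition.
        return "timeout"
    except OSError as exc:
        # Other failures, e.g. name resolution errors or no route to host.
        return f"error: {exc}"
```

Calling `probe_endpoint("10.0.0.5", 29500)` from the node that reports the error, while the job is starting, gives a first hint as to which of the causes applies.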

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| torch>=1.10.0 | active | — | — |
| torch>=2.0.0 | active | — | — |

## Solutions

1. Verify that every rank is launched with the same `world_size` and `master_addr`. On multi-node jobs, each node must also be given the same `--master_addr` and `--master_port`:
   ```
   torchrun --nproc_per_node=4 --master_port=29500 train.py
   ```
2. Check firewall rules and confirm that port 29500 is reachable from the other nodes:
   ```
   nc -zv 10.0.0.5 29500
   ```
3. Set the `MASTER_ADDR` and `MASTER_PORT` environment variables to the same values on all nodes.
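
Checking that the rendezvous settings are consistent can be partly automated. The sketch below assumes the job uses the standard `env://` variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`); `check_rendezvous_env` is a hypothetical helper, not a PyTorch API, and it only validates the values visible to one process. Comparing the values across nodes still has to be done by hand or by a launcher script.

```python
import os
from typing import List, Mapping, Optional

# Variables read by torch.distributed's env:// initialization.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def check_rendezvous_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of problems found in the rendezvous variables (empty if none)."""
    env = os.environ if env is None else env

    # Report every missing variable at once, then stop.
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    if problems:
        return problems

    port = env["MASTER_PORT"]
    if not port.isdecimal() or not 1 <= int(port) <= 65535:
        problems.append(f"MASTER_PORT={port!r} is not a valid TCP port")

    world_size, rank = env["WORLD_SIZE"], env["RANK"]
    if not world_size.isdecimal() or int(world_size) < 1:
        problems.append(f"WORLD_SIZE={world_size!r} must be a positive integer")
    elif not rank.isdecimal() or int(rank) >= int(world_size):
        problems.append(f"RANK={rank!r} must be in the range 0..{int(world_size) - 1}")

    return problems
```

One way to use it is to call `check_rendezvous_env()` at the top of `train.py`, log the returned problems together with the values of `MASTER_ADDR` and `MASTER_PORT`, and compare the logged values from each node.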

## Failed Attempts

- **Increasing the timeout**: if the rank process is dead or the network is broken, a longer timeout only delays the failure. (60% failure rate)
- **Enabling debug logging**: debug logs help diagnose the problem but do not fix the underlying connectivity issue. (85% failure rate)
