pytorch network_error ai_generated partial

RuntimeError: [torch.distributed] Connection closed by peer: cannot connect to rank 2 at 10.0.0.5:29500 (timeout=30)

ID: pytorch/distributed-timeout-connection

Fix Rate: 75%

Confidence: 87%

Evidence: 1

First Seen: 2023-06-10

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
torch>=1.10.0	active	—	—	—
torch>=2.0.0	active	—	—	—

Root Cause

The distributed training process for rank 2 (expected at 10.0.0.5:29500) failed to initialize, or became unreachable, within the 30-second timeout window. Typical causes are a network partition, a firewall blocking the port, or a crashed process.

generic

Official Documentation

https://pytorch.org/docs/stable/distributed.html#troubleshooting

Workarounds

85% success: Verify that every rank is started with the same world size, master address, and master port. On multi-node jobs, pass the same --master_addr and --master_port on every node. Example launch command:
```
torchrun --nproc_per_node=4 --master_port=29500 train.py
```
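As a rough illustration of this check, the sketch below inspects the rendezvous variables that torchrun exports to each worker (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) before the process group is initialized. The helper name `check_rendezvous_env` is hypothetical and not part of PyTorch; it only reports obvious problems and does not contact other ranks.

```python
import os

# Variables read by the env:// rendezvous; torchrun sets them for each worker.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
INTEGER_VARS = ("MASTER_PORT", "WORLD_SIZE", "RANK")


def check_rendezvous_env(env=None):
    """Return a list of problems found in the rendezvous variables (empty if none).

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    problems = []
    # Every required variable must be present and non-empty.
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    # Port, world size, and rank must be non-negative integers.
    for name in INTEGER_VARS:
        value = env.get(name)
        if value and not value.isdigit():
            problems.append(f"{name}={value!r} is not a non-negative integer")
    # Ranks are numbered 0 .. WORLD_SIZE - 1.
    if not problems and int(env["RANK"]) >= int(env["WORLD_SIZE"]):
        problems.append(
            f"RANK={env['RANK']} must be smaller than WORLD_SIZE={env['WORLD_SIZE']}"
        )
    return problems
```

A training script could call `check_rendezvous_env()` at startup and log or raise on a non-empty result, so a misconfigured rank fails immediately instead of waiting for the timeout.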
90% success: Check firewall rules and confirm that port 29500 is reachable from the other nodes:
```
nc -zv 10.0.0.5 29500
```
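Where `nc` is not installed, the same reachability probe can be approximated from Python with the standard library. This is a minimal sketch; the function name `port_is_reachable` and the 3-second default are assumptions chosen for illustration. It only tests that a TCP connection can be opened, not that the listener is a PyTorch rendezvous.

```python
import socket


def port_is_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Running `port_is_reachable("10.0.0.5", 29500)` from each worker node, with the address and port taken from the error message, shows which nodes cannot reach the target. A `False` result while the target process is running points toward a firewall or routing problem rather than a crashed rank.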
80% success: Set the MASTER_ADDR and MASTER_PORT environment variables to the same values on every node, so that all ranks rendezvous at the same endpoint.
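One way to spot inconsistent settings is to collect the relevant variables from each node (for example by printing them at startup) and compare them. The sketch below assumes the values have already been gathered into a dictionary keyed by node name; the helper `find_mismatches` and the node names are hypothetical.

```python
# Settings that must be identical on every node. RANK is excluded on purpose
# because it differs per process by design.
SHARED_KEYS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE")


def find_mismatches(settings_by_node):
    """Return {key: {node: value}} for every shared key whose values disagree."""
    mismatches = {}
    for key in SHARED_KEYS:
        seen = {node: settings.get(key) for node, settings in settings_by_node.items()}
        # More than one distinct value (including a missing one) is a mismatch.
        if len(set(seen.values())) > 1:
            mismatches[key] = seen
    return mismatches


# Example input: node1 was launched with a different port.
example = {
    "node0": {"MASTER_ADDR": "10.0.0.5", "MASTER_PORT": "29500", "WORLD_SIZE": "4"},
    "node1": {"MASTER_ADDR": "10.0.0.5", "MASTER_PORT": "29501", "WORLD_SIZE": "4"},
}
print(find_mismatches(example))
```

An empty result means the shared settings agree; any key that appears in the result lists the value each node reported.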

Dead Ends

Common approaches that don't work:

60% fail: Increasing the timeout. If the rank's process is dead or the network is broken, a longer timeout only delays the failure.
85% fail: Enabling debug logging. Debug logs help diagnose the problem but do not fix the underlying connectivity issue.