pytorch
network_error
ai_generated
partial
RuntimeError: [torch.distributed] Connection closed by peer: cannot connect to rank 2 at 10.0.0.5:29500 (timeout=30)
ID: pytorch/distributed-timeout-connection
Fix Rate: 75%
Confidence: 87%
Evidence: 1
First Seen: 2023-06-10
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| torch>=1.10.0 | active | — | — | — |
| torch>=2.0.0 | active | — | — | — |
Root Cause
The distributed training process (rank 2) failed to initialize or became unreachable within the 30-second timeout window because of a network partition, a firewall rule, or a process crash.
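The causes listed above tend to look different at the TCP level, and a short probe can help tell them apart. The sketch below uses a hypothetical helper, `probe_endpoint`, which is not part of PyTorch. A refused connection usually suggests that the host is reachable but nothing is listening (the process never started or has crashed), while a timeout more often points to a firewall dropping packets or a network partition. Treat this mapping as a heuristic, not a definitive diagnosis.

```python
import socket


def probe_endpoint(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt as 'open', 'refused', 'timeout', or 'error'.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # Something accepted the connection on this port.
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: often dropped packets or a partition.
        return "timeout"
    except OSError:
        # Name resolution failure, unreachable network, and similar errors.
        return "error"
```

For the address in the error message, this would be called as `probe_endpoint("10.0.0.5", 29500)` from the node that reported the failure.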
Official Documentation
https://pytorch.org/docs/stable/distributed.html#troubleshooting
Workarounds
- 85% success: Verify that all ranks are started with the same world_size and master_addr: torchrun --nproc_per_node=4 --master_port=29500 train.py
- 90% success: Check firewall rules and ensure port 29500 is open: nc -zv 10.0.0.5 29500
- 80% success: Set the MASTER_ADDR and MASTER_PORT environment variables consistently across all nodes
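The consistency checks in the workarounds above can be sketched as a small comparison of the rendezvous settings seen on each node. This is a minimal illustration, not a PyTorch utility: `find_mismatches` and the per-node dictionaries are hypothetical, and it assumes you have already collected each node's environment yourself (for example, by printing it on every host). `MASTER_ADDR`, `MASTER_PORT`, and `WORLD_SIZE` are the variables that must agree across all ranks when initializing from the environment.

```python
# Variables that must hold the same value on every node.
SHARED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE")


def find_mismatches(envs_by_node):
    """Return {variable: {node: value}} for variables that are unset or differ.

    envs_by_node maps a node name to a dict of its environment variables.
    Hypothetical helper for illustration; not a PyTorch API.
    """
    mismatches = {}
    for name in SHARED_VARS:
        values = {node: env.get(name) for node, env in envs_by_node.items()}
        seen = set(values.values())
        # Flag the variable if any node lacks it or the nodes disagree.
        if None in seen or len(seen) > 1:
            mismatches[name] = values
    return mismatches


# Example input: node1 was launched with a different port (assumed values).
example = {
    "node0": {"MASTER_ADDR": "10.0.0.5", "MASTER_PORT": "29500", "WORLD_SIZE": "4"},
    "node1": {"MASTER_ADDR": "10.0.0.5", "MASTER_PORT": "29501", "WORLD_SIZE": "4"},
}
report = find_mismatches(example)
```

Here `report` contains only a `MASTER_PORT` entry, listing the value each node used, which is the kind of mismatch that leaves one rank waiting on a port nobody else connects to.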
Dead Ends
Common approaches that don't work:
- 60% fail: A longer timeout only delays the failure if the rank process is dead or the network is broken.
- 85% fail: Debug logs help diagnose the problem but don't fix the underlying connectivity issue.