pytorch
network_error
ai_generated
partial
RuntimeError: [torch.distributed] Connection closed by peer: cannot connect to rank 2 at 10.0.0.5:29500 (timeout=30)
ID: pytorch/distributed-timeout-connection
Fix rate: 75%
Confidence: 87%
Evidence count: 1
First seen: 2023-06-10
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| torch>=1.10.0 | active | — | — | — |
| torch>=2.0.0 | active | — | — | — |
Root cause analysis
A distributed training process (rank 2) failed to initialize, or became unreachable, within the 30-second timeout window. Typical causes are a network partition, a firewall blocking the port, or a crashed process.
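The three causes named above tend to look different at the TCP level, so a single connection attempt can help narrow them down. The following is a minimal sketch and not part of PyTorch: the helper name classify_connect_failure and its return labels are hypothetical, and the mapping from outcome to cause is only a rule of thumb (a firewall that rejects rather than drops packets will also appear as "refused").

```python
import socket


def classify_connect_failure(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and return a coarse label for the outcome.

    Hypothetical helper for illustration; the labels are assumptions,
    not values produced by torch.distributed.
    """
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on the port:
        # the process probably crashed or was never started.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: packets are probably being dropped
        # by a firewall or lost to a network partition.
        return "timeout"
    except OSError:
        # Anything else, for example no route to host or a DNS failure.
        return "unreachable"
```

Run from another node, classify_connect_failure("10.0.0.5", 29500) probes the address and port quoted in the error message.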
Official documentation
https://pytorch.org/docs/stable/distributed.html#troubleshooting
Solutions
- Verify that all ranks are started with the same world_size and master_addr: torchrun --nproc_per_node=4 --master_port=29500 train.py
- Check the firewall rules and ensure that port 29500 is open: nc -zv 10.0.0.5 29500
- Set the MASTER_ADDR and MASTER_PORT environment variables consistently across all nodes
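The consistency checks above can be partly automated before training starts. Below is a minimal sketch, assuming the job uses the environment-variable rendezvous (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK). The function names check_rendezvous_env and rendezvous_summary are hypothetical helpers, not PyTorch APIs.

```python
from typing import List, Mapping

# Variables read by the environment-variable rendezvous.
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def check_rendezvous_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means no problem found.

    Hypothetical helper: it only validates local settings and cannot
    detect a mismatch with other nodes on its own.
    """
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")

    port = env.get("MASTER_PORT", "")
    if port and not (port.isdecimal() and 0 < int(port) < 65536):
        problems.append(f"MASTER_PORT={port!r} is not a valid TCP port")

    world = env.get("WORLD_SIZE", "")
    rank = env.get("RANK", "")
    if world.isdecimal() and rank.isdecimal() and int(rank) >= int(world):
        problems.append(f"RANK={rank} must be smaller than WORLD_SIZE={world}")
    return problems


def rendezvous_summary(env: Mapping[str, str]) -> str:
    """One line that should be identical on every node of the job."""
    keys = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE")
    return " ".join(f"{k}={env.get(k, '<unset>')}" for k in keys)
```

Calling check_rendezvous_env(os.environ) at the top of the training script, and comparing the output of rendezvous_summary(os.environ) across nodes, is one way to catch a mismatched address, port, or world size before the 30-second timeout is reached.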
Ineffective attempts
Common but ineffective approaches:
- 60% failure rate: If the rank process is dead or the network is broken, a longer timeout only delays the failure.
- 85% failure rate: Debug logs help diagnose the problem but do not fix the underlying connectivity issue.