pytorch network_error ai_generated partial

RuntimeError: [torch.distributed] Connection closed by peer: cannot connect to rank 2 at 10.0.0.5:29500 (timeout=30)

ID: pytorch/distributed-timeout-connection

Fix rate: 75%
Confidence: 87%
Evidence count: 1
First seen: 2023-06-10

Version compatibility

Version         Status   Introduced   Deprecated   Notes
torch>=1.10.0   active   -            -            -
torch>=2.0.0    active   -            -            -

Root cause analysis

A distributed training process (rank 2) failed to initialize, or became unreachable, within the 30-second timeout window. Typical causes are a network partition, a firewall blocking the port, or a crashed process.
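The three causes tend to look different at the TCP level, so a quick probe can narrow them down. The following is a minimal sketch using only the Python standard library; `probe_endpoint` is a hypothetical helper name, not part of PyTorch, and the mapping from result to cause is a rule of thumb rather than a guarantee.

```python
import socket


def probe_endpoint(host, port, timeout=5.0):
    """Classify a single TCP connection attempt to host:port.

    Rough interpretation (an assumption, not a guarantee):
      "reachable" -> something is listening; look at launch arguments instead
      "refused"   -> host is up but nothing listens; process likely crashed
                     or has not started yet
      "timed_out" -> packets are being dropped; firewall or network partition
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timed_out"
    except OSError as exc:
        # DNS failures, unreachable networks and similar errors land here.
        return f"error: {exc}"
```

Running `probe_endpoint("10.0.0.5", 29500)` from the node that reported the error gives a first hint about which of the causes above applies.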

generic

Official documentation

https://pytorch.org/docs/stable/distributed.html#troubleshooting

Solutions

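  1. Verify that all ranks are started with the same world size and master address, for example: torchrun --nproc_per_node=4 --master_addr=10.0.0.5 --master_port=29500 train.py (multi-node launches also need a matching --nnodes and a distinct --node_rank on each node)
  2. Check firewall rules and confirm that port 29500 is reachable from every node: nc -zv 10.0.0.5 29500
  3. Set the MASTER_ADDR and MASTER_PORT environment variables to the same values on every node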
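Steps 1 and 3 can be partly checked from inside the training script before the process group is created. The sketch below assumes the `env://` initialization method, which reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` from the environment; `check_rendezvous_env` is a hypothetical helper, not a PyTorch function.

```python
import os

# Variables read by the env:// rendezvous method.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def check_rendezvous_env(env=None):
    """Return a list of problems found in the rendezvous variables."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    if problems:
        return problems
    try:
        port = int(env["MASTER_PORT"])
        world_size = int(env["WORLD_SIZE"])
        rank = int(env["RANK"])
    except ValueError:
        return ["MASTER_PORT, WORLD_SIZE and RANK must be integers"]
    if not 1 <= port <= 65535:
        problems.append(f"MASTER_PORT {port} is not a valid TCP port")
    if world_size < 1:
        problems.append(f"WORLD_SIZE {world_size} must be at least 1")
    elif not 0 <= rank < world_size:
        problems.append(f"RANK {rank} is outside 0..{world_size - 1}")
    return problems
```

Logging the returned list together with the hostname on every node makes a mismatched address or port visible before the 30-second timeout expires. The check only validates each node locally; comparing the values across nodes still has to be done by reading those logs.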

Ineffective attempts

Common but ineffective approaches:

  1. 60% failure rate

    If the rank process is dead or the network is broken, a longer timeout only delays the failure.

  2. 85% failure rate

    Debug logs help diagnose the problem but do not fix the underlying connectivity issue.