pytorch network_error ai_generated partial

RuntimeError: [torch.distributed] Connection closed by peer: cannot connect to rank 2 at 10.0.0.5:29500 (timeout=30)

ID: pytorch/distributed-timeout-connection

Fix Rate: 75%
Confidence: 87%
Evidence: 1
First Seen: 2023-06-10

Version Compatibility

Version        Status  Introduced  Deprecated  Notes
torch>=1.10.0  active  -           -           -
torch>=2.0.0   active  -           -           -

Root Cause

The distributed training process at rank 2 failed to initialize, or became unreachable, within the 30-second timeout window. Typical causes are a network partition, a firewall blocking the port, or a crashed process.

generic

Official Documentation

https://pytorch.org/docs/stable/distributed.html#troubleshooting

Workarounds

  1. 85% success: Verify that all ranks are started with the same world_size and master_addr: torchrun --nproc_per_node=4 --master_port=29500 train.py
  2. 90% success: Check firewall rules and ensure port 29500 is open: nc -zv 10.0.0.5 29500
  3. 80% success: Set the MASTER_ADDR and MASTER_PORT environment variables consistently across all nodes.
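The connectivity and environment checks above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names missing_rendezvous_vars and can_connect are hypothetical and not part of PyTorch. It assumes the env:// rendezvous convention, in which MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are read from the environment.

```python
import os
import socket

# Variables read by the env:// init method (torchrun normally sets them).
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def missing_rendezvous_vars(env=None):
    """Return the names of rendezvous variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Roughly what `nc -zv host port` checks: the port is reachable and
    something is listening on it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Running can_connect("10.0.0.5", 29500) from each worker node while the master process is listening gives a quick indication of whether a firewall or routing problem is involved. A False result while the master is not yet listening is expected and does not by itself indicate a firewall issue.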

Dead Ends

Common approaches that don't work:

  1. 60% fail: If the rank process is dead or the network is broken, a longer timeout just delays the failure.

  2. 85% fail: Debug logs help diagnose the problem but don't fix the underlying connectivity issue.
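Since debug logs remain useful for diagnosis even though they fix nothing, here is a small sketch of how they might be switched on for a launched job. The helper name debug_env is hypothetical. TORCH_DISTRIBUTED_DEBUG, TORCH_CPP_LOG_LEVEL and NCCL_DEBUG are real environment variables, but NCCL_DEBUG only has an effect when the NCCL backend is in use, which is an assumption here.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with distributed debug logging enabled.

    Hypothetical helper for illustration; `base` defaults to os.environ.
    The original mapping is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # verbose torch.distributed checks
    env["TORCH_CPP_LOG_LEVEL"] = "INFO"        # lets the C++ side emit those logs
    env["NCCL_DEBUG"] = "INFO"                 # only relevant for the NCCL backend
    return env
```

The returned mapping can be passed as the env argument of subprocess.run when launching the torchrun command, so the extra logging applies to that run only.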