cuda
network_error
ai_generated
partial
RuntimeError: NCCL error: socket error: connect to 192.168.1.100:29500 failed: Connection timed out
ID: cuda/nccl-socket-timeout-connect
Fix Rate: 80%
Confidence: 87%
Evidence: 1
First Seen: 2023-03-10
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| NCCL 2.18.5 | active | — | — | — |
| NCCL 2.19.3 | active | — | — | — |
| CUDA 12.2 | active | — | — | — |
| PyTorch 2.1 | active | — | — | — |
Root Cause
NCCL cannot establish a TCP connection to a peer node in the distributed training setup, typically due to firewall rules, incorrect IP addresses, or network interface misconfiguration.
Official Documentation
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
Workarounds
- 85% success: Verify network connectivity from each node by running nc -zv 192.168.1.100 29500 while the job on the target node is listening. A timeout points to a firewall or routing problem, whereas "connection refused" means the host is reachable but nothing is listening on that port. If the check times out, open the port in the firewall (29500 is the default PyTorch rendezvous port) or choose a different port with MASTER_PORT. Note that NCCL_SOCKET_IFNAME and NCCL_IB_DISABLE=1 do not change the port; they select the network interface and disable InfiniBand, respectively.
- 75% success: Set NCCL_SOCKET_IFNAME=eth0 (or the correct network interface) to force NCCL to use a specific interface. Also ensure all nodes can reach each other via that interface.
- 70% success: Disable InfiniBand and use TCP only: export NCCL_IB_DISABLE=1. This is useful if IB is misconfigured but TCP works.
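The workarounds above can also be scripted from the training launcher. The following is a minimal sketch, not part of the official NCCL or PyTorch tooling: `can_connect` and `apply_nccl_env` are hypothetical helper names introduced here for illustration. It assumes the variables must be set before `torch.distributed.init_process_group` is called, and it uses only the Python standard library.

```python
import os
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper; roughly what `nc -zv host port` checks.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


def apply_nccl_env(ifname: str, disable_ib: bool = False) -> dict:
    """Set the NCCL variables named in the workarounds and return what was set.

    Hypothetical helper; call it before initializing the process group.
    """
    settings = {"NCCL_SOCKET_IFNAME": ifname}
    if disable_ib:
        # Force TCP sockets instead of InfiniBand.
        settings["NCCL_IB_DISABLE"] = "1"
    os.environ.update(settings)
    return settings
```

On a worker node, one might call `can_connect("192.168.1.100", 29500)` once the rank-0 process is listening, and then `apply_nccl_env("eth0", disable_ib=True)` before starting training. The address, port, and interface name are the example values from this entry, not universal defaults for every cluster.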
Dead Ends
Common approaches that don't work:
- 90% fail: Increasing the timeout only delays the failure. If the connection cannot be established, the job still times out and wastes resources in the meantime.
- 95% fail: Reinstalling does not help. The issue is network-related, not installation-related, and reinstalling doesn't change firewall rules or IP configuration.
- 85% fail: Downgrading does not help. Socket connection logic is similar across versions, so downgrading may introduce other bugs without fixing the connectivity problem.