cuda network_error ai_generated partial

RuntimeError: NCCL error: socket error: connect to 192.168.1.100:29500 failed: Connection timed out

ID: cuda/nccl-socket-timeout-connect

Fix rate: 80%

Confidence: 87%

Evidence count: 1

First seen: 2023-03-10

Version compatibility

Version	Status	Introduced	Deprecated	Notes
NCCL 2.18.5	active	—	—	—
NCCL 2.19.3	active	—	—	—
CUDA 12.2	active	—	—	—
PyTorch 2.1	active	—	—	—

Root cause analysis

NCCL cannot establish a TCP connection to a peer node in the distributed training setup. The usual causes are firewall rules that block the port, an incorrect IP address, or a misconfigured network interface.

generic

Official documentation

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html

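Solutions

Verify network connectivity from every node with nc -zv 192.168.1.100 29500. If the check fails, open the port in the firewall (29500 is the default), or move the rendezvous to a port that is already open, for example by setting MASTER_PORT when launching with PyTorch. Note that NCCL_SOCKET_IFNAME and NCCL_IB_DISABLE=1 select the network interface and the transport; they do not change the port.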
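Where nc is not installed, the same reachability check can be sketched in Python. This is a minimal sketch using only the standard library; the helper name can_connect and the 5-second default timeout are assumptions for illustration, not part of NCCL or PyTorch.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

Calling can_connect("192.168.1.100", 29500) on each worker node while the master process is listening gives the same signal as nc -zv: False means the port is blocked, the address is wrong, or nothing is listening there yet.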

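Set NCCL_SOCKET_IFNAME=eth0 (or whichever interface is correct on your nodes) to force NCCL to use a specific network interface. Also make sure that all nodes can reach each other over that interface.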
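A typo in the interface name is easy to miss, so it can help to check the name against the interfaces that actually exist on the node before setting the variable. The sketch below assumes a single exact interface name; the helper select_nccl_interface is hypothetical and does not handle the prefix or exclusion syntax that NCCL_SOCKET_IFNAME also accepts.

```python
import os
import socket


def select_nccl_interface(ifname: str) -> None:
    """Set NCCL_SOCKET_IFNAME after checking that the interface exists locally."""
    # socket.if_nameindex() returns (index, name) pairs for the local interfaces.
    available = [name for _, name in socket.if_nameindex()]
    if ifname not in available:
        raise ValueError(f"interface {ifname!r} not found; available: {available}")
    # Must run before the process group (and therefore NCCL) is initialized.
    os.environ["NCCL_SOCKET_IFNAME"] = ifname
```

Call select_nccl_interface("eth0") at the top of the training script, before any distributed initialization, so that NCCL reads the variable when it starts.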

Disable InfiniBand and use TCP only: export NCCL_IB_DISABLE=1. This helps when InfiniBand is misconfigured but TCP works.
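To try the TCP-only setting for a single run without changing the shell environment, the variable can be passed to the launch command only. This is a sketch under assumptions: run_with_tcp_only is a hypothetical wrapper, and NCCL_DEBUG=INFO is added so that the NCCL log shows which transport and interface were chosen.

```python
import os
import subprocess
import sys
from typing import Sequence


def run_with_tcp_only(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run `cmd` with InfiniBand disabled for NCCL and NCCL logging enabled."""
    env = os.environ.copy()
    env["NCCL_IB_DISABLE"] = "1"  # same effect as: export NCCL_IB_DISABLE=1
    env["NCCL_DEBUG"] = "INFO"    # log the transport and interface NCCL selects
    # The child inherits the modified environment; the parent shell is untouched.
    return subprocess.run(list(cmd), env=env)
```

Pass your usual launch command as the list, for example run_with_tcp_only([sys.executable, "train.py"]), where train.py stands in for your own script.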

Ineffective attempts

Common but ineffective approaches:

90% failure rate
Increasing the timeout only delays the failure: a connection that cannot be established will still time out, and the extra wait wastes resources.
95% failure rate
The issue is network-related, not installation-related. Reinstalling does not change firewall rules or IP configuration.
85% failure rate
Socket connection logic is similar across versions; downgrading may introduce other bugs without fixing the connectivity problem.