# RuntimeError: NCCL error: socket error: connect to 192.168.1.100:29500 failed: Connection timed out

- **ID:** `cuda/nccl-socket-timeout-connect`
- **Domain:** cuda
- **Category:** network_error
- **Verification:** ai_generated
- **Fix Rate:** 80%

## Root Cause

NCCL cannot establish a TCP connection to a peer node in the distributed training setup, typically due to firewall rules, incorrect IP addresses, or network interface misconfiguration.
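
The causes above show up differently at the TCP level, so a quick probe can narrow things down before any NCCL settings are changed. The following is a minimal sketch, not part of NCCL or PyTorch; the helper name `probe_tcp` and the 3-second timeout are assumptions chosen for illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome.

    Hypothetical helper for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # Handshake completed: something is listening and reachable.
            return "open"
    except ConnectionRefusedError:
        # Host answered, but nothing is listening on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all: commonly a firewall drop, wrong IP, or wrong route.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return f"error: {exc}"
```

Calling `probe_tcp("192.168.1.100", 29500)` from each worker while the rank 0 process is running gives a rough reading: `"timeout"` matches the error in this entry and points at the firewall, address, or routing, while `"refused"` suggests the node is reachable but nothing is listening on that port yet.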

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| NCCL 2.18.5 | active | — | — |
| NCCL 2.19.3 | active | — | — |
| CUDA 12.2 | active | — | — |
| PyTorch 2.1 | active | — | — |

## Workarounds

1. **Verify network connectivity between nodes by running `nc -zv 192.168.1.100 29500` from each node while the rank 0 process is listening. If the check fails, open the port (default 29500) in the firewall, or move the rendezvous to a port that is already open (for example via `MASTER_PORT` when PyTorch's `env://` initialization is used). `NCCL_SOCKET_IFNAME` and `NCCL_IB_DISABLE` do not change the port.** (85% success)
   ```
   # Run from every worker node while the rank 0 process is up and listening
   nc -zv 192.168.1.100 29500
   ```
2. **Set `NCCL_SOCKET_IFNAME=eth0` (or the correct network interface) on every node to force NCCL to use a specific interface. Also ensure all nodes can reach each other via that interface.** (75% success)
   ```
   ip -br addr                      # find the interface that carries the cluster network
   export NCCL_SOCKET_IFNAME=eth0   # replace eth0 with that interface name
   ```
3. **Disable InfiniBand and use TCP only by setting `NCCL_IB_DISABLE=1` on every node before launching. This is useful if IB is misconfigured but TCP works.** (70% success)
   ```
   export NCCL_IB_DISABLE=1
   ```
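
When the job is launched from a Python script rather than a shell, the same variables can be set in the process environment before the NCCL backend is initialized. This is a minimal sketch under stated assumptions: the helper names `nccl_socket_env` and `apply_nccl_env` are hypothetical, `eth0` is a placeholder interface, and `NCCL_DEBUG=INFO` is an optional extra (not one of the workarounds above) that makes NCCL log which interface and transport it selects.

```python
import os


def nccl_socket_env(ifname: str, disable_ib: bool = True, debug: bool = True) -> dict:
    """Build the NCCL environment variables used in workarounds 2 and 3.

    Hypothetical helper for illustration only.
    """
    env = {"NCCL_SOCKET_IFNAME": ifname}  # pin NCCL to one network interface
    if disable_ib:
        env["NCCL_IB_DISABLE"] = "1"      # skip InfiniBand, use TCP sockets
    if debug:
        env["NCCL_DEBUG"] = "INFO"        # assumption: verbose logs are wanted
    return env


def apply_nccl_env(env: dict) -> None:
    """Export the variables into this process before NCCL is initialized."""
    os.environ.update(env)
```

A typical use would be `apply_nccl_env(nccl_socket_env("eth0"))` on every rank, placed before the call to `torch.distributed.init_process_group(backend="nccl")`, since NCCL reads these variables when it initializes.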

## Dead Ends

- **Increasing the timeout:** A longer timeout only delays the failure. If the connection cannot be established, it still times out eventually, wasting resources in the meantime. (90% fail)
- **Reinstalling the software stack:** The issue is network-related, not installation-related. Reinstalling does not change firewall rules or IP configuration. (95% fail)
- **Downgrading to an older version:** Socket connection logic is similar across versions, so downgrading may introduce other bugs without fixing the connectivity problem. (85% fail)
