# RuntimeError: NCCL error: unhandled system error, NCCL version 2.18.5, error: IB device not found

- **ID:** `cuda/nccl-ib-device-not-found`
- **Domain:** cuda
- **Category:** network_error
- **Verification:** ai_generated
- **Fix Rate:** 80%

## Root Cause

NCCL (NVIDIA Collective Communications Library) tries to use InfiniBand (IB) for inter-node GPU communication but cannot find any IB device. This happens either because the IB kernel driver is not loaded or because the host has no IB hardware.
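To tell the two causes apart, it can help to check whether the kernel has registered any IB device at all. The following is a minimal sketch, assuming a Linux host where the RDMA subsystem exposes devices under `/sys/class/infiniband`; the helper name `list_ib_devices` is hypothetical and not part of NCCL.

```python
from pathlib import Path


def list_ib_devices(sysfs_root="/sys/class/infiniband"):
    """Return the names of IB/RDMA devices registered with the kernel.

    Assumption: on Linux, loaded IB drivers expose one directory per
    device under /sys/class/infiniband. A missing or empty directory
    suggests that the driver is not loaded or the hardware is absent.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


if __name__ == "__main__":
    devices = list_ib_devices()
    if devices:
        print("IB devices visible to the kernel:", ", ".join(devices))
    else:
        print("No IB devices registered; NCCL's IB transport has nothing to open.")
```

An empty result does not say which of the two causes applies; it only confirms that NCCL has no device to open on this host.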

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| NCCL 2.18.5 | active | — | — |
| NCCL 2.19.3 | active | — | — |
| CUDA 11.8 | active | — | — |
| CUDA 12.1 | active | — | — |

## Workarounds

1. **Set the environment variable `NCCL_IB_DISABLE=1` to force NCCL to use TCP/IP sockets instead of InfiniBand for inter-node communication.** (85% success)
   ```
   export NCCL_IB_DISABLE=1
   ```
2. **If IB hardware is present but the driver is not loaded, load the InfiniBand kernel module with `modprobe ib_core` and verify with `ibstat`.** (75% success)
   ```
   sudo modprobe ib_core
   ibstat
   ```
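The first workaround can also be applied from inside a launcher script, so that the fallback only kicks in on hosts without IB. This is a hedged sketch rather than a definitive implementation: `apply_ib_fallback` is a hypothetical helper, the sysfs path is a Linux assumption, and the variable has to be set before NCCL initializes its communicator.

```python
import os
from pathlib import Path


def apply_ib_fallback(env, sysfs_root="/sys/class/infiniband"):
    """Set NCCL_IB_DISABLE=1 in `env` when no IB device is registered.

    Returns True if the fallback was applied, False if an IB device
    was found and `env` was left untouched.
    Assumption: registered IB devices appear as entries under sysfs_root.
    """
    root = Path(sysfs_root)
    has_ib = root.is_dir() and any(root.iterdir())
    if has_ib:
        return False
    env["NCCL_IB_DISABLE"] = "1"  # NCCL then uses TCP/IP sockets between nodes
    return True


if __name__ == "__main__":
    # Call this before any NCCL initialization in the same process,
    # for example before creating a distributed process group.
    if apply_ib_fallback(os.environ):
        print("No IB device found; set NCCL_IB_DISABLE=1")
    else:
        print("IB device present; leaving NCCL transport selection unchanged")
```

Passing the environment mapping as an argument keeps the helper easy to test with a plain `dict`. Socket transport is typically slower than InfiniBand, so the fallback is best limited to hosts that really lack IB.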

## Dead Ends

- Switching NCCL versions does not help: NCCL still probes for IB devices, and if the hardware is absent the error persists regardless of version. (90% fail)
- The error cannot be ignored: it is fatal, and NCCL aborts communicator initialization rather than just logging a warning. (95% fail)
