# RuntimeError: NCCL error: unhandled system error, NCCL version 2.18.5, error: no IB device found

- **ID:** `cuda/nccl-ib-device-not-found`
- **Domain:** cuda
- **Category:** network_error
- **Verification level:** ai_generated
- **Fix rate:** 80%

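## Root cause

NCCL (NVIDIA Collective Communications Library) tries to use InfiniBand (IB) for inter-GPU communication but finds no IB device, typically because the IB driver is not loaded or the hardware is not present.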
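To tell the two causes apart, a minimal diagnostic sketch is given here. It assumes a Linux host where loaded kernel modules are listed in `/proc/modules` and registered IB devices appear under `/sys/class/infiniband`. The function name `diagnose_ib` and its return labels are hypothetical, chosen for illustration only.

```python
from pathlib import Path


def diagnose_ib(modules_file="/proc/modules", sysfs_dir="/sys/class/infiniband"):
    """Return (status, devices) describing the local InfiniBand state.

    status is one of the hypothetical labels:
      "ok"                at least one IB device is registered
      "driver_not_loaded" no device and ib_core is not listed as a loaded module
      "no_device"         ib_core is loaded but no device is registered
    Assumption: if ib_core is built into the kernel rather than loaded as a
    module, it will not appear in /proc/modules and this check can mislabel it.
    """
    loaded = set()
    modules_path = Path(modules_file)
    if modules_path.is_file():
        for line in modules_path.read_text().splitlines():
            if line.strip():
                # The first whitespace-separated field is the module name.
                loaded.add(line.split()[0])

    sysfs = Path(sysfs_dir)
    devices = sorted(p.name for p in sysfs.iterdir()) if sysfs.is_dir() else []

    if devices:
        return "ok", devices
    if "ib_core" not in loaded:
        return "driver_not_loaded", devices
    return "no_device", devices


if __name__ == "__main__":
    status, devices = diagnose_ib()
    print(status, devices)
```

A result of `driver_not_loaded` suggests trying the module-loading fix, while `no_device` with the driver present points toward absent hardware and the socket fallback.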

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| NCCL 2.18.5 | active | — | — |
| NCCL 2.19.3 | active | — | — |
| CUDA 11.8 | active | — | — |
| CUDA 12.1 | active | — | — |

## Solutions

1. Set the environment variable `NCCL_IB_DISABLE=1` so that NCCL uses TCP/IP sockets instead of InfiniBand for inter-node communication.
2. If IB hardware is present but the driver is not loaded, load the InfiniBand kernel module with `modprobe ib_core`, then verify with `ibstat`.
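The first option can be sketched in Python, assuming the training job is started as a child process from a launcher script. The helper name `nccl_env_without_ib` is hypothetical. `NCCL_IB_DISABLE` and `NCCL_DEBUG` are documented NCCL environment variables; enabling `NCCL_DEBUG=INFO` is an optional extra here, intended to make NCCL's log report the transport it selected.

```python
import os


def nccl_env_without_ib(base_env=None):
    """Return a copy of the environment with the InfiniBand transport disabled.

    base_env defaults to the current process environment. The input mapping
    is never modified; a new dict is returned.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Tell NCCL not to use the IB transport, so it falls back to IP sockets.
    env["NCCL_IB_DISABLE"] = "1"
    # Optional: verbose NCCL logging; keep any value the caller already set.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


if __name__ == "__main__":
    env = nccl_env_without_ib()
    print(env["NCCL_IB_DISABLE"], env["NCCL_DEBUG"])
    # Hypothetical launch; "train.py" is a placeholder script name:
    # import subprocess
    # subprocess.run(["python", "train.py"], env=env, check=True)
```

The variable has to be in the environment of every participating process before NCCL initializes, which is why the sketch builds the environment for the child process rather than changing it after startup.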

## Failed attempts

- NCCL will still probe for IB devices; if the hardware is absent, the error persists regardless of NCCL version. (90% failure rate)
- The error is fatal: NCCL aborts communication initialization rather than just logging a warning. (95% failure rate)
