cuda
network_error
ai_generated
partial
RuntimeError: NCCL error: unhandled system error, NCCL version 2.18.5, error: IB device not found
ID: cuda/nccl-ib-device-not-found
Fix rate: 80%
Confidence: 86%
Evidence count: 1
First seen: 2023-08-01
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| NCCL 2.18.5 | active | — | — | — |
| NCCL 2.19.3 | active | — | — | — |
| CUDA 11.8 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
Root cause analysis
NCCL (NVIDIA Collective Communications Library) attempts to use InfiniBand (IB) for GPU communication across nodes but cannot find any IB device, either because the IB driver is not loaded or because the hardware is not present.
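To see whether any IB device is visible at all, one can inspect what the kernel exposes. The following is a minimal sketch, assuming a Linux host where RDMA devices are registered under `/sys/class/infiniband`. The helper name `list_ib_devices` is hypothetical and is not part of NCCL.

```python
import os


def list_ib_devices(sysfs_root="/sys/class/infiniband"):
    """Return the names of the RDMA/InfiniBand devices the kernel has registered.

    An empty list means either that no IB hardware is present or that its
    driver is not loaded; this check alone cannot tell the two apart.
    """
    try:
        return sorted(os.listdir(sysfs_root))
    except (FileNotFoundError, NotADirectoryError):
        # Assumption: the directory is typically absent when the IB core
        # kernel module is not loaded.
        return []


if __name__ == "__main__":
    devices = list_ib_devices()
    print("IB devices:", devices if devices else "none found")
```

An empty result is consistent with the "IB device not found" error; a non-empty result suggests the cause lies elsewhere, for example in the container or permission setup.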
Official documentation
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-disable
Solutions
- Set the environment variable `NCCL_IB_DISABLE=1` to stop NCCL from using the InfiniBand transport, so that it falls back to IP sockets (TCP/IP) for inter-node communication.
- If IB hardware is available but the driver is not loaded, load the InfiniBand kernel module with `modprobe ib_core` and verify with `ibstat`.
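The first option can also be applied from inside the training script, provided it runs before NCCL initializes. The following is a minimal sketch, assuming a Linux host where IB devices appear under `/sys/class/infiniband`. The function name `disable_ib_if_absent` is hypothetical; only the `NCCL_IB_DISABLE` variable comes from the NCCL documentation.

```python
import os


def disable_ib_if_absent(env=os.environ, sysfs_root="/sys/class/infiniband"):
    """Set NCCL_IB_DISABLE=1 when no IB device is visible, unless already set.

    Returns the effective value of NCCL_IB_DISABLE, or None if it stays unset.
    """
    has_ib = os.path.isdir(sysfs_root) and bool(os.listdir(sysfs_root))
    if not has_ib:
        # setdefault keeps an explicit choice made by the user or the launcher.
        env.setdefault("NCCL_IB_DISABLE", "1")
    return env.get("NCCL_IB_DISABLE")


if __name__ == "__main__":
    # Call this in every rank before the framework creates its NCCL
    # communicator; NCCL reads the variable during initialization.
    print("NCCL_IB_DISABLE =", disable_ib_if_absent())
```

Exporting the variable in the launch environment has the same effect and avoids any dependency on import order.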
Ineffective attempts
Common approaches that do not work:
- 90% failure rate: NCCL will still probe for IB devices; if the hardware is absent, the error persists regardless of NCCL version.
- 95% failure rate: The error is fatal; NCCL aborts communication initialization rather than just logging a warning.