cuda
network_error
ai_generated
partial
RuntimeError: NCCL error: unhandled system error, NCCL version 2.18.5, error: IB device not found
ID: cuda/nccl-ib-device-not-found
Fix Rate: 80%
Confidence: 86%
Evidence: 1
First Seen: 2023-08-01
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| NCCL 2.18.5 | active | — | — | — |
| NCCL 2.19.3 | active | — | — | — |
| CUDA 11.8 | active | — | — | — |
| CUDA 12.1 | active | — | — | — |
Root Cause
NCCL (NVIDIA Collective Communications Library) attempts to use InfiniBand (IB) for inter-GPU communication but cannot find any IB device, either because the IB driver is not loaded or the hardware is not present.
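To tell the two causes apart, it can help to check what the kernel itself reports. The sketch below assumes a Linux host where RDMA devices are registered under `/sys/class/infiniband`; the helper name `list_ib_devices` is hypothetical and not part of NCCL or any library.

```python
from pathlib import Path


def list_ib_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Return the names of IB/RDMA devices registered under sysfs, or [] if none."""
    root = Path(sysfs_root)
    if not root.is_dir():
        # Assumption: the directory is typically absent when the IB core
        # driver is not loaded on this host.
        return []
    # Each entry is one device (for example an adapter exposed by its driver).
    return sorted(entry.name for entry in root.iterdir())


if __name__ == "__main__":
    devices = list_ib_devices()
    if devices:
        print("IB devices visible to the kernel:", ", ".join(devices))
    else:
        print("No IB devices visible; NCCL cannot use the IB transport on this host.")
```

An empty result points to a missing driver or missing hardware. A non-empty result suggests the problem lies elsewhere, for example in container device visibility.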
Official Documentation
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-disable
Workarounds
- 85% success: Set the environment variable NCCL_IB_DISABLE=1 to force NCCL to use TCP/IP sockets instead of InfiniBand for inter-node communication.
- 75% success: If IB hardware is available but the driver is not loaded, load the InfiniBand kernel module with `modprobe ib_core` and verify with `ibstat`.
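The first workaround can also be applied from inside the training script, as long as it runs before NCCL initializes. This is a minimal sketch, assuming a Linux host where IB devices appear under `/sys/class/infiniband`; `apply_ib_fallback` is a hypothetical helper name, while `NCCL_IB_DISABLE` and `NCCL_DEBUG` are documented NCCL environment variables.

```python
import os
from pathlib import Path


def apply_ib_fallback(env, sysfs_root="/sys/class/infiniband"):
    """Set NCCL_IB_DISABLE=1 in `env` when no IB device is visible.

    Returns True if the fallback was applied, False if an IB device exists.
    """
    root = Path(sysfs_root)
    has_ib = root.is_dir() and any(root.iterdir())
    if has_ib:
        return False
    env["NCCL_IB_DISABLE"] = "1"          # skip the IB transport, use IP sockets
    env.setdefault("NCCL_DEBUG", "INFO")  # optional: log the transport NCCL picks
    return True


if __name__ == "__main__":
    # Must run before the first NCCL communicator is created, e.g. before
    # torch.distributed.init_process_group(backend="nccl") in a PyTorch job.
    if apply_ib_fallback(os.environ):
        print("No IB device found; NCCL_IB_DISABLE=1 set for this process.")
```

Setting the variable only when no device is visible keeps InfiniBand in use on nodes that do have it. In a multi-node job every rank needs a consistent setting, so exporting the variable in the launcher may be the simpler choice.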
Dead Ends
Common approaches that don't work:
- 90% fail: NCCL will still probe for IB devices; if hardware is absent, the error persists regardless of NCCL version.
- 95% fail: The error is fatal; NCCL will abort the communication initialization, not just log a warning.