cuda network_error ai_generated partial

RuntimeError: NCCL error: unhandled system error, NCCL version 2.18.5, error: IB device not found

ID: cuda/nccl-ib-device-not-found

Fix Rate: 80%
Confidence: 86%
Evidence: 1
First Seen: 2023-08-01

Version Compatibility

Version        Status   Introduced   Deprecated   Notes
NCCL 2.18.5    active   -            -            -
NCCL 2.19.3    active   -            -            -
CUDA 11.8      active   -            -            -
CUDA 12.1      active   -            -            -

Root Cause

NCCL (NVIDIA Collective Communications Library) tries to use InfiniBand (IB) for inter-node GPU communication but cannot find an IB device, either because the IB kernel driver is not loaded or because no IB hardware is present.
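
To tell which of the two causes applies, one option is to check whether the kernel exposes any IB device at all. The sketch below is an illustration, not part of NCCL: it assumes a Linux host where the RDMA subsystem registers devices under `/sys/class/infiniband`, and the helper name `list_ib_devices` is hypothetical.

```python
from pathlib import Path

# Assumption: standard Linux sysfs location where IB devices are registered.
IB_SYSFS_ROOT = Path("/sys/class/infiniband")


def list_ib_devices(sysfs_root: Path = IB_SYSFS_ROOT) -> list[str]:
    """Return the names of IB devices found under sysfs_root, or [] if none."""
    if not sysfs_root.is_dir():
        # Directory missing: the IB core module is probably not loaded.
        return []
    return sorted(entry.name for entry in sysfs_root.iterdir())


if __name__ == "__main__":
    devices = list_ib_devices()
    if devices:
        print("IB devices visible to the kernel:", ", ".join(devices))
    else:
        print("No IB devices found under", IB_SYSFS_ROOT)
```

An empty result suggests that either the driver is not loaded or the hardware is absent (or not exposed to the current container), which matches the condition NCCL reports.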

generic

Official Documentation

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-disable

Workarounds

  1. 85% success: Set the environment variable NCCL_IB_DISABLE=1 to force NCCL to use IP sockets (TCP/IP) instead of InfiniBand for inter-node communication. Set it on every rank before NCCL initializes.
  2. 75% success: If IB hardware is present but the driver is not loaded, load the InfiniBand kernel modules (`modprobe ib_core`, plus the driver module for your adapter) and verify with `ibstat` that a device is listed.
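
The first workaround can be applied from a launcher script by building the environment for the training process. This is a minimal sketch under stated assumptions: the helper name `env_with_ib_disabled` is hypothetical, and it additionally sets `NCCL_DEBUG=INFO` (a documented NCCL variable) so the logs show which transport NCCL selects.

```python
import os


def env_with_ib_disabled(base_env=None, debug=True):
    """Return a copy of the environment with NCCL's IB transport disabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Tell NCCL not to use the InfiniBand transport (falls back to IP sockets).
    env["NCCL_IB_DISABLE"] = "1"
    if debug:
        # Keep a caller-provided log level; otherwise request INFO logs.
        env.setdefault("NCCL_DEBUG", "INFO")
    return env
```

Pass the result as `env=` to `subprocess.run` together with your own launch command, on every node, so that all ranks start with the same setting.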

Dead Ends

Common approaches that don't work:

  1. 90% fail. Reason: NCCL will still probe for IB devices; if the hardware is absent, the error persists regardless of NCCL version.
  2. 95% fail. Reason: The error is fatal; NCCL aborts communication initialization rather than just logging a warning.