cuda network_error ai_generated partial

RuntimeError: NCCL error: unhandled system error, NCCL version 2.18.5, error: IB device not found

ID: cuda/nccl-ib-device-not-found

Fix rate: 80%
Confidence: 86%
Evidence count: 1
First seen: 2023-08-01

Version compatibility

Version       Status   Introduced   Deprecated   Notes
NCCL 2.18.5   active
NCCL 2.19.3   active
CUDA 11.8     active
CUDA 12.1     active

Root cause analysis

NCCL (NVIDIA Collective Communications Library) attempts to use InfiniBand (IB) for inter-GPU communication but cannot find any IB device, either because the IB driver is not loaded or because the hardware is not present.
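
To tell the two causes apart, it can help to check whether the kernel exposes any IB devices at all. The following is a minimal sketch, assuming a Linux host where loaded RDMA drivers register their devices under `/sys/class/infiniband`; the helper name `list_ib_devices` is hypothetical and not part of NCCL.

```python
import os


def list_ib_devices(sysfs_dir="/sys/class/infiniband"):
    """Return the sorted device names found in sysfs_dir, or [] if none.

    Assumption: on Linux, loaded RDMA/IB drivers expose one entry per
    device in this directory. A missing directory is treated as "no devices".
    """
    try:
        return sorted(os.listdir(sysfs_dir))
    except OSError:
        # Directory missing or unreadable: no IB devices are visible.
        return []


if __name__ == "__main__":
    devices = list_ib_devices()
    if devices:
        print("IB devices visible to the kernel:", ", ".join(devices))
    else:
        print("No IB devices visible: driver not loaded or hardware absent.")
```

An empty result does not say which of the two causes applies; it only confirms that no IB device is currently visible to the kernel.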

generic

Official documentation

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-disable

Solutions

  1. Set the environment variable NCCL_IB_DISABLE=1 so that NCCL falls back to TCP/IP sockets instead of InfiniBand for inter-node communication.
  2. If IB hardware is present but the driver is not loaded, load the InfiniBand kernel module with `modprobe ib_core` and verify the result with `ibstat`.
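
The first solution can be applied from inside a Python launcher script. This is a sketch under stated assumptions: the helper name `disable_nccl_ib` is hypothetical, and setting `NCCL_DEBUG=INFO` is an optional addition (not part of the original fix) that makes NCCL log which transport it selects.

```python
import os


def disable_nccl_ib(env=None):
    """Set NCCL_IB_DISABLE=1 in env (defaults to os.environ) and return it.

    Must run before the framework initializes NCCL, because NCCL reads
    its environment variables at initialization time.
    """
    target = os.environ if env is None else env
    target["NCCL_IB_DISABLE"] = "1"
    # Optional assumption: verbose logging, only if the user has not set it.
    target.setdefault("NCCL_DEBUG", "INFO")
    return target


if __name__ == "__main__":
    disable_nccl_ib()
    print("NCCL_IB_DISABLE =", os.environ["NCCL_IB_DISABLE"])
```

In a PyTorch job, for example, the call would go before `torch.distributed.init_process_group`. The shell equivalent is `export NCCL_IB_DISABLE=1` in the environment of every rank.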

Ineffective attempts

Common but ineffective approaches:

  1. 90% failure rate

    NCCL will still probe for IB devices; if hardware is absent, the error persists regardless of NCCL version.

  2. 95% failure rate

    The error is fatal; NCCL will abort the communication initialization, not just log a warning.