ncclSystemError cuda config_error ai_generated true

RuntimeError: NCCL error: version mismatch, expected 2.18.5 but got 2.19.1

ID: cuda/nccl-version-mismatch

Fix rate: 90%
Confidence: 87%
Evidence count: 1
First seen: 2024-03-20

Version compatibility

Version         Status   Introduced   Deprecated   Notes
NCCL 2.18.5     active
NCCL 2.19.1     active
PyTorch 2.1.0   active

Root cause analysis

The NCCL library loaded at runtime (2.19.1 in this error) differs from the version PyTorch expects (2.18.5). This is usually caused by multiple NCCL installations on the system or by an `LD_LIBRARY_PATH` that points to the wrong one.
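
To check whether several NCCL copies are visible to the dynamic linker, a small script can list every `libnccl.so*` file found in the `LD_LIBRARY_PATH` directories, in search order. This is a minimal sketch using only the standard library; the function name `find_nccl_libraries` is hypothetical, and the script does not look at default system paths or at the copy bundled inside a PyTorch wheel.

```python
import os
from pathlib import Path


def find_nccl_libraries(ld_library_path=None):
    """Return libnccl shared objects found in LD_LIBRARY_PATH directories.

    Results follow the order of the path entries, which is the order the
    dynamic linker searches them. Hypothetical helper for illustration.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")

    found = []
    for entry in ld_library_path.split(os.pathsep):
        # Skip empty entries and entries that are not existing directories.
        if not entry:
            continue
        directory = Path(entry)
        if not directory.is_dir():
            continue
        # Matches libnccl.so, libnccl.so.2, libnccl.so.2.19.1, and so on.
        for lib in sorted(directory.glob("libnccl.so*")):
            found.append(str(lib))
    return found


if __name__ == "__main__":
    for path in find_nccl_libraries():
        print(path)
```

If the output lists more than one directory, the first one shown is the one most likely to take precedence over the others.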

generic

Official documentation

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html

Solutions

  1. Point `LD_LIBRARY_PATH` at the NCCL installation that matches the version PyTorch expects, for example: `export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/nccl:$LD_LIBRARY_PATH`. Alternatively, install NCCL with `conda install -c conda-forge nccl` so that the environment provides a single, consistent copy.
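
When the training job is started from a Python launcher rather than a shell, the same `export` can be approximated by building the variable for a child process. The sketch below is an assumption-laden illustration: `prepend_library_dir` is a hypothetical helper, the directory is the example path from the step above, and `train.py` is a placeholder script name. The variable is set for a new process because changing `LD_LIBRARY_PATH` inside an already running process does not generally affect libraries that process loads.

```python
import os


def prepend_library_dir(directory, current=None):
    """Return an LD_LIBRARY_PATH value with `directory` placed first.

    Existing entries are kept in order; empty entries and earlier
    occurrences of `directory` are dropped. Hypothetical helper.
    """
    if current is None:
        current = os.environ.get("LD_LIBRARY_PATH", "")
    entries = [e for e in current.split(os.pathsep) if e and e != directory]
    return os.pathsep.join([directory] + entries)


if __name__ == "__main__":
    # Example directory taken from the step above; adjust to the real install.
    nccl_dir = "/usr/lib/x86_64-linux-gnu/nccl"
    env = dict(os.environ, LD_LIBRARY_PATH=prepend_library_dir(nccl_dir))
    print(env["LD_LIBRARY_PATH"])
    # To launch a job with this environment (train.py is a placeholder):
    # import subprocess, sys
    # subprocess.run([sys.executable, "train.py"], env=env, check=True)
```

After relaunching, printing `torch.cuda.nccl.version()` in the job is one way to confirm which NCCL version PyTorch reports, assuming a CUDA-enabled PyTorch build.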

Ineffective attempts

Common but ineffective approaches:

  1. 95% failure rate

    Debugging does not resolve binary incompatibility.

  2. 70% failure rate

    PyTorch bundles its own NCCL, but system paths can override it.