ncclSystemError cuda config_error ai_generated true

RuntimeError: NCCL error: version mismatch, expected 2.18.5 but got 2.19.1

ID: cuda/nccl-version-mismatch

Fix Rate: 90%
Confidence: 87%
Evidence: 1
First Seen: 2024-03-20

Version Compatibility

| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| NCCL 2.18.5 | active | | | |
| NCCL 2.19.1 | active | | | |
| PyTorch 2.1.0 | active | | | |

Root Cause

The NCCL library loaded at runtime (2.19.1 in the error above) is not the version PyTorch expects (2.18.5). This usually happens when several NCCL installations are present on the machine, or when `LD_LIBRARY_PATH` points at the wrong one.
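To check whether more than one NCCL installation is visible, the directories on `LD_LIBRARY_PATH` can be scanned for `libnccl` shared objects. The following is a minimal sketch using only the Python standard library; the function name `find_nccl_libraries` is hypothetical, and it only looks at `LD_LIBRARY_PATH`, not at other locations the loader may search.

```python
import os
from pathlib import Path


def find_nccl_libraries(ld_library_path=None):
    """Return libnccl shared objects found on an LD_LIBRARY_PATH-style string.

    Results keep the search order of the directories, so entries from
    earlier directories are listed first.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")

    found = []
    seen_dirs = set()
    for entry in ld_library_path.split(os.pathsep):
        if not entry or entry in seen_dirs:
            continue  # ignore empty segments and repeated directories
        seen_dirs.add(entry)
        directory = Path(entry)
        if not directory.is_dir():
            continue
        # Matches names such as libnccl.so, libnccl.so.2, libnccl.so.2.19.1
        for lib in sorted(directory.glob("libnccl.so*")):
            found.append(str(lib))
    return found


if __name__ == "__main__":
    for path in find_nccl_libraries():
        print(path)
```

If this prints libraries from more than one directory, the first directory listed is a likely source of the mismatch. Where PyTorch is installed, `torch.cuda.nccl.version()` can be compared against the file names found here.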

generic

Official Documentation

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html

Workarounds

  1. 90% success: Set `LD_LIBRARY_PATH` so that the correct NCCL installation is found first. For example: `export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/nccl:$LD_LIBRARY_PATH`. Alternatively, install NCCL with `conda install -c conda-forge nccl` so the environment uses a single, consistent version.
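When training is launched from a Python wrapper rather than a shell, the same fix can be applied by building the child process environment explicitly. The dynamic loader typically reads `LD_LIBRARY_PATH` when a process starts, so the variable has to be set before the training process is launched, not from inside it. The sketch below is one way to do that; the helper name `env_with_nccl_first` is hypothetical.

```python
import os


def env_with_nccl_first(nccl_dir, base_env=None):
    """Return a copy of the environment with nccl_dir first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    # Drop any earlier occurrence so the directory appears exactly once, in front.
    existing = [p for p in existing if p != nccl_dir]
    env["LD_LIBRARY_PATH"] = os.pathsep.join([nccl_dir, *existing])
    return env
```

A launcher could then call `subprocess.run([sys.executable, "train.py"], env=env_with_nccl_first("/usr/lib/x86_64-linux-gnu/nccl"))`, where `train.py` stands in for the actual training script and the directory is the example path from the workaround above.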

Dead Ends

Common approaches that don't work:

  1. 95% fail: Debugging alone does not resolve a binary incompatibility between NCCL versions.

  2. 70% fail: PyTorch bundles its own NCCL, but libraries on system paths can override it.