ncclSystemError
cuda
config_error
ai_generated
true
RuntimeError: NCCL error: version mismatch, expected 2.18.5 but got 2.19.1
ID: cuda/nccl-version-mismatch
Fix rate: 90%
Confidence: 87%
Evidence count: 1
First seen: 2024-03-20
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| NCCL 2.18.5 | active | — | — | — |
| NCCL 2.19.1 | active | — | — | — |
| PyTorch 2.1.0 | active | — | — | — |
Root cause analysis
The NCCL library loaded at runtime differs from the version PyTorch expects, usually because multiple NCCL installations are present or `LD_LIBRARY_PATH` points at the wrong one.
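A quick way to check for the "multiple installations" case is to list every `libnccl.so*` file reachable through the library search path. The sketch below is only an illustration: the helper name `find_nccl_libraries` is hypothetical, and it inspects just the directories named in `LD_LIBRARY_PATH`, not the dynamic loader cache or a copy bundled inside a PyTorch wheel.

```python
import os
from pathlib import Path


def find_nccl_libraries(search_path):
    """Return the resolved paths of all libnccl.so* files in a search path.

    `search_path` is a string in LD_LIBRARY_PATH format (directories joined
    by os.pathsep). Symlinks are resolved so that libnccl.so and
    libnccl.so.2 pointing at the same file are reported only once.
    """
    found = []
    seen = set()
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty components such as a trailing separator
        directory = Path(entry)
        if not directory.is_dir():
            continue  # ignore stale or mistyped directories
        for candidate in sorted(directory.glob("libnccl.so*")):
            resolved = str(candidate.resolve())
            if resolved not in seen:
                seen.add(resolved)
                found.append(resolved)
    return found


if __name__ == "__main__":
    # Directories are scanned in search order; more than one distinct
    # file here suggests competing NCCL installations.
    for path in find_nccl_libraries(os.environ.get("LD_LIBRARY_PATH", "")):
        print(path)
```

If this prints more than one distinct library, the first directory in the search order is the likely source of the unexpected version.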
Official documentation
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html
Solutions
- Set the environment variable `LD_LIBRARY_PATH` so that it points to the correct NCCL installation, for example: `export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/nccl:$LD_LIBRARY_PATH`. Alternatively, use `conda install -c conda-forge nccl` to keep the NCCL version consistent across the environment.
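After changing `LD_LIBRARY_PATH`, it can help to confirm which NCCL file the process actually mapped. The following is a minimal sketch for Linux only, assuming the PyTorch build links NCCL dynamically (a statically linked build will show no `libnccl` entry). The helper name `loaded_nccl_paths` is hypothetical; it parses text in the `/proc/<pid>/maps` format.

```python
import os


def loaded_nccl_paths(maps_text):
    """Extract distinct libnccl file paths from /proc/<pid>/maps content.

    Each maps line has the form:
        address perms offset dev inode pathname
    Anonymous mappings have no pathname and are skipped.
    """
    paths = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # no pathname column on this line
        path = fields[5].strip()
        if "libnccl" in os.path.basename(path) and path not in paths:
            paths.append(path)
    return paths


if __name__ == "__main__":
    # Run this inside the training process, after the framework has been
    # imported, so that its shared libraries are already mapped.
    with open("/proc/self/maps") as handle:
        for path in loaded_nccl_paths(handle.read()):
            print(path)
```

A path outside the expected installation indicates that another directory still takes precedence in the search order.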
Ineffective attempts
Common but ineffective approaches:
- 95% failure rate: Debugging does not resolve a binary incompatibility.
- 70% failure rate: PyTorch bundles its own NCCL, but system paths can override it.