pytorch
network_error
ai_generated
partial
RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: watchdog callback timed out
ID: pytorch/ddp-nccl-timeout-during-allreduce
Fix rate: 70%
Confidence: 80%
Evidence count: 1
First seen: 2024-09-01
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| pytorch>=1.12.0 | active | — | — | — |
| nccl>=2.14 | active | — | — | — |
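To confirm that an environment falls inside the ranges in the table, the version strings can be compared against the listed minimums. The sketch below is illustrative only: `parse_version`, `meets_minimum`, `MIN_TORCH`, and `MIN_NCCL` are hypothetical names introduced here, and the assumption that `torch.__version__` and `torch.cuda.nccl.version()` supply the runtime values should be checked against the installed PyTorch build.

```python
import re

# Minimums copied from the compatibility table above.
MIN_TORCH = (1, 12, 0)
MIN_NCCL = (2, 14)


def parse_version(text):
    """Turn a string such as '2.1.0+cu121' into (2, 1, 0).

    Only the leading dotted-number prefix is used; local or build
    suffixes after it are ignored.
    """
    match = re.match(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(version), len(minimum))

    def pad(value):
        return tuple(value) + (0,) * (width - len(value))

    return pad(version) >= pad(minimum)


# Assumed usage inside a training environment (not executed here):
#   import torch
#   meets_minimum(parse_version(torch.__version__), MIN_TORCH)
#   meets_minimum(torch.cuda.nccl.version(), MIN_NCCL)
```

The helpers use only the standard library, so they can run in a launcher script before PyTorch is imported.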
Root cause analysis
An NCCL collective operation (e.g., allreduce) timed out because one rank was slow or unresponsive, often due to network congestion, GPU compute imbalance, or hardware failure.
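The effect of a single slow rank can be illustrated with a toy model. This is a simplified sketch, not NCCL's actual watchdog logic: it assumes a collective can only finish once every rank has entered it, and that the operation is aborted when the gap between the first and the last arrival exceeds the timeout. The function name `collective_outcome` and the example timings are hypothetical.

```python
def collective_outcome(arrival_seconds, timeout_seconds):
    """Model one collective call under the simplifying assumptions above.

    arrival_seconds maps rank -> time at which that rank enters the
    collective. Returns ("ok", finish_time) when all ranks arrive within
    the timeout window, or ("timeout", slowest_rank) otherwise.
    """
    first_arrival = min(arrival_seconds.values())
    slowest_rank = max(arrival_seconds, key=arrival_seconds.get)
    wait = arrival_seconds[slowest_rank] - first_arrival
    if wait > timeout_seconds:
        return ("timeout", slowest_rank)
    return ("ok", arrival_seconds[slowest_rank])


# Three ranks arrive within about a second; rank 2 is stalled.
print(collective_outcome({0: 1.0, 1: 1.2, 2: 700.0, 3: 1.1}, timeout_seconds=600))
```

In this model the fast ranks do nothing wrong; they simply wait until the timeout expires, which is why the reported rank in the error message is not necessarily the rank that caused the stall.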
Official documentation
https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
Solutions
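- Profile the forward and backward pass time on each rank to check for GPU compute imbalance. Make sure every rank processes a similar amount of data (for example, use DistributedSampler with drop_last=True).
- Raise the NCCL timeout (for example, to 600 seconds) to accommodate slow networks or large models. In PyTorch, this is the timeout argument of torch.distributed.init_process_group.
- Set the NCCL_DEBUG=INFO environment variable to get detailed debug logs and identify the slow rank or network problems.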
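The three steps above can be sketched with standard-library helpers. This is a minimal illustration under stated assumptions: `find_slow_ranks`, `nccl_init_kwargs`, and `enable_nccl_debug` are hypothetical helper names, the 1.5x-median threshold is an arbitrary choice, and the per-rank step times are assumed to have been gathered separately (for example, by timing each training step on every rank).

```python
import os
from datetime import timedelta
from statistics import median


def find_slow_ranks(step_seconds, tolerance=1.5):
    """Return ranks whose step time exceeds tolerance x the median time.

    step_seconds maps rank -> measured forward+backward time in seconds.
    """
    typical = median(step_seconds.values())
    return sorted(
        rank for rank, seconds in step_seconds.items()
        if seconds > tolerance * typical
    )


def nccl_init_kwargs(timeout_seconds=600):
    """Build keyword arguments meant for torch.distributed.init_process_group."""
    return {"backend": "nccl", "timeout": timedelta(seconds=timeout_seconds)}


def enable_nccl_debug(env=None):
    """Set NCCL_DEBUG=INFO unless the caller has already chosen a level."""
    env = os.environ if env is None else env
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


# Assumed usage in a training script (not executed here):
#   enable_nccl_debug()  # before the process group is created
#   torch.distributed.init_process_group(**nccl_init_kwargs(600))
#   sampler = DistributedSampler(dataset, drop_last=True)
print(find_slow_ranks({0: 0.50, 1: 0.52, 2: 1.40, 3: 0.49}))
```

NCCL_DEBUG has to be set before NCCL initializes, so the environment helper should run before the process group is created. With drop_last=True, DistributedSampler drops the tail of the dataset instead of padding it, so every rank still receives the same number of samples.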
Ineffective attempts
Common but ineffective approaches:
- 60% failure rate: The timeout is a symptom, not the root cause; increasing it delays failure but doesn't prevent it.
- 70% failure rate: Barriers can cause all ranks to wait for the slow one, potentially increasing the timeout likelihood.
- 80% failure rate: The root cause (e.g., network latency) persists across restarts.