pytorch network_error ai_generated partial

RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: watchdog callback timed out

ID: pytorch/ddp-nccl-timeout-during-allreduce

Fix rate: 70%

Confidence: 80%

Evidence count: 1

First seen: 2024-09-01

Version compatibility

Version	Status	Introduced	Deprecated	Notes
pytorch>=1.12.0	active	—	—	—
nccl>=2.14	active	—	—	—

Root cause analysis

An NCCL collective operation (e.g., allreduce) timed out because one rank was slow or unresponsive, often due to network congestion, GPU compute imbalance, or hardware failure.

generic

Official documentation

https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group

Solutions

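Check for GPU compute imbalance by profiling the forward and backward pass time on each rank. Make sure every rank processes a similar amount of data (for example, use DistributedSampler with drop_last=True).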
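The check above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names (find_stragglers, gather_step_times, build_balanced_loader) and the 1.5x-median threshold are assumptions made for this example. The torch-based helpers assume the process group is already initialized and, for the NCCL backend, that each process has selected its GPU with torch.cuda.set_device.

```python
import statistics


def find_stragglers(step_times, tolerance=1.5):
    """Return the ranks whose step time exceeds tolerance x the median.

    step_times maps rank -> seconds for one forward/backward step.
    The 1.5x threshold is an arbitrary illustrative default.
    """
    median = statistics.median(step_times.values())
    return [rank for rank, t in sorted(step_times.items()) if t > tolerance * median]


def gather_step_times(local_seconds):
    """Collect every rank's measured step time into a {rank: seconds} dict.

    Assumes torch.distributed is initialized. With the NCCL backend the
    current CUDA device must be set before calling all_gather_object.
    """
    import torch.distributed as dist

    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_seconds)
    return dict(enumerate(gathered))


def build_balanced_loader(dataset, batch_size):
    """Build a DataLoader in which every rank sees the same number of samples."""
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # drop_last=True on the sampler drops the tail of the dataset so that it
    # splits evenly across ranks instead of padding with repeated samples.
    sampler = DistributedSampler(dataset, drop_last=True)
    # drop_last=True on the loader also discards a final short batch.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)
```

For example, find_stragglers({0: 1.0, 1: 1.1, 2: 3.0, 3: 0.9}) flags rank 2. When timing GPU work, call torch.cuda.synchronize() before reading the clock, since CUDA kernels launch asynchronously, and call sampler.set_epoch(epoch) each epoch if shuffling is enabled.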

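Increase the NCCL collective timeout to a higher value (for example, 600 seconds) to accommodate slow networks or large models. In PyTorch this is normally set through the timeout argument of torch.distributed.init_process_group.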
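One way to raise the timeout is sketched below, under the assumption that the job is launched with torchrun (or otherwise provides RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT). The helper names nccl_init_kwargs and init_distributed are hypothetical. The timeout argument of init_process_group takes a datetime.timedelta.

```python
from datetime import timedelta


def nccl_init_kwargs(timeout_seconds=600):
    """Build keyword arguments for torch.distributed.init_process_group.

    600 seconds mirrors the example value in the text above; choose a value
    that fits the slowest expected collective in your workload.
    """
    return {"backend": "nccl", "timeout": timedelta(seconds=timeout_seconds)}


def init_distributed(timeout_seconds=600):
    """Initialize the default process group with a longer collective timeout.

    Assumes the launcher has exported the usual rendezvous variables.
    """
    import torch.distributed as dist

    dist.init_process_group(**nccl_init_kwargs(timeout_seconds))
```

A longer timeout only gives a slow rank more time to catch up. If a rank has hung, the job will still fail, just later.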

Set the NCCL_DEBUG=INFO environment variable to get detailed debug logs and identify the slow rank or a network problem.
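A minimal sketch of enabling the logs from inside the training script, assuming it runs before the process group is created (NCCL reads these variables when it initializes). The helper name enable_nccl_debug is hypothetical, and the NCCL_DEBUG_SUBSYS filter is an optional addition that the original entry does not mention.

```python
import os


def enable_nccl_debug(env=None, subsystems="INIT,NET"):
    """Turn on verbose NCCL logging without overriding values already set.

    env defaults to os.environ; a plain dict can be passed for testing.
    subsystems narrows the output via NCCL_DEBUG_SUBSYS (assumed useful
    here for spotting network issues); pass None to leave it untouched.
    """
    if env is None:
        env = os.environ
    env.setdefault("NCCL_DEBUG", "INFO")
    if subsystems is not None:
        env.setdefault("NCCL_DEBUG_SUBSYS", subsystems)
    return env
```

The same effect can be had by exporting the variables in the shell before launching the job, which avoids any ordering concerns inside the script.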

Ineffective attempts

Common but ineffective approaches:

60% failure rate
The timeout is a symptom, not the root cause; increasing it delays failure but doesn't prevent it.
70% failure rate
Barriers can cause all ranks to wait for the slow one, potentially increasing the timeout likelihood.
80% failure rate
The root cause (e.g., network latency) persists across restarts.