pytorch network_error ai_generated partial

RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: watchdog callback timed out

ID: pytorch/ddp-nccl-timeout-during-allreduce

Fix rate: 70%
Confidence: 80%
Evidence count: 1
First seen: 2024-09-01

Version compatibility

Version            Status   Introduced   Deprecated   Notes
pytorch>=1.12.0    active
nccl>=2.14         active

Root cause analysis

An NCCL collective operation (e.g., allreduce) timed out because one rank was slow or unresponsive. The other ranks block inside the collective until the watchdog aborts the communicator. Common causes are network congestion, GPU compute imbalance across ranks, or hardware failure.
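To make the "one slow rank" diagnosis concrete, here is a minimal sketch that flags stragglers from per-rank step timings. It assumes the timings have already been collected on each rank (for example with `time.perf_counter()` around the forward/backward pass after a `torch.cuda.synchronize()`) and gathered into one dictionary. The function name `find_stragglers`, the sample numbers, and the 1.5x threshold are illustrative assumptions, not part of PyTorch or NCCL.

```python
from statistics import median


def find_stragglers(step_seconds_by_rank, ratio=1.5):
    """Return ranks whose mean step time exceeds `ratio` times the median rank mean.

    `step_seconds_by_rank` maps rank -> list of measured step durations in seconds.
    Both the name and the default threshold are hypothetical choices for this sketch.
    """
    # Mean step time per rank, skipping ranks that reported no measurements.
    means = {
        rank: sum(samples) / len(samples)
        for rank, samples in step_seconds_by_rank.items()
        if samples
    }
    if not means:
        return []
    # The median is used as the baseline so one slow rank does not skew it.
    baseline = median(means.values())
    return sorted(rank for rank, mean in means.items() if mean > ratio * baseline)


# Made-up timings in which rank 2 is roughly three times slower than its peers.
timings = {
    0: [0.50, 0.52, 0.51],
    1: [0.49, 0.50, 0.53],
    2: [1.40, 1.35, 1.50],
    3: [0.51, 0.50, 0.52],
}
print(find_stragglers(timings))  # prints [2]
```

A rank that shows up here consistently is worth checking for uneven batch sizes, a degraded GPU, or a slow network link before touching any timeout setting.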

generic

Official documentation

https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group

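Solutions

  1. Check for GPU compute imbalance by profiling the forward and backward pass time on each rank. Make sure every rank processes a similar amount of data (for example, use DistributedSampler with drop_last=True).
  2. Raise the collective timeout (for example to 600 seconds) to accommodate slow networks or large models. In PyTorch this is normally set through the `timeout` argument of `torch.distributed.init_process_group` rather than through an NCCL environment variable.
  3. Set the NCCL_DEBUG=INFO environment variable to get detailed debug logs and identify the slow rank or the network problem.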
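The three steps above could be wired together roughly as follows. This is a sketch under stated assumptions: the job is launched with `torchrun` (so rank and world size come from the environment), and the helper names `nccl_debug_env`, `process_group_kwargs`, and `setup_distributed` are hypothetical. Only `init_process_group`, its `timeout` argument, `DistributedSampler(..., drop_last=True)`, and the `NCCL_DEBUG` variable come from PyTorch and NCCL themselves.

```python
import os
from datetime import timedelta


def nccl_debug_env(level="INFO"):
    """Environment variables to set before the process group is created."""
    return {"NCCL_DEBUG": level}


def process_group_kwargs(timeout_seconds=600):
    """Keyword arguments for init_process_group with a longer collective timeout."""
    return {"backend": "nccl", "timeout": timedelta(seconds=timeout_seconds)}


def setup_distributed(dataset, timeout_seconds=600):
    """Initialise NCCL with verbose logging and return an evenly sharded sampler.

    torch is imported lazily so the two helpers above stay usable on a
    machine without PyTorch or GPUs.
    """
    import torch.distributed as dist
    from torch.utils.data.distributed import DistributedSampler

    # Respect values the user already exported; only fill in missing ones.
    for key, value in nccl_debug_env().items():
        os.environ.setdefault(key, value)

    # Rank and world size are read from the environment (assumes torchrun).
    dist.init_process_group(**process_group_kwargs(timeout_seconds))

    # drop_last=True drops the tail of the dataset instead of padding it,
    # so every rank receives the same number of real samples.
    return DistributedSampler(dataset, drop_last=True)
```

The returned sampler would then be passed to a `DataLoader` through its `sampler` argument, with `sampler.set_epoch(epoch)` called at the start of each epoch so shuffling differs between epochs.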

Ineffective attempts

Common but ineffective approaches:

  1. 60% failure rate

    The timeout is a symptom, not the root cause; increasing it delays failure but doesn't prevent it.

  2. 70% failure rate

    Barriers can cause all ranks to wait for the slow one, potentially increasing the timeout likelihood.

  3. 80% failure rate

    The root cause (e.g., network latency) persists across restarts.