pytorch network_error ai_generated partial

RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: watchdog callback timed out

ID: pytorch/ddp-nccl-timeout-during-allreduce


Fix Rate: 70%

Confidence: 80%

Evidence: 1

First Seen: 2024-09-01

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
pytorch>=1.12.0	active	—	—	—
nccl>=2.14	active	—	—	—

Root Cause

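An NCCL collective operation (e.g., allreduce) did not complete within the process group timeout because at least one rank was slow or unresponsive, often due to network congestion, GPU compute imbalance, or hardware failure. The watchdog then aborted the communicator, which is the error reported on this rank.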

generic


Official Documentation

https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group

Workarounds

80% success Check for GPU compute imbalance by profiling each rank's forward/backward time. Ensure all ranks process similar amounts of data (e.g., use DistributedSampler with drop_last=True).
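The imbalance check above can be sketched roughly as follows. This is a minimal illustration, not the entry's own method: the helper names (`timed`, `find_stragglers`, `gather_mean_step_times`, `build_loader`) and the 1.5x tolerance are assumptions made for the example. The `torch` imports are kept inside the functions that need them so the pure-Python helpers can be tried without a GPU.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(record, key):
    """Append the wall-clock duration of the enclosed block to record[key].

    CUDA kernels run asynchronously, so call torch.cuda.synchronize()
    as the last statement inside the block to get meaningful numbers.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        record.setdefault(key, []).append(time.perf_counter() - start)


def find_stragglers(mean_step_times, tolerance=1.5):
    """Return ranks whose mean step time exceeds tolerance x the (upper) median."""
    ordered = sorted(mean_step_times)
    median = ordered[len(ordered) // 2]
    return [rank for rank, t in enumerate(mean_step_times) if t > tolerance * median]


def gather_mean_step_times(local_mean):
    """Collect every rank's mean step time.

    Assumes the process group is already initialized and, for the NCCL
    backend, that torch.cuda.set_device() was called on each rank.
    """
    import torch.distributed as dist

    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_mean)
    return gathered


def build_loader(dataset, batch_size):
    """Build a loader that gives every rank the same number of full batches.

    Assumes an initialized process group, from which DistributedSampler
    reads the rank and world size.
    """
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    sampler = DistributedSampler(dataset, drop_last=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)
```

In a training loop, one would wrap the forward and backward passes in `with timed(record, "step"):`, average `record["step"]` every few hundred iterations, and pass the gathered means to `find_stragglers`. A rank that is consistently listed is a candidate for the slow rank described in the root cause.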
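75% success Increase the collective timeout to a higher value (e.g., 600 seconds) to accommodate slow networks or large models. In PyTorch this is normally set through the `timeout` argument of `torch.distributed.init_process_group` rather than through an NCCL environment variable.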
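One way to raise the timeout is sketched below, using the `timeout` argument of `torch.distributed.init_process_group` (the function linked under Official Documentation). The environment variable name `DDP_TIMEOUT_SECONDS` and both helper functions are hypothetical, introduced only for this example. The sketch assumes the job is launched with `torchrun`, so the rank and rendezvous settings are already in the environment.

```python
import os
from datetime import timedelta


def collective_timeout(default_seconds=600):
    """Return the timeout to use for collectives.

    DDP_TIMEOUT_SECONDS is a hypothetical variable name chosen for this
    sketch; it is not read by PyTorch or NCCL themselves.
    """
    seconds = int(os.environ.get("DDP_TIMEOUT_SECONDS", default_seconds))
    return timedelta(seconds=seconds)


def init_distributed():
    """Initialize the NCCL process group with the configured timeout.

    Assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set by the
    launcher (for example torchrun).
    """
    import torch.distributed as dist

    dist.init_process_group(backend="nccl", timeout=collective_timeout())
```

The default timeout differs between PyTorch versions, so it is worth checking the documented default for the installed release before picking a value; a setting at or below the default will not change anything.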
85% success Set the NCCL_DEBUG=INFO environment variable to get detailed NCCL logs and identify the slow rank or network issue.
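The logging setup above might look like the following when done from Python instead of the shell. The helper `enable_nccl_debug` and the log file name are assumptions for illustration; `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_DEBUG_FILE` are NCCL environment variables. The variables must be set before the process group is created, since NCCL reads them at initialization.

```python
import os


def enable_nccl_debug(env=os.environ, log_dir=None):
    """Set NCCL logging variables; call before init_process_group().

    env defaults to the real process environment, but any dict-like
    object can be passed, which makes the helper easy to inspect.
    """
    env["NCCL_DEBUG"] = "INFO"
    # Limit output to initialization and network messages unless the
    # caller already chose other subsystems (this choice is an assumption).
    env.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
    if log_dir is not None:
        # %h and %p are expanded by NCCL to the hostname and process id,
        # so each rank writes to its own file.
        env["NCCL_DEBUG_FILE"] = os.path.join(log_dir, "nccl.%h.%p.log")
    return env
```

Writing one file per rank makes it easier to compare the last logged activity on each rank and spot the one that stopped making progress first.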


Dead Ends

Common approaches that don't work:

60% fail Increasing the timeout on its own: the timeout is a symptom, not the root cause, so a larger value delays the failure but doesn't prevent it.
70% fail Adding extra barriers: barriers make every rank wait for the slowest one, which can make a timeout more likely.
80% fail Restarting the job without other changes: the root cause (e.g., network latency) persists across restarts.