pytorch
network_error
ai_generated
partial
RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: watchdog callback timed out
ID: pytorch/ddp-nccl-timeout-during-allreduce
70% Fix Rate
80% Confidence
1 Evidence
2024-09-01 First Seen
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| pytorch>=1.12.0 | active | — | — | — |
| nccl>=2.14 | active | — | — | — |
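To check whether an environment falls inside the listed ranges, the installed versions can be compared against the table. The following is a minimal sketch; `parse_version`, `in_listed_range`, and `installed_versions` are hypothetical helper names, and it assumes `torch.cuda.nccl.version()` returns a tuple, as it does in recent PyTorch releases.

```python
import re


def parse_version(text):
    """Extract leading numeric components, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def in_listed_range(torch_version, nccl_version):
    """True if both versions fall in the ranges listed in the table above."""
    torch_ok = parse_version(torch_version) >= (1, 12, 0)
    nccl_ok = tuple(nccl_version) >= (2, 14)
    return torch_ok and nccl_ok


def installed_versions():
    """Return (torch version string, NCCL version tuple) for this environment."""
    import torch  # imported lazily so the helpers above work without PyTorch

    # Assumption: torch.cuda.nccl.version() returns a tuple such as (2, 18, 3);
    # very old PyTorch releases returned a single integer instead.
    return torch.__version__, torch.cuda.nccl.version()
```

On a machine with PyTorch installed, `in_listed_range(*installed_versions())` reports whether this entry applies.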
Root Cause
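An NCCL collective operation (e.g., allreduce) timed out because at least one rank was slow or unresponsive, often due to network congestion, GPU compute imbalance, or hardware failure. The NCCL watchdog then aborted the communicator on the rank that reported the error (rank 2 in the message above).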
Official Documentation
https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
Workarounds
- 80% success: Check for GPU compute imbalance by profiling each rank's forward/backward time. Ensure all ranks process similar amounts of data (e.g., use DistributedSampler with drop_last=True).
- 75% success: Increase the collective timeout to a higher value (e.g., 600 seconds) to accommodate slow networks or large models. In PyTorch the timeout is set with the `timeout` argument of `torch.distributed.init_process_group` (a `datetime.timedelta`), not with a dedicated NCCL environment variable.
- 85% success: Set the NCCL_DEBUG=INFO environment variable to get detailed debug logs and identify the slow rank or network issue.
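The timeout and logging workarounds can both be applied when the process group is created. The sketch below is a minimal illustration, assuming the job is started by a launcher such as `torchrun` (so rank and world size come from the environment) and that `NCCL_DEBUG` is set before the first NCCL communicator is created. The helper names `nccl_debug_env`, `build_init_kwargs`, and `init_distributed` are hypothetical and not part of PyTorch.

```python
import datetime
import os


def nccl_debug_env(base_env):
    """Return a copy of base_env with NCCL_DEBUG=INFO unless it is already set."""
    env = dict(base_env)
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


def build_init_kwargs(timeout_seconds=600):
    """Keyword arguments for torch.distributed.init_process_group."""
    return {
        "backend": "nccl",
        "timeout": datetime.timedelta(seconds=timeout_seconds),
    }


def init_distributed(timeout_seconds=600):
    """Initialize the default process group with an explicit collective timeout.

    Assumes the launcher has set RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    """
    import torch.distributed as dist  # imported lazily; helpers above need no GPU

    # Enable NCCL logging before any communicator is created.
    os.environ.update(nccl_debug_env(os.environ))
    dist.init_process_group(**build_init_kwargs(timeout_seconds))
```

Whether 600 seconds is actually an increase depends on the default of the installed PyTorch version, so it is worth checking the documentation linked above before relying on this value.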
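For the compute-imbalance workaround, one possible approach is to time a step on every rank, compare the results, and build the data loader so that each rank sees the same number of batches. The sketch below rests on several assumptions: `find_slow_ranks`, `gather_step_times`, and `make_balanced_loader` are hypothetical helpers, the 1.5x-median threshold is an arbitrary choice, and the two PyTorch functions assume the process group is already initialized and each rank has selected its GPU.

```python
import statistics
import time


def find_slow_ranks(step_times, tolerance=1.5):
    """Return ranks whose step time exceeds tolerance x the median step time."""
    median = statistics.median(step_times)
    return [rank for rank, t in enumerate(step_times) if t > tolerance * median]


def gather_step_times(step_fn):
    """Time one call of step_fn on this rank and collect the times of all ranks."""
    import torch
    import torch.distributed as dist

    torch.cuda.synchronize()  # finish pending GPU work so it is not counted
    start = time.perf_counter()
    step_fn()  # e.g. forward + backward on one batch
    torch.cuda.synchronize()  # wait for the kernels launched by step_fn
    elapsed = time.perf_counter() - start

    times = [None] * dist.get_world_size()
    dist.all_gather_object(times, elapsed)  # times[i] is rank i's elapsed time
    return times


def make_balanced_loader(dataset, batch_size):
    """Build a DataLoader that gives every rank the same number of full batches."""
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # drop_last on the sampler trims the dataset tail instead of padding it;
    # drop_last on the loader discards a final, smaller batch.
    sampler = DistributedSampler(dataset, drop_last=True)
    # Call sampler.set_epoch(epoch) at the start of each epoch when shuffling.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)
```

If `step_fn` includes DDP's gradient all-reduce, fast ranks wait for the slowest one and the measured times converge. Timing under DDP's `no_sync()` context manager, or timing only the forward pass, gives a clearer picture of which rank is slow.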
Dead Ends
Common approaches that don't work:
- 60% fail: Only raising the timeout. The timeout is a symptom, not the root cause; increasing it delays failure but doesn't prevent it.
- 70% fail: Adding extra barriers. Barriers force all ranks to wait for the slow one, which can make a timeout more likely rather than less.
- 80% fail: Restarting the job without other changes. The root cause (e.g., network latency) persists across restarts.