pytorch
network_error
ai_generated: true
RuntimeError: [torch.distributed] Barrier timeout after 600000 ms
ID: pytorch/distributed-barrier-timeout
Fix Rate: 80%
Confidence: 88%
Evidence: 1
First Seen: 2023-05-20
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| torch>=1.13.0 | active | — | — | — |
| NCCL>=2.14 | active | — | — | — |
Root Cause
At least one rank in a distributed training job failed to reach the barrier within the timeout (600000 ms, i.e. 10 minutes, in the message above), usually because of a deadlock, a slow rank, or a network partition.
Official Documentation
https://pytorch.org/docs/stable/distributed.html
Workarounds
- 80% success: Check for uneven data loading across ranks. Use torch.utils.data.DistributedSampler with drop_last=True to ensure all ranks have the same number of batches.
- 70% success: Increase the NCCL timeout and add logging to identify which rank is slow.
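As a rough illustration of the first workaround, the sketch below counts how many batches each rank would see when DistributedSampler is built with drop_last=True. The helper name batches_per_rank and the dataset size, world size, and batch size are hypothetical values chosen for the example. Passing num_replicas and rank explicitly is only a convenience so the check can run in a single process without a process group.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def batches_per_rank(num_samples, world_size, batch_size):
    """Return the number of batches each rank would iterate over.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    dataset = TensorDataset(torch.arange(num_samples))
    counts = []
    for rank in range(world_size):
        # Explicit num_replicas/rank means no initialised process group
        # is required, so every rank can be simulated in one process.
        sampler = DistributedSampler(
            dataset,
            num_replicas=world_size,
            rank=rank,
            shuffle=False,
            drop_last=True,  # drop the tail so every rank gets the same sample count
        )
        loader = DataLoader(
            dataset,
            batch_size=batch_size,
            sampler=sampler,
            drop_last=True,  # also drop a trailing partial batch
        )
        counts.append(len(loader))
    return counts


# 103 samples do not divide evenly across 4 ranks (assumed numbers).
counts = batches_per_rank(num_samples=103, world_size=4, batch_size=8)
print(counts)
```

If the printed counts are identical, every rank runs the same number of iterations and reaches collective calls the same number of times. In a real training loop the sampler would normally be created without explicit num_replicas and rank, and sampler.set_epoch(epoch) would be called at the start of each epoch when shuffling is enabled.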
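For the second workaround, a minimal sketch might look like the following. It assumes the job is launched with torchrun (so the rendezvous environment variables are already set) and that the timeout is raised through the timeout argument of torch.distributed.init_process_group. The helper names init_with_timeout and logged_barrier, the logger name, and the 30-minute value are illustrative assumptions, not taken from the original entry.

```python
import logging
import time
from datetime import timedelta

import torch
import torch.distributed as dist

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("barrier_debug")  # hypothetical logger name


def init_with_timeout(minutes=30):
    """Create the default process group with a longer collective timeout.

    Assumes torchrun has set RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    The 30-minute default here is an arbitrary example value.
    """
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, timeout=timedelta(minutes=minutes))


def logged_barrier(tag):
    """Run dist.barrier() and log how long this rank waited inside it."""
    rank = dist.get_rank()
    log.info("rank %d entering barrier '%s'", rank, tag)
    start = time.monotonic()
    dist.barrier()
    waited = time.monotonic() - start
    log.info("rank %d left barrier '%s' after %.2f s", rank, tag, waited)
    return waited


# Typical use inside a training script (not executed here):
#   init_with_timeout()
#   ... train one epoch ...
#   logged_barrier("epoch_end")
```

When reading the logs, the rank that reports the shortest wait is the one that arrived last, and a rank that never logs "entering" is the one stuck before the barrier. The longer timeout only buys time to collect this information; it is not a fix in itself.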
Dead Ends
Common approaches that don't work:
- 80% fail: Simply increasing the timeout does not fix the root cause, such as a deadlock or imbalanced workload.
- 90% fail: Killing and restarting all processes without addressing the imbalance or network issue will result in the same error.