pytorch
network_error
ai_generated
true
RuntimeError: [torch.distributed] Barrier timeout after 600000 ms
ID: pytorch/distributed-barrier-timeout
80% fix rate
88% confidence
1 evidence item
First seen: 2023-05-20
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| torch>=1.13.0 | active | — | — | — |
| NCCL>=2.14 | active | — | — | — |
Root cause analysis
A rank in a distributed training setup failed to reach the barrier within the timeout (600000 ms, i.e. 10 minutes), usually because of a deadlock, a slow rank, or a network partition.
Official documentation
https://pytorch.org/docs/stable/distributed.html
Solutions
- Check for uneven data loading across ranks. Use torch.utils.data.DistributedSampler with drop_last=True so that all ranks process the same number of batches.
- Increase the timeout (the timeout argument of torch.distributed.init_process_group) and add per-rank logging to identify which rank is slow.
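Both steps above can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: the helper names `batches_per_rank`, `init_with_timeout`, and `logged_barrier` are hypothetical, the 30-minute timeout is an arbitrary example value, and `init_with_timeout` assumes a launcher such as torchrun has already set the usual rendezvous environment variables. The batch-count check passes `num_replicas` and `rank` explicitly, so it runs in a single process without an initialized process group.

```python
import datetime
import logging
import time

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

logging.basicConfig(level=logging.INFO)


def batches_per_rank(num_samples, world_size, batch_size):
    """Hypothetical helper: count the batches each rank would iterate over."""
    dataset = TensorDataset(torch.arange(num_samples))
    counts = []
    for rank in range(world_size):
        # drop_last=True on the sampler drops the tail samples that do not
        # divide evenly across ranks instead of padding with repeats.
        sampler = DistributedSampler(
            dataset, num_replicas=world_size, rank=rank,
            shuffle=False, drop_last=True,
        )
        # drop_last=True on the loader drops each rank's final partial batch.
        loader = DataLoader(dataset, batch_size=batch_size,
                            sampler=sampler, drop_last=True)
        counts.append(len(loader))
    return counts


def init_with_timeout(minutes=30):
    """Hypothetical helper: initialize NCCL with a longer collective timeout.

    Assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are already set.
    """
    dist.init_process_group(
        backend="nccl", timeout=datetime.timedelta(minutes=minutes)
    )


def logged_barrier(tag):
    """Hypothetical helper: log per-rank entry and exit around a barrier."""
    rank = dist.get_rank()
    start = time.monotonic()
    logging.info("rank %d entering barrier '%s'", rank, tag)
    dist.barrier()
    logging.info("rank %d left barrier '%s' after %.1f s",
                 rank, tag, time.monotonic() - start)


# Single-process check: every rank should report the same batch count.
counts = batches_per_rank(num_samples=103, world_size=4, batch_size=8)
print(counts)
```

If the printed counts differ between ranks, the data pipeline is a likely source of the barrier mismatch. In a real job, a rank that logs "entering" much later than the others, or never logs it, is the one to investigate.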
Attempts that do not work
Common but ineffective approaches:
- 80% failure rate: Simply increasing the timeout does not fix the root cause, such as a deadlock or an imbalanced workload.
- 90% failure rate: Killing and restarting all processes without addressing the imbalance or network issue results in the same error.