pytorch network_error ai_generated true

RuntimeError: [torch.distributed] Barrier timeout after 600000 ms

ID: pytorch/distributed-barrier-timeout


Fix rate: 80%

Confidence: 88%

Evidence count: 1

First seen: 2023-05-20

Version compatibility

Version	Status	Introduced	Deprecated	Notes
torch>=1.13.0	active	—	—	—
NCCL>=2.14	active	—	—	—

Root cause analysis

A rank in a distributed training setup failed to reach the barrier within the timeout, usually due to a deadlock, a slow rank, or a network partition.
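
The deadlock case can be illustrated without a GPU cluster. The sketch below is only an analogy: it simulates ranks as threads and uses Python's threading.Barrier in place of torch.distributed.barrier. The names run_ranks and skip_rank are hypothetical. It shows that when one rank takes a code path that never reaches the barrier, the ranks that did arrive are the ones that report the timeout.

```python
import threading


def run_ranks(world_size, skip_rank=None, timeout_s=0.5):
    """Simulate ranks as threads and return the ranks whose barrier wait failed."""
    barrier = threading.Barrier(world_size)
    failed = []
    lock = threading.Lock()

    def worker(rank):
        if rank == skip_rank:
            # This simulated rank follows a branch that never calls the barrier.
            return
        try:
            barrier.wait(timeout=timeout_s)
        except threading.BrokenBarrierError:
            # Raised in every waiting thread once one of the waits times out.
            with lock:
                failed.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(failed)


healthy = run_ranks(4)               # every rank reaches the barrier
stuck = run_ranks(4, skip_rank=2)    # rank 2 never reaches the barrier
print("healthy run, failed ranks:", healthy)
print("rank 2 skipped, failed ranks:", stuck)
```

In the second run the failing ranks are the waiting ones, not the missing one, so the rank named in a timeout message is not necessarily the rank that caused it.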

generic

Official documentation

https://pytorch.org/docs/stable/distributed.html

Solutions

Check for uneven data loading across ranks. Use torch.utils.data.DistributedSampler with drop_last=True to ensure all ranks have the same number of batches.
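
A minimal sketch of the sampler setup described above, under stated assumptions: build_loader is a hypothetical helper, the dataset is a toy tensor, and num_replicas and rank are passed explicitly so the snippet runs in a single process. In a real job launched with an initialized process group, both arguments can be omitted and are read from the group.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def build_loader(dataset, rank, world_size, batch_size):
    """Build a per-rank DataLoader that yields the same number of batches on every rank."""
    sampler = DistributedSampler(
        dataset,
        num_replicas=world_size,  # explicit here so no process group is required
        rank=rank,
        shuffle=True,
        drop_last=True,  # drop the tail so the dataset splits evenly across ranks
    )
    # drop_last on the DataLoader also discards a trailing partial batch.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)


# Toy dataset with 10 samples, which does not divide evenly across 4 ranks.
dataset = TensorDataset(torch.arange(10, dtype=torch.float32).unsqueeze(1))
world_size = 4
batches_per_rank = [
    len(build_loader(dataset, rank, world_size, batch_size=2))
    for rank in range(world_size)
]
print("batches per rank:", batches_per_rank)
```

When shuffle=True, calling sampler.set_epoch(epoch) at the start of each epoch keeps the shuffling different across epochs while remaining consistent across ranks.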

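Temporarily increase the process-group timeout (the timeout argument of torch.distributed.init_process_group) and add per-rank logging around each barrier to identify which rank is slow or never arrives.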
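
One way this could look is sketched below. It assumes the job is started by a launcher such as torchrun that sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT. The helpers init_distributed and logged_barrier, and the barrier_fn parameter, are hypothetical names introduced for illustration; only init_process_group and barrier are torch.distributed calls.

```python
import logging
import os
import time
from datetime import timedelta

import torch.distributed as dist

log = logging.getLogger("barrier_debug")


def init_distributed(backend="nccl", timeout_minutes=30):
    """Initialise the default process group with a longer collective timeout."""
    logging.basicConfig(level=logging.INFO)
    # Assumption: rank and world size come from the launcher's environment variables.
    dist.init_process_group(backend=backend, timeout=timedelta(minutes=timeout_minutes))


def logged_barrier(tag, barrier_fn=None):
    """Log when this rank reaches and leaves a barrier; return the seconds spent waiting."""
    rank = int(os.environ.get("RANK", "0"))
    # barrier_fn is a hypothetical hook so the helper can be exercised without a process group.
    barrier_fn = barrier_fn or dist.barrier
    arrived = time.monotonic()
    log.info("rank %d reached barrier %r", rank, tag)
    barrier_fn()
    waited = time.monotonic() - arrived
    log.info("rank %d left barrier %r after %.3f s", rank, tag, waited)
    return waited
```

In the resulting logs, a rank with a late or missing "reached barrier" line is the likely straggler, while ranks with long waits arrived early. The longer timeout only buys time to collect this evidence. Setting the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL may provide additional diagnostics.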

Ineffective attempts

Common but ineffective approaches:

Failure rate: 80%
Simply increasing the timeout does not fix the root cause, such as a deadlock or imbalanced workload.
Failure rate: 90%
Killing and restarting all processes without addressing the imbalance or network issue will result in the same error.