pytorch network_error ai_generated true

RuntimeError: [torch.distributed] Barrier timeout after 600000 ms

ID: pytorch/distributed-barrier-timeout

Fix rate: 80%
Confidence: 88%
Evidence count: 1
First seen: 2023-05-20

Version compatibility

Version         Status   Introduced   Deprecated   Notes
torch>=1.13.0   active
NCCL>=2.14      active

Root cause analysis

A rank in a distributed training job failed to reach the barrier within the timeout (600000 ms, i.e. 10 minutes), usually because of a deadlock, a slow rank, or a network partition.

generic

Official documentation

https://pytorch.org/docs/stable/distributed.html

Solutions

  1. Check for uneven data loading across ranks. Use torch.utils.data.DistributedSampler with drop_last=True to ensure all ranks have the same number of batches.
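  2. Raise the process group timeout (the timeout argument of torch.distributed.init_process_group) and add per-rank logging around the barrier to identify which rank is slow or stuck. The longer timeout only buys time for diagnosis; it is not a fix by itself.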
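
The two steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive setup: the helper names (build_loader, init_with_timeout, timed_barrier), the 30-minute timeout, and the assumption that the job is launched with torchrun (so that rank and world size come from environment variables) are all hypothetical choices made for this example.

```python
import datetime
import logging
import time

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def build_loader(dataset, world_size, rank, batch_size):
    """Hypothetical helper: give every rank the same number of batches."""
    # drop_last=True on the sampler drops the tail of the dataset so that it
    # splits evenly across ranks instead of being padded with repeats.
    sampler = DistributedSampler(
        dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True,
        drop_last=True,
    )
    # drop_last=True on the loader also removes a final partial batch.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)


def init_with_timeout(minutes=30):
    """Hypothetical helper: init the process group with a longer timeout.

    Assumes a torchrun-style launch that sets RANK, WORLD_SIZE, MASTER_ADDR
    and MASTER_PORT. The 30-minute default is an arbitrary example value.
    """
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=minutes),
    )


def timed_barrier(log):
    """Hypothetical helper: log entry and exit of a barrier per rank.

    A rank that logs "entering" late, or never, is the slow or stuck one.
    """
    rank = dist.get_rank()
    start = time.monotonic()
    log.info("rank %d entering barrier", rank)
    dist.barrier()
    log.info("rank %d left barrier after %.1fs", rank, time.monotonic() - start)
```

Comparing the "entering barrier" timestamps across the per-rank logs is usually enough to see which rank arrives last or never arrives. The sampler check can also be run offline, without initializing a process group, by passing num_replicas and rank explicitly and comparing len(loader) across ranks.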

Ineffective attempts

Common but ineffective approaches:

  1. 80% failure rate

    Simply increasing the timeout does not fix the root cause, such as a deadlock or imbalanced workload.

  2. 90% failure rate

    Killing and restarting all processes without addressing the imbalance or network issue will result in the same error.