pytorch network_error ai_generated true

RuntimeError: [torch.distributed] Barrier timeout after 600000 ms

ID: pytorch/distributed-barrier-timeout

Fix Rate: 80%
Confidence: 88%
Evidence: 1
First Seen: 2023-05-20

Version Compatibility

Version          Status   Introduced   Deprecated   Notes
torch>=1.13.0    active
NCCL>=2.14       active

Root Cause

A rank in a distributed training setup failed to reach the barrier within the timeout, usually due to a deadlock, a slow rank, or a network partition.
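
The failure mode can be reproduced without GPUs. The sketch below is only an analogy: it uses threads and the standard library's threading.Barrier as a stand-in for ranks and a collective barrier, and the function name run_ranks is hypothetical. It shows how one rank that runs fewer steps leaves the others waiting until the timeout fires.

```python
import threading


def run_ranks(steps_per_rank, timeout_s=0.5):
    """Simulate ranks as threads; each rank calls a shared barrier once per step."""
    barrier = threading.Barrier(len(steps_per_rank))
    outcome = {}

    def worker(rank, steps):
        try:
            for _ in range(steps):
                # Stand-in for a collective call that every rank must join.
                barrier.wait(timeout=timeout_s)
            outcome[rank] = "finished"
        except threading.BrokenBarrierError:
            # Raised when a peer never arrives before the timeout.
            outcome[rank] = "barrier timeout"

    threads = [
        threading.Thread(target=worker, args=(rank, steps))
        for rank, steps in enumerate(steps_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print(run_ranks([3, 3, 3]))  # every rank runs the same number of steps
    print(run_ranks([3, 3, 2]))  # rank 2 runs one step fewer than its peers
```

In the second call, rank 2 finishes early and never joins the third barrier, so ranks 0 and 1 report a barrier timeout. In real training the same pattern appears when one rank has fewer batches than the others.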

generic

Official Documentation

https://pytorch.org/docs/stable/distributed.html

Workarounds

  1. 80% success: Check for uneven data loading across ranks. Use torch.utils.data.DistributedSampler with drop_last=True to ensure all ranks have the same number of batches.
  2. 70% success: Increase the NCCL timeout and add logging to identify which rank is slow.
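
The first workaround can be sketched as follows. This is a minimal illustration, not a definitive setup: the helper names uneven_ranks, build_loader, and report_batch_counts are hypothetical, and the code assumes the process group has already been initialized (for example under torchrun) before build_loader or report_batch_counts is called.

```python
from collections import Counter


def uneven_ranks(batches_per_rank):
    """Return the ranks whose batch count differs from the most common count."""
    typical, _ = Counter(batches_per_rank).most_common(1)[0]
    return [rank for rank, n in enumerate(batches_per_rank) if n != typical]


def build_loader(dataset, batch_size):
    """Build a DataLoader that shards `dataset` evenly across ranks."""
    # Imported here so the pure-Python helper above works without PyTorch.
    from torch.utils.data import DataLoader, DistributedSampler

    # Rank and world size are read from the initialized process group.
    # drop_last=True drops the tail of the dataset instead of padding it.
    sampler = DistributedSampler(dataset, shuffle=True, drop_last=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)


def report_batch_counts(loader):
    """Gather len(loader) from every rank and print any mismatch on rank 0."""
    import torch.distributed as dist

    counts = [None] * dist.get_world_size()
    dist.all_gather_object(counts, len(loader))
    odd = uneven_ranks(counts)
    if odd and dist.get_rank() == 0:
        print(f"batches per rank: {counts}; ranks out of step: {odd}")
    return counts
```

Calling report_batch_counts(loader) once before the training loop gives a quick check that every rank will run the same number of steps. When shuffle=True, call loader.sampler.set_epoch(epoch) at the start of each epoch so the shuffle order changes between epochs. With the NCCL backend, all_gather_object expects each rank to have selected its own GPU with torch.cuda.set_device beforehand.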

Dead Ends

Common approaches that don't work:

  1. 80% fail: Simply increasing the timeout only delays the failure; it does not fix the root cause, such as a deadlock or an imbalanced workload.

  2. 90% fail: Killing and restarting all processes without addressing the imbalance or network issue results in the same error.
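
Because a longer timeout on its own is a dead end, it is more useful when paired with per-rank timing that points at the slow or stuck rank. The sketch below is one possible approach under stated assumptions: the helper timed_barrier and the label "end_of_epoch" are hypothetical, the 30-minute timeout is an arbitrary example value, and main() assumes a launch via torchrun (which sets LOCAL_RANK) on a machine with NCCL and GPUs.

```python
import logging
import time
from datetime import timedelta

log = logging.getLogger("barrier_debug")


def timed_barrier(barrier_fn, rank, label):
    """Call `barrier_fn`, log how long this rank waited, and return the wait in seconds."""
    log.info("rank %d reached barrier '%s'", rank, label)
    start = time.monotonic()
    barrier_fn()
    waited = time.monotonic() - start
    log.info("rank %d left barrier '%s' after %.2f s", rank, label, waited)
    return waited


def main():
    # Imported here so timed_barrier stays usable without PyTorch installed.
    import os

    import torch
    import torch.distributed as dist

    logging.basicConfig(level=logging.INFO)
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # A longer timeout buys time to collect logs; it does not remove a deadlock.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))
    rank = dist.get_rank()
    # ... training steps would run here ...
    timed_barrier(dist.barrier, rank, "end_of_epoch")
    dist.destroy_process_group()
```

Reading the logs: the rank with the shortest wait arrived last and is the slow one, while a rank that never logs "reached barrier" is stuck before the barrier. Setting the environment variables NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL may add further detail from the libraries themselves.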