# RuntimeError: [torch.distributed] Barrier timed out after 600000 ms

- **ID:** `pytorch/distributed-barrier-timeout`
- **Domain:** pytorch
- **Category:** network_error
- **Verification level:** ai_generated
- **Fix rate:** 80%

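## Root Cause

A process in the distributed training job failed to reach the barrier within the timeout (600000 ms, i.e. 10 minutes, in this error), typically because of a deadlock, a rank that is running too slowly, or a network partition.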
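
To make the imbalanced-workload case concrete, here is a minimal, framework-free sketch of the arithmetic. It assumes a hypothetical setup in which the dataset is split into contiguous shards with no padding and partial batches are kept; `batches_per_rank` is an illustrative helper, not a PyTorch API. When the counts differ, the rank with fewer batches leaves its training loop early and waits at the barrier while the other ranks are still working through their extra batches.

```python
import math


def batches_per_rank(num_samples, world_size, batch_size):
    """Return the number of batches each rank sees under a naive contiguous split."""
    base, extra = divmod(num_samples, world_size)
    counts = []
    for rank in range(world_size):
        # The first `extra` ranks receive one additional sample.
        shard = base + (1 if rank < extra else 0)
        # The partial final batch is kept, so round up.
        counts.append(math.ceil(shard / batch_size))
    return counts


# One leftover sample is enough to give rank 0 an extra batch.
print(batches_per_rank(num_samples=1001, world_size=4, batch_size=10))
```

Any mismatch in this list means the ranks do not execute the same number of steps, which is one way a barrier can end up waiting for the full timeout.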

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| torch>=1.13.0 | active | — | — |
| NCCL>=2.14 | active | — | — |

## Solutions

1. Check for uneven data loading across ranks. Use `torch.utils.data.DistributedSampler` with `drop_last=True` to ensure all ranks have the same number of batches.
2. Increase the NCCL timeout and add logging to identify which rank is slow.
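
The two steps above could look roughly like the sketch below. It is an illustration under stated assumptions, not a definitive implementation: `build_loader`, `init_with_longer_timeout`, `timed_barrier`, and the logger name are hypothetical helpers, the 30-minute timeout is an arbitrary example value, and the process group setup assumes a launcher such as `torchrun` has set `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`.

```python
import datetime
import logging
import time

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

log = logging.getLogger("barrier_debug")  # hypothetical logger name


def build_loader(dataset, rank, world_size, batch_size):
    """Build a loader that yields the same number of batches on every rank."""
    # drop_last=True on the sampler trims the tail so the dataset splits evenly.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, drop_last=True
    )
    # drop_last=True on the loader also discards a trailing partial batch.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)


def init_with_longer_timeout(minutes=30):
    """Initialize the default process group with a larger timeout (example value)."""
    dist.init_process_group(
        backend="nccl", timeout=datetime.timedelta(minutes=minutes)
    )


def timed_barrier(tag):
    """Log when this rank reaches and leaves a barrier."""
    rank = dist.get_rank()
    start = time.monotonic()
    log.info("rank %d reached barrier %r", rank, tag)
    dist.barrier()
    log.info(
        "rank %d left barrier %r after %.1f s", rank, tag, time.monotonic() - start
    )


if __name__ == "__main__":
    # Single-process check of the sampler; no process group is needed for this part.
    data = TensorDataset(torch.arange(103))
    for r in range(4):
        print(r, len(build_loader(data, rank=r, world_size=4, batch_size=8)))
```

With logging configured (for example `logging.basicConfig(level=logging.INFO)`), the rank whose "reached barrier" line appears last, or never appears, is the one to investigate. Depending on the PyTorch version, the NCCL backend may only enforce the `timeout` argument when blocking wait or async error handling is enabled, so it is worth checking the documentation for the version in use.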

## Failed Attempts

- Simply increasing the timeout does not fix the root cause, such as a deadlock or imbalanced workload. (80% failure rate)
- Killing and restarting all processes without addressing the imbalance or network issue will result in the same error. (90% failure rate)
