# RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: watchdog callback timed out

- **ID:** `pytorch/ddp-nccl-timeout-during-allreduce`
- **Domain:** pytorch
- **Category:** network_error
- **Verification:** ai_generated
- **Fix Rate:** 70%

## Root Cause

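An NCCL collective operation (e.g., allreduce) timed out because at least one rank was slow or unresponsive and did not complete the collective before the watchdog timeout expired. Common causes are network congestion, GPU compute imbalance across ranks, and hardware failure.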

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| pytorch>=1.12.0 | active | — | — |
| nccl>=2.14 | active | — | — |

## Workarounds

1. **Check for GPU compute imbalance by profiling each rank's forward/backward time. Ensure all ranks process similar amounts of data (e.g., use DistributedSampler with drop_last=True).** (80% success)
2. **Increase the collective timeout (e.g., to 600 seconds) to accommodate slow networks or large models. In PyTorch it is set with the `timeout` argument of `torch.distributed.init_process_group`.** (75% success)
3. **Use the NCCL_DEBUG=INFO environment variable to get detailed debug logs and identify the slow rank or network issue.** (85% success)
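
For workaround 1, the per-rank profiling and the balanced data loading could look roughly like the sketch below. It is illustrative only: `timed`, `find_stragglers`, and `build_balanced_loader` are hypothetical helper names, and the 1.5x tolerance is an arbitrary assumption rather than a recommended threshold.

```python
import time
from statistics import median


def timed(fn, *args, sync=None, **kwargs):
    """Run fn and return (result, elapsed_seconds).

    On GPU, pass sync=torch.cuda.synchronize so queued kernels finish
    before the clock is read; otherwise only launch overhead is measured.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()
    return result, time.perf_counter() - start


def find_stragglers(step_seconds, tolerance=1.5):
    """Return ranks whose mean step time exceeds tolerance x the median rank.

    step_seconds maps rank -> list of measured step durations in seconds.
    The tolerance value is an assumption chosen for illustration.
    """
    means = {rank: sum(t) / len(t) for rank, t in step_seconds.items()}
    typical = median(means.values())
    return sorted(rank for rank, m in means.items() if m > tolerance * typical)


def build_balanced_loader(dataset, batch_size):
    """Build a DataLoader that gives every rank the same number of full batches."""
    # Imported lazily so the helpers above work without PyTorch installed.
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # Needs an initialized process group to infer rank and world size.
    sampler = DistributedSampler(dataset, shuffle=True, drop_last=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)
```

One way to use this is to time a few forward/backward steps on every rank, gather the lists onto one rank (for example with `torch.distributed.all_gather_object`), and pass the resulting dictionary to `find_stragglers`. When shuffling, call `set_epoch(epoch)` on the sampler at the start of each epoch.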

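Workarounds 2 and 3 can both be applied at process start-up. The following is a minimal sketch, assuming the job is started by a launcher such as `torchrun` that provides the rank and world-size environment variables. `nccl_settings` and `init_distributed` are hypothetical helper names; the 600-second value is the example figure from workaround 2. In PyTorch the collective timeout is passed as the `timeout` argument of `torch.distributed.init_process_group`.

```python
import os
from datetime import timedelta


def nccl_settings(timeout_seconds=600, env=None):
    """Enable NCCL debug logging and return the collective timeout.

    env defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    # NCCL reads this when a communicator is created, so set it first.
    # setdefault keeps any value already exported by the launcher.
    env.setdefault("NCCL_DEBUG", "INFO")
    return timedelta(seconds=timeout_seconds)


def init_distributed(timeout_seconds=600):
    """Initialize the default process group with a longer timeout."""
    import torch.distributed as dist  # lazy import; requires PyTorch

    dist.init_process_group(
        backend="nccl",
        timeout=nccl_settings(timeout_seconds),
    )
```

Calling `init_distributed()` once per process, before any model or collective is created, is enough. The debug output is verbose, so it is usually worth enabling only while diagnosing the slow rank.
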
## Dead Ends

- **Increasing the timeout alone:** The timeout is a symptom, not the root cause; increasing it delays the failure but doesn't prevent it. (60% fail)
- **Adding barriers:** Barriers make all ranks wait for the slow one, which can make a timeout more likely. (70% fail)
- **Restarting the job:** The root cause (e.g., network latency) persists across restarts. (80% fail)
