# etcdserver: request timed out, possible leader election

- **ID:** `kubernetes/etcd-leader-election-failure`
- **Domain:** kubernetes
- **Category:** system_error
- **Verification:** ai_generated
- **Fix Rate:** 70%

## Root Cause

The etcd cluster is experiencing a network partition or high disk I/O latency. Either condition can delay or prevent leader election, and Kubernetes API requests that depend on etcd time out while the cluster has no stable leader.
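
As a rough way to gauge the disk side of this, the sketch below times small write-plus-`fsync` cycles in a directory and reports the 99th percentile. It is only an approximation of what etcd measures in its `etcd_disk_wal_fsync_duration_seconds` metric, not a replacement for it. The function names `measure_fsync_latency_ms` and `percentile` are hypothetical helpers, and the sample count and block size are arbitrary assumptions. etcd's documentation suggests keeping the 99th percentile of WAL fsync duration below roughly 10 ms.

```python
import math
import os
import sys
import tempfile
import time


def measure_fsync_latency_ms(directory, samples=200, block_size=2048):
    """Write small blocks and fsync each one; return fsync latencies in milliseconds."""
    latencies = []
    payload = b"\0" * block_size
    # Create the probe file inside the directory under test so the same device is exercised.
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to stable storage
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Assumption: the target is a scratch directory on the same device as the etcd data dir.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    p99 = percentile(measure_fsync_latency_ms(target), 99)
    print(f"p99 fsync latency in {target}: {p99:.2f} ms")
```

Run it against a scratch directory on the same disk as the etcd data directory rather than inside the data directory itself, and preferably while the node is under its normal load.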

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| etcd 3.5.7 | active | — | — |
| etcd 3.5.9 | active | — | — |
| Kubernetes 1.27 | active | — | — |
| Kubernetes 1.29 | active | — | — |

## Workarounds

1. **Check etcd cluster health, identify unhealthy members, then check their disk I/O and the network latency between etcd nodes.** (80% success)
   ```
   etcdctl endpoint health --cluster   # report the health of every cluster member
   iostat -x 1                         # on an unhealthy member: extended disk I/O stats, refreshed every second
   ping <peer-ip>                      # network latency to another etcd node (<peer-ip> is a placeholder)
   ```
2. **If disk I/O is high, move the etcd data directory to a faster disk (e.g., an SSD) by updating the hostPath in the etcd pod spec or by using a dedicated volume.** (75% success)
   ```
   # etcd flag pointing at the new data directory (example path)
   --data-dir=/var/lib/etcd-ssd
   ```
3. **If a network partition is suspected, ensure that all etcd members can reach each other on port 2380 (peer communication). Check firewall rules and network policies.** (70% success)
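
The peer-port check in the last workaround can be sketched as a plain TCP connection test. This is a minimal illustration, not an etcd tool: `peer_port_open` and `check_peers` are hypothetical helper names, the member addresses are placeholders, and the 3-second timeout is an arbitrary assumption. A successful connection only shows that the port is reachable; it does not prove that TLS peer authentication succeeds.

```python
import socket


def peer_port_open(host, port=2380, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_peers(hosts, port=2380, timeout=3.0):
    """Map each host to whether its peer port accepted a TCP connection."""
    return {host: peer_port_open(host, port, timeout) for host in hosts}


if __name__ == "__main__":
    # Hypothetical member addresses; replace with the real etcd node IPs.
    members = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
    for host, ok in check_peers(members).items():
        print(f"{host}:2380 {'reachable' if ok else 'UNREACHABLE'}")
```

Because firewall rules can be asymmetric, it helps to run the check from every etcd node toward every other node rather than from a single vantage point.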

## Dead Ends

- **Restarting a single etcd member:** this can trigger another leader election and worsen the situation. (70% fail)
- **Increasing the etcd request timeout without fixing the underlying disk or network issue:** this only masks the problem temporarily. (60% fail)
