kubernetes system_error ai_generated partial

etcdserver: request timed out, possible leader election

ID: kubernetes/etcd-leader-election-failure

Fix Rate: 70%

Confidence: 88%

Evidence: 1

First Seen: 2023-09-05

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
etcd 3.5.7	active	—	—	—
etcd 3.5.9	active	—	—	—
Kubernetes 1.27	active	—	—	—
Kubernetes 1.29	active	—	—	—

Root Cause

The etcd cluster is experiencing a network partition or high disk I/O latency. Either condition can cause leader election to fail or take too long, and Kubernetes API requests that depend on etcd then time out.

Official Documentation

https://etcd.io/docs/v3.5/faq/#what-does-request-timed-out-mean

Workarounds

80% success Check etcd cluster health and identify any unhealthy members. On each unhealthy member, check disk I/O with `iostat -x 1` and measure network latency to the other etcd nodes with `ping`.
```
etcdctl endpoint health --cluster
iostat -x 1
ping <etcd-peer-ip>
```
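On clusters where etcd requires client TLS, the bare health command above will usually need endpoint and certificate flags. The following is a minimal sketch, not a definitive procedure: the certificate paths are the kubeadm defaults and are an assumption here, and `etcdctl_tls` and `etcd_cluster_check` are hypothetical helper names introduced only for illustration.

```shell
# Sketch: query health and leader status for every etcd member.
# ASSUMPTION: the paths below are kubeadm defaults; override them for other setups.
ETCD_ENDPOINT="${ETCD_ENDPOINT:-https://127.0.0.1:2379}"
ETCD_CACERT="${ETCD_CACERT:-/etc/kubernetes/pki/etcd/ca.crt}"
ETCD_CERT="${ETCD_CERT:-/etc/kubernetes/pki/etcd/server.crt}"
ETCD_KEY="${ETCD_KEY:-/etc/kubernetes/pki/etcd/server.key}"

# Hypothetical helper: run etcdctl with the endpoint and TLS flags filled in.
etcdctl_tls() {
  etcdctl --endpoints="$ETCD_ENDPOINT" \
          --cacert="$ETCD_CACERT" --cert="$ETCD_CERT" --key="$ETCD_KEY" "$@"
}

# Hypothetical helper: run both cluster-wide checks in sequence.
etcd_cluster_check() {
  # Health of every member known to the cluster.
  etcdctl_tls endpoint health --cluster
  # Per-member status table, including which member is the leader and the raft term.
  etcdctl_tls endpoint status --cluster -w table
}
```

After sourcing the snippet on a control-plane node, run `etcd_cluster_check`. A raft term that keeps increasing between runs suggests repeated leader elections.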
75% success If disk I/O latency is high, move the etcd data directory to a faster disk (e.g., SSD). Either update the hostPath in the etcd pod spec or mount a dedicated volume, then point etcd at the new location:
```
--data-dir=/var/lib/etcd-ssd
```
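Changing `--data-dir` alone starts the member with an empty directory, so the existing data has to be copied first. A rough sketch of that copy step follows, assuming etcd on the node has already been stopped; `migrate_etcd_data` is a hypothetical helper name, and the source path in the usage note is only an example.

```shell
# Sketch: copy an etcd data directory to a faster disk.
# ASSUMPTION: etcd on this node is stopped before this runs; copying a live
# data directory can produce an inconsistent copy.
migrate_etcd_data() {
  src="$1"
  dst="$2"
  # An etcd data directory keeps its WAL and snapshots under member/.
  if [ ! -d "$src/member" ]; then
    echo "no member/ directory under $src" >&2
    return 1
  fi
  mkdir -p "$dst" || return 1
  # -a preserves ownership, permissions, and timestamps.
  cp -a "$src/." "$dst/" || return 1
  # Recent etcd releases expect the data directory to be private to its owner.
  chmod 700 "$dst"
}
```

For example, `migrate_etcd_data /var/lib/etcd /var/lib/etcd-ssd`, followed by updating the hostPath and `--data-dir` before starting etcd again. On a kubeadm cluster, stopping etcd usually means temporarily moving the etcd static pod manifest out of the manifests directory; that detail is an assumption about the setup. Migrate one member at a time so the cluster keeps quorum.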
70% success If a network partition is suspected, verify that every etcd member can reach every other member on port 2380 (peer communication). Check firewall rules and network policies that could block this traffic.
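Peer reachability on port 2380 can be spot-checked from each etcd node with a short script. The sketch below relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command, both of which are assumptions about the host; `check_peer_ports` is a hypothetical helper name and the IP addresses in the usage note are placeholders.

```shell
# Sketch: test TCP reachability of the etcd peer port on each given host.
# ASSUMPTION: bash and coreutils `timeout` are available on the node.
PEER_PORT="${PEER_PORT:-2380}"

check_peer_ports() {
  rc=0
  for host in "$@"; do
    # Opening /dev/tcp/<host>/<port> succeeds only if a TCP connection is established.
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$PEER_PORT" 2>/dev/null; then
      echo "$host:$PEER_PORT reachable"
    else
      echo "$host:$PEER_PORT UNREACHABLE"
      rc=1
    fi
  done
  # Non-zero if any peer could not be reached.
  return $rc
}
```

Run it on every member against the other members, for example `check_peer_ports 10.0.0.11 10.0.0.12`. A successful TCP connection only shows that the port is open; it does not validate peer TLS or confirm that raft traffic is flowing.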

Dead Ends

Common approaches that don't work:

70% fail
Simply restarting one etcd member may worsen the situation by triggering another leader election.
60% fail
Increasing etcd request timeout without fixing underlying disk or network issues only masks the problem temporarily.