kubernetes
system_error
ai_generated
partial
etcdserver: request timed out, possible leader election
ID: kubernetes/etcd-leader-election-failure
Fix Rate: 70%
Confidence: 88%
Evidence: 1
First Seen: 2023-09-05
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| etcd 3.5.7 | active | — | — | — |
| etcd 3.5.9 | active | — | — | — |
| Kubernetes 1.27 | active | — | — | — |
| Kubernetes 1.29 | active | — | — | — |
Root Cause
The etcd cluster is experiencing a network partition or high disk I/O latency. Either condition can cause leader election to fail or take too long, so Kubernetes API requests that depend on etcd time out.
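To narrow down which of the two causes applies, it can help to count the warnings etcd itself logs. The following is a minimal sketch, not an official tool: `count_etcd_symptoms` is a hypothetical helper name, and the matched message strings are assumptions based on typical etcd 3.5 log wording, so adjust them to what your members actually print.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads etcd log lines on stdin and prints how often
# a few symptom messages appear. The patterns are assumed log wording.
count_etcd_symptoms() {
  awk '
    /slow fdatasync/                              { disk++ }
    /leader failed to send out heartbeat on time/ { heartbeat++ }
    /apply request took too long/                 { apply++ }
    END {
      # Unset awk variables print as 0 with %d.
      printf "slow_fdatasync=%d\n", disk
      printf "late_heartbeat=%d\n", heartbeat
      printf "slow_apply=%d\n", apply
    }'
}
```

On a kubeadm-style cluster (an assumption), the logs could be piped in with something like `kubectl -n kube-system logs etcd-<node-name> | count_etcd_symptoms`. A high `slow_fdatasync` count suggests the disk; late heartbeats without slow syncs make the network the more likely suspect.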
Official Documentation
https://etcd.io/docs/v3.5/faq/#what-does-request-timed-out-mean
Workarounds
- 80% success: Check etcd cluster health with `etcdctl endpoint health --cluster`. Identify the unhealthy members, then check their disk I/O with `iostat -x 1` and the network latency between etcd nodes with `ping`.
- 75% success: If disk I/O latency is high, move the etcd data directory to a faster disk (e.g., an SSD). Update the hostPath in the etcd pod spec or mount a dedicated volume, then point etcd at it, for example `--data-dir=/var/lib/etcd-ssd`.
- 70% success: If a network partition is suspected, verify that all etcd members can reach each other on port 2380 (peer communication). Check firewall rules and network policies.
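The health check and the peer-port check above can be sketched as a few shell helpers. This is an illustration under stated assumptions: the function names are hypothetical, the certificate paths follow the usual kubeadm layout, and the filter assumes the plain-text `<endpoint> is unhealthy: ...` output form.

```shell
#!/usr/bin/env bash
# Prints the endpoints that `etcdctl endpoint health` reports as unhealthy.
# Assumes the plain-text output form "<endpoint> is unhealthy: ...".
unhealthy_endpoints() {
  awk '/ is unhealthy/ { print $1 }'
}

# Returns 0 if a TCP connection to host:port opens within 3 seconds.
# Uses bash's /dev/tcp; the port defaults to 2380 (etcd peer traffic).
peer_port_open() {
  local host="$1" port="${2:-2380}"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example wiring, only defined here and not executed.
# Endpoint and certificate paths are assumptions (kubeadm defaults).
check_etcd() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health --cluster 2>&1 | unhealthy_endpoints
}
```

Running `check_etcd` on a control-plane node lists the members to inspect further, and `peer_port_open <peer-ip>` run from each member toward the others shows whether port 2380 is reachable. The `2>&1` is there because some etcdctl versions write health lines to stderr.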
Dead Ends
Common approaches that don't work:
- 70% fail: Simply restarting one etcd member may worsen the situation by triggering another leader election.
- 60% fail: Increasing the etcd request timeout without fixing the underlying disk or network issue only masks the problem temporarily.
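If a member does have to be restarted, it may help to first find out which member currently leads, since taking the leader down forces a leader change. The sketch below is a hypothetical helper that reads the table printed by `etcdctl endpoint status --cluster -w table`; it assumes the etcd 3.5 column order (ENDPOINT, ID, VERSION, DB SIZE, IS LEADER, ...), which may differ in other versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `etcdctl endpoint status --cluster -w table`
# output on stdin and prints the endpoint whose IS LEADER column is "true".
# Assumes IS LEADER is the fifth table column (sixth "|"-separated field).
leader_endpoint() {
  awk -F'|' '
    NF > 6 {
      ep = $2; lead = $6
      gsub(/[ \t]/, "", ep)     # strip table padding
      gsub(/[ \t]/, "", lead)
      if (lead == "true") print ep
    }'
}
```

With the same `etcdctl` connection flags used for the health check, `etcdctl endpoint status --cluster -w table | leader_endpoint` prints the current leader's endpoint, so a non-leader member can be handled first.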