kubernetes system_error ai_generated partial

etcdserver: request timed out, possible leader election

ID: kubernetes/etcd-leader-election-failure

Fix Rate: 70%
Confidence: 88%
Evidence: 1
First Seen: 2023-09-05

Version Compatibility

Version            Status    Introduced    Deprecated    Notes
etcd 3.5.7         active
etcd 3.5.9         active
Kubernetes 1.27    active
Kubernetes 1.29    active

Root Cause

The etcd cluster is experiencing a network partition or high disk I/O latency. Either condition can cause leader election to fail or take too long, so Kubernetes API requests that depend on etcd time out.
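One way to confirm that leader elections are actually happening is to read etcd's Prometheus metrics. The sketch below is a minimal helper, not part of etcd: the function name `etcd_metric` is hypothetical, and the metrics URL in the comments is an assumption (kubeadm clusters commonly expose it on `http://127.0.0.1:2381/metrics`, but your setup may differ). The metric names `etcd_server_leader_changes_seen_total` and `etcd_server_has_leader` are standard etcd metrics.

```shell
# Print the value of an unlabelled metric from Prometheus text-format
# input on stdin. Comment lines (# HELP / # TYPE) never match because
# their first field is "#".
etcd_metric() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Example use on an etcd node (URL is an assumption, adjust as needed):
#   curl -s http://127.0.0.1:2381/metrics | etcd_metric etcd_server_leader_changes_seen_total
#   curl -s http://127.0.0.1:2381/metrics | etcd_metric etcd_server_has_leader
```

A leader-change counter that keeps climbing between two readings suggests repeated elections, which would be consistent with the root cause described above.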

generic

Official Documentation

https://etcd.io/docs/v3.5/faq/#what-does-request-timed-out-mean

Workarounds

  1. (80% success) Check etcd cluster health with `etcdctl endpoint health --cluster`. For each unhealthy member, check disk I/O with `iostat -x 1` and network latency to the other etcd nodes with `ping`.
  2. (75% success) If disk I/O latency is high, move the etcd data directory to a faster disk (e.g., an SSD) by updating the hostPath in the etcd pod spec or by using a dedicated volume, and point etcd at it with `--data-dir=/var/lib/etcd-ssd`.
  3. (70% success) If a network partition is suspected, make sure all etcd members can reach each other on port 2380 (peer communication). Check firewall rules and network policies.
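The health and connectivity checks in the workarounds can be sketched as two small shell helpers. Both function names are hypothetical. `unhealthy_endpoints` assumes the default one-line-per-endpoint output of `etcdctl endpoint health` (`<endpoint> is unhealthy: ...`), and `peer_port_open` assumes a netcat build that supports `-z` and `-w`. On TLS-enabled clusters, `etcdctl` will usually also need `--cacert`, `--cert`, and `--key`.

```shell
# Read `etcdctl endpoint health --cluster` output on stdin and print only
# the endpoints reported as unhealthy (first field of each matching line).
unhealthy_endpoints() {
  awk '/ is unhealthy/ { print $1 }'
}

# Succeed if a TCP connection to an etcd peer port can be opened.
# $1 = host, $2 = port (defaults to 2380, the peer port named above).
peer_port_open() {
  nc -z -w 3 "$1" "${2:-2380}"
}

# Example use (hostnames are placeholders):
#   etcdctl endpoint health --cluster 2>&1 | unhealthy_endpoints
#   for h in etcd-1 etcd-2 etcd-3; do
#     peer_port_open "$h" || echo "cannot reach $h:2380"
#   done
```

Run the port check from every etcd node, not just one, since a partition can be asymmetric.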

Dead Ends

Common approaches that don't work:

  1. (70% fail) Simply restarting one etcd member may worsen the situation by triggering another leader election.
  2. (60% fail) Increasing the etcd request timeout without fixing the underlying disk or network issues only masks the problem temporarily.
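Part of why a restart is risky can be shown with quorum arithmetic: a cluster of N members needs floor(N/2) + 1 healthy members to make progress. The sketch below is only an illustration with hypothetical function names. It checks quorum alone and does not cover the separate risk that restarting the current leader forces a new election.

```shell
# Quorum for an etcd cluster of N members is floor(N/2) + 1.
# $1 = total number of members.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Succeed only if taking one healthy member offline still leaves a quorum.
# $1 = total members, $2 = members currently healthy.
safe_to_restart_one() {
  [ $(( $2 - 1 )) -ge "$(quorum "$1")" ]
}

# Example: a 3-member cluster with one member already unhealthy.
#   safe_to_restart_one 3 2 || echo "restart would lose quorum"
```

In a 3-member cluster that already has one unhealthy member, restarting another leaves a single member, which is below the quorum of 2.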