kubernetes system_error ai_generated partial

etcdserver: request timed out, possible leader election

ID: kubernetes/etcd-leader-election-failure

Fix rate: 70%
Confidence: 88%
Evidence count: 1
First seen: 2023-09-05

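Version compatibility

| Version | Status | Introduced | Deprecated | Notes |
| --- | --- | --- | --- | --- |
| etcd 3.5.7 | active | | | |
| etcd 3.5.9 | active | | | |
| Kubernetes 1.27 | active | | | |
| Kubernetes 1.29 | active | | | |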

Root cause analysis

The etcd cluster is experiencing a network partition or high disk I/O latency. Either condition can cause leader election to fail or take too long, and while the cluster has no leader it cannot commit requests, so Kubernetes API requests time out.

generic
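
One way to tell the two causes apart is to look at etcd's own Prometheus metrics: a rising leader-change counter points at election churn, and a high WAL fsync time points at slow disks. The sketch below is a minimal illustration, not a definitive check. The helper names `etcd_metric` and `avg_wal_fsync_seconds` are hypothetical, and the metrics URL in the usage comments (`http://127.0.0.1:2381/metrics`) is an assumption that depends on how etcd's `--listen-metrics-urls` is configured.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one unlabelled metric from
# Prometheus text-format output read on stdin.
#   $1 = metric name, e.g. etcd_server_leader_changes_seen_total
etcd_metric() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Hypothetical helper: print the mean WAL fsync duration in seconds,
# computed as histogram sum / count. This is a crude proxy; a percentile
# from the histogram buckets is a better signal if Prometheus is available.
avg_wal_fsync_seconds() {
    awk '
        $1 == "etcd_disk_wal_fsync_duration_seconds_sum"   { sum = $2 }
        $1 == "etcd_disk_wal_fsync_duration_seconds_count" { count = $2 }
        END { if (count > 0) printf "%.6f\n", sum / count }
    '
}

# Usage (the URL is an assumption, adjust to your deployment):
#   curl -s http://127.0.0.1:2381/metrics | etcd_metric etcd_server_leader_changes_seen_total
#   curl -s http://127.0.0.1:2381/metrics | avg_wal_fsync_seconds
```

Both helpers only read text from stdin, so they can also be run against a saved metrics dump from an affected node.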

Official documentation

https://etcd.io/docs/v3.5/faq/#what-does-request-timed-out-mean

Solutions

  1. Check etcd cluster health with `etcdctl endpoint health --cluster`. For each unhealthy member, check disk I/O latency with `iostat -x 1` and network latency to the other etcd nodes with `ping`.
  2. If disk I/O latency is high, move the etcd data directory to a faster disk (e.g., an SSD): update the hostPath in the etcd pod spec or mount a dedicated volume, and point etcd at it with `--data-dir=/var/lib/etcd-ssd`.
  3. If a network partition is suspected, verify that every etcd member can reach every other member on port 2380 (peer communication). Check firewall rules and network policies.
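
The triage in steps 1 and 3 can be sketched as a small script. It assumes that `etcdctl endpoint health --cluster` prints one line per endpoint in the form `<endpoint> is healthy: ...` or `<endpoint> is unhealthy: ...`, and that an `nc` supporting `-z` and `-w` is installed. The helper names and the host names in the usage comments are hypothetical, and the TLS flags (`--cacert`, `--cert`, `--key`) that a Kubernetes-managed etcd usually requires are omitted for brevity.

```shell
#!/bin/sh
# Hypothetical helper: read `etcdctl endpoint health --cluster` output on
# stdin and print the endpoints reported as unhealthy, one per line.
# Assumes lines of the form "<endpoint> is unhealthy: <reason>".
unhealthy_endpoints() {
    awk '$2 == "is" && $3 == "unhealthy:" { print $1 }'
}

# Hypothetical helper: report whether the etcd peer port (2380) on a host
# accepts TCP connections within 3 seconds.
check_peer_port() {
    if nc -z -w 3 "$1" 2380; then
        echo "$1:2380 reachable"
    else
        echo "$1:2380 NOT reachable"
    fi
}

# Usage (host names are placeholders):
#   etcdctl endpoint health --cluster 2>&1 | unhealthy_endpoints
#   for host in etcd-1 etcd-2 etcd-3; do check_peer_port "$host"; done
```

The `2>&1` in the usage line is there because the health lines may be written to standard error rather than standard output. Run the port check from each etcd node in turn, since a partition can be one-directional.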

Ineffective attempts

Common but ineffective approaches:

  1. 70% failure rate

    Simply restarting one etcd member may worsen the situation by triggering another leader election.

  2. 60% failure rate

    Increasing etcd request timeout without fixing underlying disk or network issues only masks the problem temporarily.
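
Before restarting any member, it may help to confirm whether the cluster currently has exactly one leader, so that a restart does not remove the only stable leader. The sketch below is a rough check under a stated assumption: that the default output of `etcdctl endpoint status --cluster` in etcd 3.5 is one comma-separated line per endpoint with the is-leader flag in the fifth column. Other versions may order the columns differently, so verify this against your own output first. The helper name `count_leaders` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `etcdctl endpoint status --cluster` output
# (default comma-separated format) on stdin and print how many endpoints
# report themselves as leader.
# Assumption: the fifth field is the is-leader flag ("true" or "false").
count_leaders() {
    awk -F', ' '$5 == "true" { n++ } END { print n + 0 }'
}

# Usage:
#   etcdctl endpoint status --cluster | count_leaders
# A result of 1 suggests a stable leader; 0 suggests an election is in
# progress or the cluster has lost quorum.
```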