kubernetes system_error ai_generated partial

Error from server: etcdserver: request timed out, possible leader election

ID: kubernetes/etcd-leader-election-timeout

Fix Rate: 75%
Confidence: 85%
Evidence: 1
First Seen: 2023-06-20

Version Compatibility

| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| etcd 3.5 | active | | | |
| kubernetes 1.27 | active | | | |
| kubernetes 1.28 | active | | | |

Root Cause

The etcd cluster is experiencing a leader election or network partition, causing API server requests to time out.
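
One way to check whether the cluster currently has a leader is to inspect the default (comma-separated) output of `etcdctl endpoint status --cluster`. The sketch below is only illustrative: the helper names `leader_count` and `max_raft_term` are hypothetical, and it assumes the etcd 3.5 field order, where field 5 is "is leader" and field 7 is "raft term".

```shell
#!/usr/bin/env bash
# Helpers that read `etcdctl endpoint status` output (default "simple"
# format, fields separated by ", ") from stdin.
# Assumption: field 5 is "is leader" and field 7 is "raft term" (etcd 3.5).

# Print how many members report themselves as leader.
leader_count() {
  awk -F', ' '$5 == "true" { n++ } END { print n + 0 }'
}

# Print the highest raft term seen. A term that keeps rising between
# samples suggests repeated leader elections.
max_raft_term() {
  awk -F', ' '$7 + 0 > max { max = $7 + 0 } END { print max + 0 }'
}
```

For example, `etcdctl endpoint status --cluster | leader_count` printing 0 would indicate that no reachable member currently considers itself leader. Sampling `max_raft_term` a few seconds apart gives a rough signal of election churn.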

generic

Official Documentation

https://etcd.io/docs/v3.5/faq/#what-does-etcd-request-timed-out-mean

Workarounds

  1. 80% success Run `etcdctl endpoint health --cluster` and `etcdctl endpoint status --cluster -w table` to identify unhealthy members. If no member reports itself as leader, make sure a majority of etcd members are reachable, since etcd cannot elect a leader without a majority.
  2. 70% success Restore from a snapshot: run `ETCDCTL_API=3 etcdctl snapshot restore /path/to/backup.db --data-dir /var/lib/etcd` on a new etcd instance, then restart the API server so that it points to the restored etcd. A restore discards any writes made after the snapshot was taken.
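
The majority check in step 1 can be sketched as a few shell helpers. This is a minimal illustration, not an official tool: the function names `quorum_size`, `count_healthy`, and `has_quorum` are hypothetical, and `count_healthy` assumes that healthy endpoints appear in `etcdctl endpoint health` output as lines containing " is healthy".

```shell
#!/usr/bin/env bash
# Majority needed for a cluster of N voting members: floor(N/2) + 1.
# $1 = total number of members
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Count endpoints reported healthy in `etcdctl endpoint health` output
# read from stdin.
# Assumption: healthy lines contain " is healthy" (etcd 3.5 default output).
count_healthy() {
  # grep -c exits non-zero when it counts 0 matches; keep the status clean.
  grep -c ' is healthy' || true
}

# Succeed when the healthy count reaches quorum.
# $1 = total members, $2 = healthy members
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}
```

With three members, for instance, `has_quorum 3 "$(etcdctl endpoint health --cluster | count_healthy)"` would succeed only if at least two endpoints report healthy. If it fails, restoring connectivity between members comes before any restore from backup.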

Dead Ends

Common approaches that don't work:

  1. 90% fail Restarting the API server. The API server is not the root cause, so restarting it won't fix etcd instability.
  2. 70% fail Increasing request timeouts. Longer timeouts may mask the issue but don't address the underlying etcd cluster problem.
  3. 60% fail Rebooting nodes. If the cluster is in the middle of a leader election, rebooting nodes can worsen the situation and cause data loss.