K8S-LEADER-001 kubernetes system_error ai_generated true

Election: leader election lost

ID: kubernetes/leader-election-lost

Fix Rate: 80%
Confidence: 85%
Evidence: 1
First Seen: 2023-06-15

Version Compatibility

Version          Status  Introduced  Deprecated  Notes
kubernetes 1.23  active  -           -           -
kubernetes 1.24  active  -           -           -
kubernetes 1.25  active  -           -           -
kubernetes 1.28  active  -           -           -

Root Cause

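A controller or operator pod lost its lease lock because it could not renew the Lease in time, typically due to a network partition, a pod restart, or an etcd timeout. Until a replica acquires the lease again, there is a temporary leadership gap during which no replica acts as leader.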
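
To make the lease mechanics concrete, here is a minimal Python sketch of how an expired lease could be recognized from a Lease object. It assumes the `coordination.k8s.io/v1` Lease fields `holderIdentity`, `leaseDurationSeconds`, and `renewTime`; the function names and the example lease (including the holder name `controller-abc123`) are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta, timezone


def parse_micro_time(value: str) -> datetime:
    """Parse a Kubernetes MicroTime string such as 2023-06-15T10:00:00.000000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def lease_is_expired(lease: dict, now: datetime) -> bool:
    """Return True when nobody holds the lease or the holder stopped renewing it."""
    spec = lease.get("spec", {})
    holder = spec.get("holderIdentity")
    renew_time = spec.get("renewTime")
    duration = spec.get("leaseDurationSeconds")
    if not holder or renew_time is None or duration is None:
        return True
    # The holder keeps leadership only while it renews within the lease duration.
    deadline = parse_micro_time(renew_time) + timedelta(seconds=duration)
    return now > deadline


# Hypothetical Lease, shaped like the output of `kubectl get lease <name> -o json`.
example_lease = {
    "spec": {
        "holderIdentity": "controller-abc123",
        "leaseDurationSeconds": 15,
        "renewTime": "2023-06-15T10:00:00.000000Z",
    }
}

if __name__ == "__main__":
    print(lease_is_expired(example_lease, datetime.now(timezone.utc)))
```

A lease that reports as expired for longer than a few retry periods suggests that no replica is currently able to renew or acquire it.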

Official Documentation

https://kubernetes.io/docs/concepts/architecture/controller/

Workarounds

  1. 85% success Scale the controller Deployment down to 0 replicas, wait 30 seconds, then scale it back up to 1 to force a clean leader election.
  2. 75% success Check network policies and firewall rules that may block the controller pods from reaching the Kubernetes API server. Leader election is coordinated through Lease objects on the API server, not through direct traffic between controller replicas; port 2380 (the etcd peer port) carries traffic only between etcd members.
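
The scale-down and scale-up workaround can be sketched as a small Python helper around `kubectl scale`. This is an illustration under stated assumptions, not a definitive procedure: the deployment name `my-controller` and namespace `default` are placeholders, and the helper names are hypothetical.

```python
import subprocess
import time


def scale_command(deployment: str, namespace: str, replicas: int) -> list:
    """Build the kubectl argv that sets the replica count of a Deployment."""
    return ["kubectl", "scale", "deployment", deployment,
            f"--replicas={replicas}", "--namespace", namespace]


def restart_plan(deployment: str, namespace: str, wait_seconds: int = 30) -> list:
    """Scale to 0, wait for the old lease to lapse, then scale back to 1."""
    return [
        ("run", scale_command(deployment, namespace, 0)),
        ("sleep", wait_seconds),
        ("run", scale_command(deployment, namespace, 1)),
    ]


def execute(plan: list, run=subprocess.run, sleep=time.sleep) -> None:
    """Carry out the plan; run and sleep are injectable for dry runs."""
    for action, arg in plan:
        if action == "run":
            run(arg, check=True)  # raises CalledProcessError if kubectl fails
        else:
            sleep(arg)


if __name__ == "__main__":
    # Placeholder names; replace with the affected controller and namespace.
    for step in restart_plan("my-controller", "default"):
        print(step)
```

Printing the plan first gives a dry run; calling `execute(restart_plan(...))` would apply it against the cluster that `kubectl` is currently configured for.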

Dead Ends

Common approaches that don't work:

  1. Restart all replicas of the controller simultaneously. 65% fail

    Restarting all replicas at once can cause a prolonged leader election storm, making the problem worse.

  2. Delete the lease object in etcd manually. 80% fail

    Manually deleting the lease may cause data inconsistency and is not recommended; the leader election mechanism should self-heal.
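
Rather than deleting the lease, one read-only option is to sample the Lease a few times and check whether leadership has settled. The Python sketch below assumes the `holderIdentity` and `leaseTransitions` fields of a `coordination.k8s.io/v1` Lease; the function names and sample data are hypothetical.

```python
def lease_summary(lease: dict) -> tuple:
    """Extract the current holder and the number of leadership transitions."""
    spec = lease.get("spec", {})
    return spec.get("holderIdentity"), spec.get("leaseTransitions", 0)


def leadership_is_stable(samples: list) -> bool:
    """True when every sample shows the same holder and no new transitions."""
    if not samples:
        return False
    summaries = [lease_summary(sample) for sample in samples]
    holder, _ = summaries[0]
    if not holder:
        return False  # nobody holds the lease yet
    return all(summary == summaries[0] for summary in summaries)


# Hypothetical snapshots taken a few seconds apart.
settled = [
    {"spec": {"holderIdentity": "controller-a", "leaseTransitions": 4}},
    {"spec": {"holderIdentity": "controller-a", "leaseTransitions": 4}},
]
churning = [
    {"spec": {"holderIdentity": "controller-a", "leaseTransitions": 4}},
    {"spec": {"holderIdentity": "controller-b", "leaseTransitions": 5}},
]

if __name__ == "__main__":
    print(leadership_is_stable(settled), leadership_is_stable(churning))
```

A holder that keeps changing between samples points at an ongoing election storm, while a steady holder suggests the election has already self-healed.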