RESOURCE_EXHAUSTED cloud resource_error ai_generated true

Node pool upgrade failed: Resource exhausted: insufficient CPU available in zone us-central1-a

ID: cloud/gcp-gke-node-pool-upgrade-failed

Fix Rate: 82%
Confidence: 86%
Evidence: 1
First Seen: 2024-09-05

Version Compatibility

Version                    Status   Introduced   Deprecated   Notes
GKE 1.28.5-gke.1500        active   -            -            -
Kubernetes 1.28            active   -            -            -
Compute Engine API v1      active   -            -            -

Root Cause

GKE cannot create the additional temporary nodes that a rolling node pool upgrade requires, because either the Compute Engine CPU quota for the region is exhausted or the target zone (here us-central1-a) does not have enough capacity to host them.
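
The quota half of this root cause can be checked with simple arithmetic before retrying the upgrade. The sketch below is illustrative only: it assumes the region's quota data has been exported with `gcloud compute regions describe REGION --format=json` (which lists quotas as `metric`, `limit`, and `usage` entries), and the sample numbers, the surge size, and the helper names `cpu_headroom` and `surge_fits` are hypothetical. It says nothing about physical zone capacity, which can still be short even when quota is available.

```python
def cpu_headroom(region: dict, metric: str = "CPUS") -> float:
    """Return unused quota (limit minus usage) for the given metric."""
    for quota in region.get("quotas", []):
        if quota.get("metric") == metric:
            return float(quota["limit"]) - float(quota["usage"])
    raise KeyError(f"metric {metric!r} not found in region quotas")


def surge_fits(region: dict, max_surge: int, vcpus_per_node: int) -> bool:
    """True if the temporary surge nodes fit inside the remaining CPU quota."""
    needed = max_surge * vcpus_per_node
    return needed <= cpu_headroom(region)


if __name__ == "__main__":
    # Hypothetical sample data shaped like the gcloud JSON output.
    sample_region = {
        "quotas": [
            {"metric": "CPUS", "limit": 24.0, "usage": 22.0},
        ]
    }
    # One surge node with 4 vCPUs against the remaining quota.
    print("headroom:", cpu_headroom(sample_region))
    print("surge fits:", surge_fits(sample_region, max_surge=1, vcpus_per_node=4))
```

If the check fails, the shortfall (`max_surge * vcpus_per_node` minus the headroom) is a reasonable lower bound for the quota increase to request.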

generic

Official Documentation

https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster

Workarounds

  1. 90% success: Request a quota increase for Compute Engine CPUs in the affected region in the GCP Console: IAM & Admin > Quotas > 'CPUs' > Edit Quota.
  2. 85% success: Move the upgrade to a different zone: add a node pool in a zone with available capacity, then migrate workloads to it.
  3. 80% success: Temporarily reduce the number of replicas in the cluster so that nodes can be scaled down and quota freed, then perform the upgrade.
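
Workarounds 2 and 3 can be sketched as command lines. The helper below only builds and prints the commands; it does not run them. The cluster, pool, zone, deployment, and namespace names are hypothetical placeholders, and the sketch assumes the old pool's nodes are selected through the GKE node label `cloud.google.com/gke-nodepool`. Adjust flags such as machine type and node count to match the existing pool.

```python
import shlex


def migrate_commands(cluster, location, old_pool, new_pool, new_zone):
    """Commands to add a pool in another zone and drain the old pool."""
    selector = f"cloud.google.com/gke-nodepool={old_pool}"
    return [
        # Create the replacement pool in a zone that has capacity.
        ["gcloud", "container", "node-pools", "create", new_pool,
         f"--cluster={cluster}", f"--location={location}",
         f"--node-locations={new_zone}"],
        # Stop new pods from being scheduled onto the old pool.
        ["kubectl", "cordon", "-l", selector],
        # Evict pods so they reschedule onto the new pool.
        ["kubectl", "drain", "-l", selector,
         "--ignore-daemonsets", "--delete-emptydir-data"],
    ]


def scale_command(deployment, namespace, replicas):
    """Command to temporarily reduce a deployment's replica count."""
    return ["kubectl", "scale", f"deployment/{deployment}",
            f"--namespace={namespace}", f"--replicas={replicas}"]


if __name__ == "__main__":
    # All names below are placeholders, not values from the incident.
    steps = migrate_commands("my-cluster", "us-central1",
                             "default-pool", "pool-b", "us-central1-b")
    steps.append(scale_command("web", "default", 2))
    for cmd in steps:
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before running anything against a production cluster. Scaling replicas down frees CPU quota only once the now-idle nodes are actually removed, since quota is consumed by the node VMs.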

Dead Ends

Common approaches that don't work:

  1. 85% fail: More nodes consume more quota, which worsens the exhaustion; the upgrade needs additional quota for temporary nodes, not a larger pool.
  2. 60% fail: Deleting the node pool frees quota, but creating the new pool may still fail if zone capacity is insufficient at that time.
  3. 70% fail: Smaller instances may not meet workload requirements, and the zone may still lack capacity for any instance type.