RESOURCE_EXHAUSTED
cloud
resource_error
ai_generated
true
Node pool upgrade failed: Resource exhausted: insufficient CPU available in zone us-central1-a
ID: cloud/gcp-gke-node-pool-upgrade-failed
82% Fix Rate
86% Confidence
1 Evidence
2024-09-05 First Seen
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| GKE: 1.28.5-gke.1500 | active | — | — | — |
| Kubernetes: 1.28 | active | — | — | — |
| Compute Engine: API v1 | active | — | — | — |
Root Cause
GKE cannot allocate new nodes during the upgrade because the project's regional CPU quota, or the physical capacity of the specified zone, is insufficient to host the additional temporary nodes required for the rolling update.
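To make the arithmetic concrete, here is a minimal sketch of the extra vCPU demand created by the temporary nodes. The function names and all input values are hypothetical and are not taken from this incident; the sketch covers quota only, since a zone can still lack physical capacity even when quota is available.

```python
def surge_vcpus_needed(max_surge: int, vcpus_per_node: int) -> int:
    """vCPUs consumed by the temporary nodes created during a rolling upgrade."""
    if max_surge < 0 or vcpus_per_node <= 0:
        raise ValueError("max_surge must be >= 0 and vcpus_per_node must be > 0")
    return max_surge * vcpus_per_node


def upgrade_fits(quota_limit: int, quota_usage: int,
                 max_surge: int, vcpus_per_node: int) -> bool:
    """True if the remaining CPU quota covers the temporary surge nodes."""
    headroom = quota_limit - quota_usage
    return surge_vcpus_needed(max_surge, vcpus_per_node) <= headroom
```

For example, with a hypothetical quota of 24 vCPUs that is fully used, `upgrade_fits(24, 24, 1, 4)` returns `False`: even a single 4-vCPU temporary node has nowhere to go.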
Official Documentation
https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster
Workarounds
- 90% success: Request a quota increase for Compute Engine CPUs in the affected region from the GCP Console: IAM & Admin > Quotas > 'CPUs' > Edit Quota.
- 85% success: Add a node pool in a different zone that has available capacity, then migrate workloads to it instead of surging in the exhausted zone.
- 80% success: Temporarily scale down workload replicas so the node pool can shrink and release CPU quota, then perform the upgrade.
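Before choosing a workaround, it can help to check how much CPU quota is actually left in the region. The sketch below assumes you have saved the JSON output of `gcloud compute regions describe us-central1 --format=json`, which lists regional quotas as objects with `metric`, `limit`, and `usage` fields. The helper name `cpu_headroom` is hypothetical.

```python
import json


def cpu_headroom(region_json: str) -> float:
    """Return remaining CPU quota (limit - usage) from a region description.

    `region_json` is assumed to be the JSON text printed by
    `gcloud compute regions describe REGION --format=json`.
    """
    region = json.loads(region_json)
    for quota in region.get("quotas", []):
        if quota.get("metric") == "CPUS":
            return float(quota["limit"]) - float(quota["usage"])
    raise ValueError("no CPUS quota entry found in region description")
```

If the result is smaller than the vCPU count of one node in the pool, a quota increase or freeing quota is needed before retrying. A positive result does not rule out a zone capacity shortage, which quota figures do not show.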
Dead Ends
Common approaches that don't work:
- 85% fail: Increasing the node pool size. More nodes consume more quota, worsening the exhaustion; the upgrade needs additional quota for temporary nodes, not a larger pool.
- 60% fail: Deleting and recreating the node pool. Deletion frees quota, but creating the new pool may still fail if zone capacity is insufficient at that time.
- 70% fail: Switching to smaller instance types. Smaller instances may not meet workload requirements, and the zone may still lack capacity for any instance type.