RESOURCE_EXHAUSTED cloud resource_error ai_generated true

Node pool upgrade failed: Resource exhausted: insufficient CPU available in zone us-central1-a

ID: cloud/gcp-gke-node-pool-upgrade-failed

Fix Rate: 82%

Confidence: 86%

Evidence: 1

First Seen: 2024-09-05

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
GKE: 1.28.5-gke.1500	active	—	—	—
Kubernetes: 1.28	active	—	—	—
Compute Engine: API v1	active	—	—	—

Root Cause

GKE cannot create the additional temporary (surge) nodes that a rolling upgrade requires. Either the Compute Engine CPU quota for the region is exhausted, or the target zone (us-central1-a) does not currently have enough capacity to host the new nodes.
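
To make the quota side of this concrete, the following is a minimal sketch of the arithmetic, assuming a surge upgrade that creates `max_surge` extra nodes at once. The `SurgePlan` type, its field names, and the sample numbers are hypothetical and chosen for illustration. A zero shortfall only means quota is sufficient; the zone can still be out of capacity.

```python
from dataclasses import dataclass


@dataclass
class SurgePlan:
    vcpus_per_node: int   # vCPUs of the pool's machine type (e.g. 4 for e2-standard-4)
    max_surge: int        # extra nodes created at once during the upgrade
    quota_limit: float    # regional CPU quota limit
    quota_usage: float    # regional CPU quota currently in use


def surge_vcpus_needed(plan: SurgePlan) -> int:
    """vCPUs the temporary surge nodes consume on top of the existing pool."""
    return plan.vcpus_per_node * plan.max_surge


def quota_shortfall(plan: SurgePlan) -> float:
    """vCPUs missing for the surge nodes; 0.0 means quota is sufficient."""
    headroom = plan.quota_limit - plan.quota_usage
    return max(0.0, surge_vcpus_needed(plan) - headroom)


if __name__ == "__main__":
    # Illustrative numbers only: 24 vCPU quota, 22 in use, one 4-vCPU surge node.
    plan = SurgePlan(vcpus_per_node=4, max_surge=1, quota_limit=24.0, quota_usage=22.0)
    print(quota_shortfall(plan))  # 2.0
```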

generic

Official Documentation

https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster

Workarounds

90% success Request a quota increase for Compute Engine CPUs in the affected region (us-central1 for zone us-central1-a) in the GCP Console: IAM & Admin > Quotas > 'CPUs' > Edit Quota. This resolves quota exhaustion only; it does not help if the zone itself is out of capacity.
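
Before filing the request, it can help to confirm how much CPU quota is actually in use. The sketch below assumes the JSON printed by `gcloud compute regions describe us-central1 --format=json`, which includes a `quotas` list with `metric`, `limit`, and `usage` fields. The `cpu_quota` helper and the `SAMPLE` input are hypothetical and for illustration only.

```python
import json


def cpu_quota(region_json: str) -> tuple[float, float]:
    """Return (limit, usage) for the CPUS metric in a region description."""
    region = json.loads(region_json)
    for quota in region.get("quotas", []):
        if quota.get("metric") == "CPUS":
            return float(quota["limit"]), float(quota["usage"])
    raise KeyError("CPUS quota not found in region description")


# Illustrative sample input, not real data.
SAMPLE = (
    '{"name": "us-central1", '
    '"quotas": [{"metric": "CPUS", "limit": 24.0, "usage": 22.0}]}'
)

if __name__ == "__main__":
    limit, usage = cpu_quota(SAMPLE)
    print(f"CPUS: {usage:.0f}/{limit:.0f} used, {limit - usage:.0f} free")
```

If the free amount is smaller than the vCPUs of the surge nodes, the quota increase is likely the right fix.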
85% success Add a node pool in a different zone of the same region that has available capacity, then migrate workloads to it by cordoning and draining the nodes in the old pool.
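
One possible shape of that migration, as a hedged sketch: the helper below only builds the command lines and does not run them. The cluster, pool, zone, and machine-type values are placeholders, and `migration_commands` is a hypothetical helper. It assumes the `cloud.google.com/gke-nodepool` node label that GKE sets on nodes to select the old pool.

```python
import shlex


def migration_commands(cluster: str, location: str, old_pool: str, new_pool: str,
                       new_zone: str, machine_type: str, num_nodes: int) -> list[list[str]]:
    """Build (but do not execute) the commands for moving to a pool in another zone."""
    selector = f"cloud.google.com/gke-nodepool={old_pool}"
    return [
        # 1. Create the replacement pool in a zone with capacity.
        ["gcloud", "container", "node-pools", "create", new_pool,
         f"--cluster={cluster}", f"--location={location}",
         f"--node-locations={new_zone}", f"--machine-type={machine_type}",
         f"--num-nodes={num_nodes}"],
        # 2. Stop new pods from landing on the old pool.
        ["kubectl", "cordon", "--selector", selector],
        # 3. Evict running pods so they reschedule onto the new pool.
        ["kubectl", "drain", "--selector", selector,
         "--ignore-daemonsets", "--delete-emptydir-data"],
    ]


if __name__ == "__main__":
    # Placeholder names; substitute your own cluster and pool.
    for cmd in migration_commands("my-cluster", "us-central1", "default-pool",
                                  "pool-b", "us-central1-b", "e2-standard-4", 3):
        print(shlex.join(cmd))
```

Printing the commands first keeps the step reviewable; the old pool can be deleted once the workloads are healthy on the new one.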
80% success Temporarily scale down workload replicas so that nodes can be removed from the cluster, which frees CPU quota for the surge nodes. Then perform the upgrade and restore the original replica counts.
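
A small sketch of the scale-down and restore step, assuming the workloads are Deployments managed with `kubectl scale`. The `scale_commands` helper, deployment names, and replica counts are hypothetical. Quota is only freed once the emptied nodes are actually removed, for example by the cluster autoscaler or a manual resize of the pool.

```python
def scale_commands(replicas: dict[str, int], reduced: int,
                   namespace: str = "default") -> tuple[list[list[str]], list[list[str]]]:
    """Return (scale_down, restore) kubectl commands for the given Deployments."""
    def scale(name: str, count: int) -> list[str]:
        return ["kubectl", "scale", "deployment", name,
                f"--replicas={count}", "-n", namespace]

    # Never scale a Deployment up while trying to free resources.
    down = [scale(name, min(reduced, current)) for name, current in replicas.items()]
    # Remember the original counts so they can be restored after the upgrade.
    restore = [scale(name, current) for name, current in replicas.items()]
    return down, restore


if __name__ == "__main__":
    # Placeholder Deployments and replica counts.
    down, restore = scale_commands({"web": 6, "worker": 1}, reduced=2)
    for cmd in down + restore:
        print(" ".join(cmd))
```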

Dead Ends

Common approaches that don't work:

85% fail
Increasing the node pool size: more nodes consume more quota and worsen the exhaustion. The upgrade needs additional quota for temporary nodes, not a larger pool.
60% fail
Deleting and recreating the node pool: deletion frees quota, but creating the new pool may still fail if the zone lacks capacity at that time.
70% fail
Switching to a smaller machine type: smaller instances may not meet workload requirements, and the zone may still lack capacity for any instance type.