GKE node pool upgrade failed: Node had condition DiskPressure during upgrade
ID: cloud/gcp-kubernetes-node-pool-upgrade-failed-disk-pressure
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| GKE: >= 1.27 | active | — | — | — |
| Kubernetes: >= 1.24 | active | — | — | — |
| Google Cloud SDK: >= 400.0.0 | active | — | — | — |
Root cause analysis
During a GKE node pool upgrade, a node may fail to drain because local ephemeral storage usage (for example, container images and logs) has filled the node's disk. The kubelet then reports the DiskPressure condition, and pod eviction is blocked.
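To confirm that this is the cause, it can help to list which nodes currently report the condition. The following is a minimal sketch, assuming `kubectl` is already configured for the affected cluster; the function name `disk_pressure_nodes` is hypothetical and only for illustration.

```shell
# Print the names of nodes whose DiskPressure condition is "True".
# Assumption: kubectl is already pointed at the affected cluster.
disk_pressure_nodes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' |
    awk -F '\t' '$2 == "True" { print $1 }'
}
```

Calling `disk_pressure_nodes` with no arguments prints one node name per line; empty output means no node is reporting DiskPressure at that moment.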
Official documentation
https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster#troubleshooting
Solutions
- First free disk space on the affected node (via SSH) by removing unused container images with 'crictl rmi --prune'. GKE nodes on 1.24 and later run containerd, so 'docker image prune -a' applies only to older Docker-based nodes. Then manually evict the pods with 'kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data'.
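- Configure a PodDisruptionBudget for critical workloads, and make sure node disk usage is below 80% before starting the upgrade. To check whether a node is already reporting disk pressure, run 'kubectl describe node <node-name> | grep DiskPressure'. This shows the condition status rather than the usage percentage, so rely on your monitoring for the actual disk usage.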
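- Use a node pool with local SSDs or larger boot disks (e.g., 100 GB) to provide more ephemeral storage, and make sure logs are rotated so they do not build up on the boot disk. Note that 'gcplogs' is a Docker logging driver, so enabling it applies only where Docker is the container runtime.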
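The solutions above can be sketched as a handful of shell helpers. This is an illustration under stated assumptions rather than a definitive procedure: the function names are hypothetical, `gcloud` and `kubectl` are assumed to be authenticated against the right project and cluster, the node pool is assumed to be zonal, and `crictl` is used because current GKE node images run containerd.

```shell
# Step 1: free disk space on the node by pruning unused images over SSH.
# $1 = node (Compute Engine instance) name, $2 = zone.
prune_node_images() {
  gcloud compute ssh "$1" --zone "$2" --command 'sudo crictl rmi --prune'
}

# Step 2: drain the node once disk pressure has cleared.
# $1 = node name.
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Before an upgrade: protect a critical workload with a PodDisruptionBudget.
# $1 = PDB name, $2 = label selector, $3 = minimum available pods.
protect_workload() {
  kubectl create poddisruptionbudget "$1" --selector "$2" --min-available "$3"
}

# Longer term: add a node pool with a larger boot disk.
# $1 = pool name, $2 = cluster name, $3 = zone, $4 = boot disk size in GB.
create_big_disk_pool() {
  gcloud container node-pools create "$1" --cluster "$2" --zone "$3" --disk-size "$4"
}
```

A typical recovery would run `prune_node_images <node-name> <zone>` followed by `drain_node <node-name>`, then retry the upgrade. The other two helpers are preventive and are meant to be run before the next upgrade, not during a failed one.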
Ineffective attempts
Common but ineffective approaches:
- 60% failure rate
Force deletion can cause data loss for stateful workloads and does not resolve the underlying disk pressure, which can recur on the new nodes.
- 70% failure rate
Resizing does not take effect until the node is recreated; during the upgrade, the old node still has DiskPressure and cannot drain.
- 90% failure rate
GKE-managed node pools do not allow arbitrary custom kubelet configuration; a change made directly on a node is not supported and may be reverted.