GKE node pool upgrade failed: Node had condition DiskPressure during upgrade
ID: cloud/gcp-kubernetes-node-pool-upgrade-failed-disk-pressure
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| GKE: >= 1.27 | active | — | — | — |
| Kubernetes: >= 1.24 | active | — | — | — |
| Google Cloud SDK: >= 400.0.0 | active | — | — | — |
Root Cause
During a GKE node pool upgrade, a node may fail to drain when local ephemeral storage usage (for example container images and logs) leaves too little free space on the node's disk. The kubelet then reports the DiskPressure condition on the node, and the upgrade fails because pods cannot be evicted cleanly.
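As a rough illustration of how the condition described above can be detected, the following sketch scans the JSON produced by `kubectl get nodes -o json` and lists nodes whose DiskPressure condition is "True". The function name and the sample node names are hypothetical; only the `items[].status.conditions[]` layout comes from the Kubernetes Node object.

```python
import json


def nodes_with_disk_pressure(node_list: dict) -> list:
    """Return names of nodes whose DiskPressure condition has status "True".

    `node_list` is assumed to be the parsed output of
    `kubectl get nodes -o json` (a NodeList object).
    """
    flagged = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        for cond in conditions:
            if cond.get("type") == "DiskPressure" and cond.get("status") == "True":
                flagged.append(node["metadata"]["name"])
                break  # one match per node is enough
    return flagged


# Hypothetical sample data standing in for real kubectl output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "pool-a-node-1"},
     "status": {"conditions": [{"type": "DiskPressure", "status": "True"}]}},
    {"metadata": {"name": "pool-a-node-2"},
     "status": {"conditions": [{"type": "DiskPressure", "status": "False"}]}}
  ]
}
""")

print(nodes_with_disk_pressure(sample))
```

In practice the input could be obtained by saving `kubectl get nodes -o json` to a file and loading it with `json.load`.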
Official Documentation
https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster#troubleshooting
Workarounds
-
75% success
Free disk space on the affected node first: connect over SSH and remove unused container images with 'crictl rmi --prune' (or 'docker image prune -a' on nodes that still run Docker; GKE node images at the versions listed above use containerd). Then evict the pods manually with 'kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data'.
-
80% success Configure a PodDisruptionBudget for critical workloads and ensure node disk usage is below 80% before starting the upgrade. Use monitoring to check disk usage: 'kubectl describe node <node-name> | grep DiskPressure'.
Configure a PodDisruptionBudget for critical workloads and ensure node disk usage is below 80% before starting the upgrade. Use monitoring to check disk usage: 'kubectl describe node <node-name> | grep DiskPressure'.
-
85% success
Use a node pool with local SSDs or larger boot disks (e.g., 100 GB) to provide more ephemeral storage, and enable 'gcplogs' with log rotation to prevent log buildup.
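The 80% guideline above can be checked with a small script. This is a minimal sketch, assuming the input is the parsed JSON from the kubelet Summary API (reachable with `kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary"`) and that the node filesystem reported under `node.fs` is the disk that matters for the upgrade. The function names and sample numbers are hypothetical.

```python
def fs_usage_fraction(summary: dict) -> float:
    """Return used/capacity for the node filesystem in a Summary API response."""
    fs = summary["node"]["fs"]
    return fs["usedBytes"] / fs["capacityBytes"]


def below_threshold(summary: dict, threshold: float = 0.80) -> bool:
    """True when disk usage is under the threshold (80% by default)."""
    return fs_usage_fraction(summary) < threshold


# Hypothetical sample: a 100 GB boot disk with 85 GB in use.
sample_summary = {
    "node": {
        "fs": {
            "capacityBytes": 100_000_000_000,
            "usedBytes": 85_000_000_000,
        }
    }
}

usage = fs_usage_fraction(sample_summary)
print(f"disk usage: {usage:.0%}, ok to upgrade: {below_threshold(sample_summary)}")
```

If images live on a separate filesystem, the same calculation would need to be repeated for that filesystem as well; the sketch only covers the single-disk case.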
Dead Ends
Common approaches that don't work:
-
60% fail
Force deletion can cause data loss for stateful workloads and does not resolve the underlying disk pressure, which can recur on the new nodes.
-
70% fail
Resizing doesn't take effect until the node is recreated; during upgrade, the old node still has DiskPressure and can't drain.
-
90% fail
GKE managed node pools don't allow custom kubelet configurations; this change is not supported and may be reverted.