GKE.NodePoolUpgrade.DiskPressure cloud system_error ai_generated partial

GKE node pool upgrade failed: Node had condition DiskPressure during upgrade

ID: cloud/gcp-kubernetes-node-pool-upgrade-failed-disk-pressure

Fix rate: 75%

Confidence: 82%

Evidence count: 1

First seen: 2024-03-20

Version compatibility

Version	Status	Introduced	Deprecated	Notes
GKE: >= 1.27	active	—	—	—
Kubernetes: >= 1.24	active	—	—	—
Google Cloud SDK: >= 400.0.0	active	—	—	—

Root cause analysis

During a GKE node pool upgrade, a node may fail to drain when local ephemeral storage (for example, container images and logs) fills most of the node's disk. The kubelet then reports the DiskPressure condition, which blocks the pod eviction that the upgrade's drain step relies on.

generic

Official documentation

https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster#troubleshooting

Solutions

First free disk space on the problematic node (via SSH) by removing unused container images with 'crictl rmi --prune'. On older Docker-based nodes the equivalent is 'docker image prune -a'; the GKE versions listed above use containerd, so Docker is not available there. Then manually evict the remaining pods with 'kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data'.
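The order of operations matters here: prune images first, then drain. A minimal Python sketch of that sequence follows. It assumes the node is reachable with 'gcloud compute ssh', that 'crictl' needs 'sudo' on the node, and that the node name and zone (both hypothetical below) are known. It only prints the commands unless dry_run is set to False.

```python
from __future__ import annotations

import shlex
import subprocess


def build_remediation_commands(node_name: str, zone: str) -> list[list[str]]:
    """Return the two commands in order: prune images on the node, then drain it."""
    prune = [
        "gcloud", "compute", "ssh", node_name,
        "--zone", zone,
        # Assumption: crictl requires root on the node, hence sudo.
        "--command", "sudo crictl rmi --prune",
    ]
    drain = [
        "kubectl", "drain", node_name,
        "--ignore-daemonsets",
        "--delete-emptydir-data",
    ]
    return [prune, drain]


def run_remediation(node_name: str, zone: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in build_remediation_commands(node_name, zone):
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence if a step fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical node name and zone, for illustration only.
    run_remediation("gke-example-pool-1234", "us-central1-a")
```

Keeping the command construction separate from execution makes it easy to review what would run on a production node before anything is changed.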

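Configure a PodDisruptionBudget for critical workloads and ensure node disk usage is below 80% before starting the upgrade. To check whether a node is already reporting the condition, run 'kubectl describe node <node-name> | grep DiskPressure'. This shows only the condition status, not the usage percentage, so use monitoring to track actual disk usage.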
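The pre-upgrade check and the PodDisruptionBudget can be sketched in Python as below. This is an illustration under assumptions, not a definitive tool: it expects the Node object from 'kubectl get node <node-name> -o json' and the kubelet summary from 'kubectl get --raw /api/v1/nodes/<node-name>/proxy/stats/summary', and it assumes the summary exposes node.fs.usedBytes and node.fs.capacityBytes. The function names, the 'app' label, and the sample values are hypothetical.

```python
from __future__ import annotations


def disk_pressure_status(node: dict) -> str:
    """Return the DiskPressure condition status of a Node object as a string."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "DiskPressure":
            return cond.get("status", "Unknown")
    return "Unknown"


def fs_usage_percent(summary: dict) -> float:
    """Compute node filesystem usage in percent from a kubelet summary."""
    fs = summary["node"]["fs"]
    return 100.0 * fs["usedBytes"] / fs["capacityBytes"]


def ready_for_upgrade(node: dict, summary: dict, max_usage_percent: float = 80.0) -> bool:
    """True only if DiskPressure is 'False' and usage is below the limit."""
    return (
        disk_pressure_status(node) == "False"
        and fs_usage_percent(summary) < max_usage_percent
    )


def build_pdb_manifest(name: str, app_label: str, min_available: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget selecting pods by an 'app' label."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


# Made-up sample data, for illustration only.
sample_node = {"status": {"conditions": [{"type": "DiskPressure", "status": "False"}]}}
sample_summary = {"node": {"fs": {"usedBytes": 50 * 2**30, "capacityBytes": 100 * 2**30}}}
print(ready_for_upgrade(sample_node, sample_summary))
```

The manifest dictionary can be serialized with json.dumps and passed to 'kubectl apply -f -', since kubectl accepts JSON as well as YAML.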

Use a node pool with local SSDs or larger boot disks (e.g., 100 GB) to provide more ephemeral storage, and make sure container logs are rotated to prevent log buildup. Note that 'gcplogs' is a Docker logging driver and applies only to Docker-based nodes.
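As a rough illustration of the larger-boot-disk option, the sketch below builds a 'gcloud container node-pools create' command with the '--disk-size' flag. The pool, cluster, and zone names are hypothetical, the helper function is not part of any Google tooling, and the command is only printed, not executed.

```python
from __future__ import annotations

import shlex


def build_node_pool_create_command(
    pool: str, cluster: str, zone: str, disk_size_gb: int = 100
) -> list[str]:
    """Build a gcloud command that creates a node pool with a given boot disk size."""
    if disk_size_gb <= 0:
        raise ValueError("disk_size_gb must be positive")
    return [
        "gcloud", "container", "node-pools", "create", pool,
        "--cluster", cluster,
        "--zone", zone,
        # Boot disk size in GB; 100 matches the example in the text.
        "--disk-size", str(disk_size_gb),
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmd = build_node_pool_create_command("big-disk-pool", "example-cluster", "us-central1-a")
    print(shlex.join(cmd))
```

Because boot disk size is fixed when a node is created, a new pool with the larger size is a common way to apply it; workloads then need to be moved off the old pool.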

Ineffective attempts

Common but ineffective approaches:

60% failure rate
Force deletion can cause data loss for stateful workloads and does not resolve the underlying disk pressure issue, which can persist on new nodes.
70% failure rate
Resizing does not take effect until the node is recreated; during the upgrade, the old node still has DiskPressure and cannot drain.
90% failure rate
Manually editing the kubelet configuration on GKE-managed nodes is not supported, and the change may be reverted.