# GKE node pool upgrade failed: nodes report disk pressure during the upgrade

- **ID:** `cloud/gcp-kubernetes-node-pool-upgrade-failed-disk-pressure`
- **Domain:** cloud
- **Category:** system_error
- **Error code:** `GKE.NodePoolUpgrade.DiskPressure`
- **Verification level:** ai_generated
- **Fix rate:** 75%

## Root cause

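During a GKE node pool upgrade, a node can fail to drain when local ephemeral storage (for example, container images and logs) consumes most of the node's disk capacity. The kubelet then reports the DiskPressure condition on the node, and pod eviction is blocked.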
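As a rough illustration of how the condition can be detected, the sketch below scans a node list for nodes whose `DiskPressure` condition is `"True"`. It assumes input shaped like the output of `kubectl get nodes -o json`; the function name and the sample node names are hypothetical.

```python
import json


def nodes_with_disk_pressure(node_list: dict) -> list[str]:
    """Return the names of nodes whose DiskPressure condition is "True"."""
    affected = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        for cond in conditions:
            if cond.get("type") == "DiskPressure" and cond.get("status") == "True":
                affected.append(node["metadata"]["name"])
                break
    return affected


# Hypothetical sample, trimmed to the fields the function reads.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "gke-pool-a-node-1"},
     "status": {"conditions": [
       {"type": "DiskPressure", "status": "True"},
       {"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "gke-pool-a-node-2"},
     "status": {"conditions": [
       {"type": "DiskPressure", "status": "False"},
       {"type": "Ready", "status": "True"}]}}
  ]
}
""")

print(nodes_with_disk_pressure(sample))  # ['gke-pool-a-node-1']
```

In practice the sample could be replaced with `json.load(sys.stdin)` and the script fed from `kubectl get nodes -o json`.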

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| GKE: >= 1.27 | active | — | — |
| Kubernetes: >= 1.24 | active | — | — |
| Google Cloud SDK: >= 400.0.0 | active | — | — |

## Solutions

1. Free disk space on the affected node first: connect over SSH and remove unused container images with `crictl rmi --prune`. GKE 1.24 and later run containerd rather than Docker, so `docker image prune -a` only applies to older Docker-based node images. Then evict the pods manually with `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data`.
2. Configure a PodDisruptionBudget for critical workloads and make sure node disk usage is below 80% before starting the upgrade. `kubectl describe node <node-name> | grep DiskPressure` shows whether the node currently reports the DiskPressure condition; use monitoring to track the actual disk usage.
3. Use a node pool with local SSDs or larger boot disks (e.g., 100 GB) to provide more ephemeral storage, and make sure logs are rotated to prevent log buildup (`gcplogs` is a Docker logging driver, so it only applies to Docker-based nodes).
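The 80% pre-upgrade check could be sketched roughly as follows. This assumes per-node data shaped like the kubelet summary API response (`kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary"`), reading only `node.fs.usedBytes` and `node.fs.capacityBytes`. The function names, node names, and sample values are hypothetical.

```python
GIB = 1024 ** 3


def fs_usage_ratio(summary: dict) -> float:
    """Fraction of the node filesystem in use, from a kubelet summary dict."""
    fs = summary["node"]["fs"]
    return fs["usedBytes"] / fs["capacityBytes"]


def safe_to_upgrade(summaries: dict[str, dict], threshold: float = 0.80) -> dict[str, bool]:
    """Map each node name to True if its disk usage is below the threshold."""
    return {name: fs_usage_ratio(s) < threshold for name, s in summaries.items()}


# Hypothetical sample values, trimmed to the fields the functions read.
sample_summaries = {
    "node-a": {"node": {"fs": {"usedBytes": 40 * GIB, "capacityBytes": 100 * GIB}}},
    "node-b": {"node": {"fs": {"usedBytes": 90 * GIB, "capacityBytes": 100 * GIB}}},
}

for name, ok in safe_to_upgrade(sample_summaries).items():
    print(f"{name}: {'ok' if ok else 'free disk space before upgrading'}")
```

Nodes flagged by this check are candidates for image cleanup or a larger boot disk before the upgrade starts.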

## Failed attempts

- Force deletion can cause data loss for stateful workloads and doesn't resolve the underlying disk pressure, which can recur on new nodes. (60% failure rate)
- Resizing doesn't take effect until the node is recreated; during the upgrade, the old node still reports DiskPressure and can't drain. (70% failure rate)
- Custom kubelet configuration changes made directly on a node are not supported on GKE managed node pools and may be reverted. (90% failure rate)
