Startup probe failed: HTTP probe failed with statuscode: 503
ID: cloud/gcp-cloud-run-container-startup-probe-failure
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| Cloud Run (managed) 2024 | active | — | — | — |
| Knative Serving 1.12 | active | — | — | — |
| gcloud CLI 462 | active | — | — | — |
Root cause analysis
Cloud Run's startup probe endpoint returns 503 during the startup window, usually because the application takes longer to initialize than the probe configuration allows. The window is roughly initialDelaySeconds plus failureThreshold × periodSeconds; once the probe has failed failureThreshold times in a row, the container is shut down.
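The startup window can be estimated with a few lines of arithmetic. The following is a minimal sketch, not an official formula: the class and function names are hypothetical, the two configurations are illustrative values rather than Cloud Run defaults, and the estimate ignores the per-request timeout and scheduling jitter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupProbe:
    """Probe settings (hypothetical helper; fields mirror the probe options)."""
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def startup_budget_seconds(self) -> int:
        # Approximation: delay before the first probe plus the time it
        # takes to accumulate failure_threshold consecutive failures.
        return self.initial_delay_seconds + self.failure_threshold * self.period_seconds


def covers(probe: StartupProbe, measured_startup_seconds: float) -> bool:
    """True if the measured startup time fits inside the estimated budget."""
    return measured_startup_seconds < probe.startup_budget_seconds()


# Illustrative configurations (assumed values, not defaults).
tight = StartupProbe(initial_delay_seconds=0, period_seconds=10, failure_threshold=3)
roomy = StartupProbe(initial_delay_seconds=60, period_seconds=10, failure_threshold=6)
```

Under these assumptions, an application that needs 45 seconds to start does not fit the `tight` budget (30 seconds) but fits the `roomy` one (120 seconds). Because the estimate is approximate, leave some headroom rather than sizing the budget to the exact measured startup time.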
Official documentation
https://cloud.google.com/run/docs/configuring/healthchecks#startup-probes
Solutions
- Give the application more time to start by raising the startup probe's initialDelaySeconds, periodSeconds, or failureThreshold. For example, in gcloud releases that support the `--startup-probe` flag: `gcloud run services update my-service --startup-probe=httpGet.path=/health,initialDelaySeconds=60,periodSeconds=10,failureThreshold=6`. The flag syntax has varied across gcloud releases, so confirm it with `gcloud run services update --help`. Also make sure the /health endpoint returns 200 only after initialization is complete.
- Implement a health check endpoint that returns 503 until the application is ready, then 200. For example, in a Python Flask app, register a `/health` route whose handler is `return ('OK', 200) if app_ready else ('Service Unavailable', 503)`, and set `app_ready = True` once initialization completes.
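The readiness-gated endpoint described above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the Flask code from the entry: `HealthHandler`, `initialize`, `make_server`, and `main` are hypothetical names, the `/health` path follows the example above, and the slow startup work is reduced to a placeholder.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once initialization finishes; the probe handler only reads it.
app_ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        if app_ready.is_set():
            status, body = 200, b"OK"
        else:
            status, body = 503, b"Service Unavailable"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe requests out of the log output


def initialize() -> None:
    # Placeholder for slow startup work (loading models, warming caches).
    app_ready.set()


def make_server(port: int) -> HTTPServer:
    # Bind before initializing so probes get a 503 instead of a refused connection.
    return HTTPServer(("0.0.0.0", port), HealthHandler)


def main() -> None:
    # Cloud Run passes the listening port in the PORT environment variable.
    server = make_server(int(os.environ.get("PORT", "8080")))
    threading.Thread(target=initialize, daemon=True).start()
    server.serve_forever()
```

Calling `main()` at process start begins serving immediately and runs initialization in a background thread, so the probe sees 503 responses until `app_ready` is set and 200 responses afterwards.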
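Ineffective attempts
Common but ineffective approaches:
- Increasing CPU or memory (60% failure rate): more resources may not reduce startup time if the application has a fixed initialization delay (e.g., loading ML models).
- Removing the startup probe (80% failure rate): without a custom probe, Cloud Run falls back to its default TCP startup probe, which only checks that the port is open, so a slow-starting container may still receive traffic or be shut down before it finishes initializing.
- Shortening the probe period (75% failure rate): too short a period causes rapid retries that may overwhelm the application during startup, worsening the issue.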
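The cost of a very short period can be made concrete with a rough count of the probe requests that arrive while the application is still starting. This is a sketch under simplifying assumptions: the first probe fires when the initial delay ends, later probes follow at exactly one period apart, and `failing_probe_requests` is a hypothetical helper name.

```python
import math


def failing_probe_requests(
    startup_seconds: float,
    period_seconds: int,
    initial_delay_seconds: int = 0,
) -> int:
    """Approximate number of probe requests sent before the app is ready."""
    # Only the part of startup that overlaps with active probing counts.
    probing_time = max(0.0, startup_seconds - initial_delay_seconds)
    return math.ceil(probing_time / period_seconds)


# A 45-second startup under three illustrative configurations.
every_second = failing_probe_requests(45, period_seconds=1)
every_ten = failing_probe_requests(45, period_seconds=10)
delayed = failing_probe_requests(45, period_seconds=10, initial_delay_seconds=30)
```

With these assumed numbers, a 1-second period sends 45 requests into a process that is still initializing, a 10-second period sends 5, and adding a 30-second initial delay reduces that to 2.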