elasticsearch
runtime_error
ai_generated
partial
Elasticsearch exception: reroute retry failed after shard allocation
ElasticsearchException: failed to reroute after shard allocation
ID: elasticsearch/transient-cluster-routing-error
Fix rate: 75%
Confidence: 85%
Evidence count: 1
First seen: 2024-03-15
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| 7.17.0 | active | — | — | — |
| 8.11.0 | active | — | — | — |
| 8.12.0 | active | — | — | — |
Root cause analysis
A transient cluster routing failure that occurs when the master node tries to reassign shards after a node joins or leaves the cluster, usually caused by a stale cluster state or a network partition.
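One way to check whether an unassigned shard matches this root cause is to look at the response of `GET _cluster/allocation/explain`. The sketch below only condenses such a response into a few fields. The helper name `summarize_allocation_explain` is hypothetical, and the field names it reads are assumed to match the documented response shape; verify them against your Elasticsearch version.

```python
def summarize_allocation_explain(explain):
    """Condense a _cluster/allocation/explain response (already parsed
    from JSON into a dict) into the fields most relevant to this error.

    Hypothetical helper: missing fields are tolerated rather than
    treated as errors, since the response shape is an assumption here.
    """
    info = explain.get("unassigned_info", {})
    return {
        # e.g. "my-index[0]"
        "shard": f'{explain.get("index")}[{explain.get("shard")}]',
        "primary": explain.get("primary"),
        "state": explain.get("current_state"),
        # e.g. NODE_LEFT or ALLOCATION_FAILED
        "reason": info.get("reason"),
        # only reported by Elasticsearch after failed allocations
        "failed_attempts": info.get("failed_allocation_attempts", 0),
        "can_allocate": explain.get("can_allocate"),
    }
```

A `reason` of `NODE_LEFT` or `ALLOCATION_FAILED` would be consistent with the node join/leave scenario described above, though it does not by itself prove a stale cluster state or a network partition.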
Official documentation
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html
Solutions
- Force a cluster state update by temporarily disabling shard allocation and then re-enabling it: send `PUT _cluster/settings {"transient": {"cluster.routing.allocation.enable": "none"}}`, wait 30 seconds, then send `PUT _cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}`.
- Run `POST /_cluster/reroute?retry_failed=true` to retry allocation of shards that are blocked after repeated allocation failures.
- If the error persists, take a snapshot with `POST /_snapshot/repo/backup` (`repo` and `backup` stand for your repository and snapshot names; snapshots include the cluster state by default), then restart the master node with `-E cluster.routing.allocation.disk.threshold_enabled=false` to bypass disk checks temporarily.
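The first two steps above can be scripted. The following is a minimal sketch using only the Python standard library. It assumes an unauthenticated cluster at `http://localhost:9200`, which is an assumption rather than something stated in this entry; clusters with security enabled need credentials and TLS handling that are not shown. The function names are hypothetical.

```python
import json
import time
import urllib.request


def allocation_toggle_requests():
    """Return the (method, path, body) steps from the list above, in order."""
    def settings(value):
        return ("PUT", "/_cluster/settings",
                {"transient": {"cluster.routing.allocation.enable": value}})
    return [
        settings("none"),   # disable shard allocation
        settings("all"),    # re-enable shard allocation
        ("POST", "/_cluster/reroute?retry_failed=true", None),
    ]


def send(base_url, method, path, body=None, timeout=30):
    """Send one request to the cluster and return the decoded JSON reply."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    req = urllib.request.Request(base_url + path, data=data,
                                 method=method, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run(base_url="http://localhost:9200", pause_seconds=30):
    """Disable allocation, wait, re-enable it, then retry failed shards.

    base_url is an assumed address; adjust it for your cluster.
    """
    disable, enable, retry = allocation_toggle_requests()
    send(base_url, *disable)
    time.sleep(pause_seconds)  # the 30-second wait from the first step
    send(base_url, *enable)
    return send(base_url, *retry)
```

Calling `run()` performs the requests against a live cluster, so try it on a non-production cluster first. Note that if the script fails after the first request, allocation stays disabled until the `"all"` setting is applied manually.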
Ineffective attempts
Common but ineffective approaches:
- 85% failure rate: Restart does not resolve stale cluster state or network issues; the master will reattempt the same failed reroute.
- 70% failure rate: Manual reroute bypasses cluster state validation, potentially causing duplicate shard copies or lost primary shards.
- 60% failure rate: Higher concurrency may amplify the impact of stale routing decisions, leading to more failed reroute attempts.