elasticsearch runtime_error ai_generated partial

ElasticsearchException: failed to reroute after shard allocation

ID: elasticsearch/transient-cluster-routing-error

Also available as: JSON · Markdown · 中文
75%Fix Rate
85%Confidence
1Evidence
2024-03-15First Seen

Version Compatibility

VersionStatusIntroducedDeprecatedNotes
7.17.0 active
8.11.0 active
8.12.0 active

Root Cause

A transient cluster routing failure occurred when the master node attempted to reassign shards after a node join or leave, often due to a stale cluster state or network partition.

generic

中文

当主节点在节点加入或离开后尝试重新分配分片时发生的临时集群路由故障,通常由于过时的集群状态或网络分区引起。

Official Documentation

https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html

Workarounds

  1. 80% success Force a cluster state update by temporarily disabling shard allocation, then re-enabling it: `PUT _cluster/settings {"transient": {"cluster.routing.allocation.enable": "none"}}` wait 30 seconds, then `PUT _cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}`.
    Force a cluster state update by temporarily disabling shard allocation, then re-enabling it: `PUT _cluster/settings {"transient": {"cluster.routing.allocation.enable": "none"}}` wait 30 seconds, then `PUT _cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}`.
  2. 70% success Clear the stale cluster state by running `POST /_cluster/reroute?retry_failed=true` to retry failed allocation commands.
    Clear the stale cluster state by running `POST /_cluster/reroute?retry_failed=true` to retry failed allocation commands.
  3. 75% success If the error persists, take a snapshot of the cluster state with `POST /_snapshot/repo/backup` and then restart the master node with `--cluster.routing.allocation.disk.threshold_enabled=false` to bypass disk checks temporarily.
    If the error persists, take a snapshot of the cluster state with `POST /_snapshot/repo/backup` and then restart the master node with `--cluster.routing.allocation.disk.threshold_enabled=false` to bypass disk checks temporarily.

中文步骤

  1. Force a cluster state update by temporarily disabling shard allocation, then re-enabling it: `PUT _cluster/settings {"transient": {"cluster.routing.allocation.enable": "none"}}` wait 30 seconds, then `PUT _cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}`.
  2. Clear the stale cluster state by running `POST /_cluster/reroute?retry_failed=true` to retry failed allocation commands.
  3. If the error persists, take a snapshot of the cluster state with `POST /_snapshot/repo/backup` and then restart the master node with `--cluster.routing.allocation.disk.threshold_enabled=false` to bypass disk checks temporarily.

Dead Ends

Common approaches that don't work:

  1. 85% fail

    Restart does not resolve stale cluster state or network issues; the master will reattempt the same failed reroute.

  2. 70% fail

    Manual reroute bypasses cluster state validation, potentially causing duplicate shard copies or lost primary shards.

  3. 60% fail

    Higher concurrency may amplify the impact of stale routing decisions, leading to more failed reroute attempts.