ES_PERSISTENT_TASK_ASSIGN_FAIL elasticsearch runtime_error ai_generated partial

PersistentTaskException: task [cluster:admin/persistent/assignment] failed to assign task [task_id_123] to node [node-1] after [5] attempts

ID: elasticsearch/persistent-task-assignment-failure

Fix Rate: 82%
Confidence: 85%
Evidence: 1
First Seen: 2024-06-15

Version Compatibility

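| Version | Status | Introduced | Deprecated | Notes |
|---------|--------|------------|------------|-------|
| 7.17.0  | active |            |            |       |
| 8.11.0  | active |            |            |       |
| 8.12.0  | active |            |            |       |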

Root Cause

A persistent task (e.g., a rollup job, transform, or machine learning job) cannot be assigned to any available node. Typical causes are node attribute mismatches, resource constraints, or cluster topology changes during a rolling restart.
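
To see which persistent tasks are currently unassigned and why, one option is to read the persistent task metadata from the cluster state (`GET _cluster/state/metadata?filter_path=metadata.persistent_tasks`). The sketch below only parses such a response and makes no network calls. The field layout (`tasks[].assignment.executor_node` and `explanation`) is an assumption about how the cluster state is serialized and may differ between versions; the helper name and the sample payload, including the explanation string, are made up for illustration.

```python
from typing import Any


def unassigned_tasks(cluster_state: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (task_id, explanation) for each persistent task with no executor node."""
    tasks = (
        cluster_state.get("metadata", {})
        .get("persistent_tasks", {})
        .get("tasks", [])
    )
    result = []
    for task in tasks:
        assignment = task.get("assignment") or {}
        # A task counts as unassigned when no executor node is recorded.
        if assignment.get("executor_node") is None:
            result.append(
                (task.get("id", "<unknown>"), assignment.get("explanation", ""))
            )
    return result


# Hypothetical response body; the layout is assumed, not taken from a real cluster.
sample_state = {
    "metadata": {
        "persistent_tasks": {
            "tasks": [
                {
                    "id": "task_id_123",
                    "assignment": {
                        "executor_node": None,
                        "explanation": "no node matches the required attributes",
                    },
                },
                {
                    "id": "task_id_456",
                    "assignment": {"executor_node": "node-1", "explanation": ""},
                },
            ]
        }
    }
}

for task_id, why in unassigned_tasks(sample_state):
    print(f"{task_id}: {why}")
```

The explanation string, when present, is usually the quickest hint as to whether the cause is an attribute mismatch, a capacity limit, or a node that left the cluster.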

Official Documentation

https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html

Workarounds

  1. 85% success Ensure all nodes have the required attributes set in `elasticsearch.yml` (e.g., `node.attr.rack: r1`) and restart nodes one by one, waiting for shard recovery after each restart.
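  2. 75% success Restart the task through the API of the feature that owns it. The `_tasks` API supports `POST _tasks/<task_id>/_cancel` but has no `_retry` endpoint and offers no way to reassign a persistent task. For a rollup job, for example, use `POST _rollup/job/<job_id>/_stop` followed by `POST _rollup/job/<job_id>/_start`. Also confirm that `cluster.persistent_tasks.allocation.enable` is set to `all`, so that the master periodically re-evaluates unassigned tasks (interval controlled by `cluster.persistent_tasks.allocation.recheck_interval`).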
  3. 80% success Check node resource availability (CPU, memory) and scale up or add more nodes to the cluster to free up capacity.
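
The attribute check in step 1 can be partly automated. The following sketch takes rows shaped like the output of `GET _cat/nodeattrs?format=json` and reports nodes that lack a required attribute or carry an unexpected value. The row fields (`node`, `attr`, `value`) are assumed to match the default columns of that API; the helper name, node names, and sample rows are hypothetical.

```python
def nodes_missing_attr(rows, all_nodes, attr, expected=None):
    """Return sorted node names that lack `attr` or whose value differs from `expected`."""
    # Map node name -> value for the attribute of interest.
    values = {row["node"]: row["value"] for row in rows if row["attr"] == attr}
    flagged = []
    for node in all_nodes:
        if node not in values:
            flagged.append(node)  # attribute not set on this node
        elif expected is not None and values[node] != expected:
            flagged.append(node)  # attribute set, but to a different value
    return sorted(flagged)


# Hypothetical rows, as if parsed from the JSON response.
sample_rows = [
    {"node": "node-1", "attr": "rack", "value": "r1"},
    {"node": "node-2", "attr": "zone", "value": "a"},
    {"node": "node-3", "attr": "rack", "value": "r2"},
]
cluster_nodes = ["node-1", "node-2", "node-3"]

print(nodes_missing_attr(sample_rows, cluster_nodes, "rack"))        # ['node-2']
print(nodes_missing_attr(sample_rows, cluster_nodes, "rack", "r1"))  # ['node-2', 'node-3']
```

Running a check like this before each restart in a rolling upgrade helps confirm that the node about to rejoin will still be eligible for the task.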

Dead Ends

Common approaches that don't work:

  1. 85% fail Forcefully restarting nodes causes more assignment failures, because tasks lose their target nodes and cannot be reassigned mid-restart.
  2. 75% fail Retrying the assignment does not fix the underlying node attribute or resource issue; it only delays the eventual failure.
  3. 90% fail Removing the task discards its progress, and the feature that owns the task may recreate it, which triggers the same error again.