elasticsearch system_error ai_generated true

ShardLockObtainFailedException: [my_index][0] obtaining shard lock failed

ID: elasticsearch/primary-shard-not-allocated-due-to-shard-lock

Fix Rate: 82%
Confidence: 85%
Evidence: 1
First Seen: 2024-03-12

Version Compatibility

Version   Status   Introduced   Deprecated   Notes
7.17.10   active
8.6.2     active
8.11.0    active

Root Cause

A shard lock cannot be acquired because the shard is still being recovered or a previous node crash left a stale lock file on disk.
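One way to confirm that an unassigned shard is failing for this reason is the cluster allocation explain API, which reports why a shard is unassigned. The sketch below only builds the request and inspects a response that has already been decoded; the base URL, the function names, and the assumption that the exception name appears in `unassigned_info.details` are illustrative and not taken from this entry.

```python
import json
import urllib.request


def build_explain_request(base_url, index, shard, primary=True):
    """Build (but do not send) a cluster allocation explain request."""
    body = {"index": index, "shard": shard, "primary": primary}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/allocation/explain",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def mentions_shard_lock(explanation):
    """Return True if a decoded explain response names the shard lock failure.

    Assumption: the failure text is reported under unassigned_info.details.
    """
    details = explanation.get("unassigned_info", {}).get("details") or ""
    return "ShardLockObtainFailedException" in details
```

To actually send the request, pass it to `urllib.request.urlopen` and decode the JSON body, adding whatever authentication and TLS settings the cluster requires.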

generic


Official Documentation

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-recovery.html

Workarounds

  1. 90% success Remove the stale shard lock file manually: find the shard directory under <path.data>/nodes/0/indices/<index-uuid>/0/ (the node's data path, not the ES_PATH_CONF config directory) and delete the 'index.lock' or 'shard.lock' file, then restart the node.
  2. 75% success Reroute the unassigned shard using the Cluster Reroute API. Setting accept_data_loss to true acknowledges that any writes missing from the stale copy are lost:
    POST /_cluster/reroute { "commands": [{ "allocate_stale_primary": { "index": "my_index", "shard": 0, "node": "my_node", "accept_data_loss": true } }] }
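The reroute workaround above can be scripted. The following is a minimal sketch that builds the same `allocate_stale_primary` command and wraps it in an HTTP request without sending it. The function names and the base URL are hypothetical; the index, shard, and node values are the placeholders used in this entry.

```python
import json
import urllib.request


def build_allocate_stale_primary(index, shard, node):
    """Return the reroute body that promotes a stale shard copy to primary.

    accept_data_loss must be true for Elasticsearch to accept the command.
    """
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


def build_reroute_request(base_url, body):
    """Build (but do not send) a POST /_cluster/reroute request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/reroute",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

For example, `build_reroute_request("http://localhost:9200", build_allocate_stale_primary("my_index", 0, "my_node"))` produces a request that can be passed to `urllib.request.urlopen` once the data-loss implications have been reviewed.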


Dead Ends

Common approaches that don't work:

  1. 70% fail Deleting the index: this loses all data and may not resolve the underlying lock issue if the file system is corrupted.
  2. 85% fail Restarting the node without removing the lock: a restart alone does not remove stale lock files; the lock file persists and the same error occurs again.
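To check whether lock files are still present in a shard directory after a restart, a read-only listing is enough. This is a minimal sketch: the helper name is hypothetical, the shard directory must be supplied by the operator, and it simply matches any file ending in `.lock` rather than assuming specific file names. It deletes nothing.

```python
from pathlib import Path


def find_lock_files(shard_dir):
    """Return a sorted list of *.lock files found under shard_dir.

    Read-only: nothing is modified or deleted. A missing directory
    yields an empty list.
    """
    root = Path(shard_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.lock") if p.is_file())
```

Running it before and after a node restart on the same shard directory shows whether the restart changed anything on disk.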