ERR redis network_error ai_generated partial

ERR CLUSTERDOWN The cluster is down - node timeout, replica not synced

ID: redis/cluster-node-timeout-replica

Fix Rate: 82%
Confidence: 87%
Evidence: 1
First Seen: 2024-01-18

Version Compatibility

Version  Status  Introduced  Deprecated  Notes
7.2.0    active  -           -           -
7.4.0    active  -           -           -
8.0.0    active  -           -           -

Root Cause

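A master node was unreachable for longer than cluster-node-timeout, and its replica was not sufficiently synced to take over. The master's hash slots are left without a serving node (or the cluster loses its majority of reachable masters), so the remaining nodes mark the cluster as down and reject commands with CLUSTERDOWN.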
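To confirm this diagnosis, the CLUSTER INFO reply from any reachable node can be inspected. The following is a minimal sketch that parses that reply as plain text; the helper names parse_cluster_info and diagnose_cluster are hypothetical, and the sketch assumes the reply uses the standard cluster_state, cluster_slots_assigned, and cluster_slots_fail fields.

```python
TOTAL_SLOTS = 16384  # Redis Cluster always has 16384 hash slots


def parse_cluster_info(raw: str) -> dict:
    """Turn the `key:value` lines of a CLUSTER INFO reply into a dict of strings."""
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip blank or malformed lines
            fields[key] = value
    return fields


def diagnose_cluster(raw: str) -> str:
    """Summarize why a node reports the cluster as down (hypothetical helper)."""
    info = parse_cluster_info(raw)
    state = info.get("cluster_state", "unknown")
    if state == "ok":
        return "ok"
    assigned = int(info.get("cluster_slots_assigned", 0))
    failed = int(info.get("cluster_slots_fail", 0))
    if assigned < TOTAL_SLOTS:
        return f"{state}: {TOTAL_SLOTS - assigned} slots unassigned"
    if failed > 0:
        return f"{state}: {failed} slots owned by failed nodes"
    return f"{state}: no slot problem visible from this node"
```

The input would typically be the text printed by redis-cli cluster info on one node. Each node reports only its own view, so it may be worth running the check against more than one node.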

generic

Official Documentation

https://redis.io/docs/manual/scaling/

Workarounds

  1. 80% success Re-attach the replica to its master so that it resynchronizes: run CLUSTER REPLICATE <master-node-id> on the replica. Then check INFO replication on the replica until master_link_status is up and master_sync_in_progress is 0, and confirm with CLUSTER INFO that cluster_state has returned to ok.
  2. 85% success If the master is permanently down, promote the replica to master: run CLUSTER FAILOVER FORCE on the replica node. FORCE skips the handshake with the unreachable master but still needs authorization from a majority of masters; if that majority is not reachable, CLUSTER FAILOVER TAKEOVER promotes the replica without a vote. The promoted replica serves only the data it had already received, so any writes it missed are lost.
  3. 70% success Increase cluster-node-timeout to a higher value (e.g., 30000 ms) to tolerate transient network issues: CONFIG SET cluster-node-timeout 30000. Apply the same value on every node, and persist it with CONFIG REWRITE or in redis.conf.
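Before promoting a replica or waiting on a resync, it helps to know what state the replica is actually in. The sketch below classifies a replica from the text of its INFO replication reply. The function name replica_sync_status and the returned labels are hypothetical; the field names (role, master_link_status, master_sync_in_progress) are the ones Redis reports on a replica.

```python
def replica_sync_status(raw: str) -> str:
    """Classify a node from the text of its INFO replication reply (hypothetical helper)."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # INFO output contains "# Replication" section headers; skip those.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value

    if fields.get("role") != "slave":
        return "not a replica"
    if fields.get("master_sync_in_progress") == "1":
        return "syncing"  # a full or partial resync is still running
    if fields.get("master_link_status") == "up":
        return "link up"  # connected to the master and past the initial sync
    return "link down"  # cannot reach the master; data may be stale
```

A replica that reports "link down" for a long time is the case where a forced failover promotes stale data, so that result is worth checking before workaround 2.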

Dead Ends

Common approaches that don't work:

  1. 70% fail

    Restarting the failed node without fixing the replica sync leaves the replica behind, so the next node timeout brings the cluster down again.

  2. 50% fail

    Increasing cluster-node-timeout alone, without addressing the underlying network issues or replica sync lag, only delays failure detection and will not prevent future timeouts.
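One reason the timeout and replica lag interact: according to the comments in redis.conf, a replica does not attempt automatic failover if it has been disconnected from its master for longer than (node-timeout * cluster-replica-validity-factor) + repl-ping-replica-period. The sketch below works through that arithmetic. The function name is hypothetical, and the defaults of 10 for the validity factor and 10 seconds for the ping period are assumed to match the stock configuration.

```python
from typing import Optional


def replica_failover_window_ms(
    node_timeout_ms: int,
    validity_factor: int = 10,      # cluster-replica-validity-factor (assumed default)
    repl_ping_period_s: int = 10,   # repl-ping-replica-period in seconds (assumed default)
) -> Optional[int]:
    """Longest master disconnection (ms) after which a replica still tries to fail over.

    Returns None when the factor is 0, which means the replica always
    considers itself eligible regardless of how long it was disconnected.
    """
    if validity_factor == 0:
        return None
    return node_timeout_ms * validity_factor + repl_ping_period_s * 1000


print(replica_failover_window_ms(15000))  # 160000 ms with a 15 s timeout
print(replica_failover_window_ms(30000))  # 310000 ms with a 30 s timeout
```

Raising cluster-node-timeout therefore also widens the window in which a lagging replica is still allowed to take over, which is one more reason to fix the sync lag itself rather than rely on the timeout.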