mongodb runtime_error ai_generated partial

MongoServerError: balancer round trip time exceeded 30 seconds: chunk migration timed out for range

ID: mongodb/balancer-round-trip-timeout

Fix Rate: 78%
Confidence: 85%
Evidence: 1
First Seen: 2024-05-12

Version Compatibility

Version        Status    Introduced    Deprecated    Notes
mongodb 6.0    active
mongodb 7.0    active
mongodb 8.0    active

Root Cause

A chunk migration between shards failed because network latency or an overloaded source shard pushed it past the 30-second round-trip threshold.
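
To tell which of the two causes applies, it can help to check how long recent migrations actually took. The sketch below is one possible way to do that, not a definitive diagnostic. It assumes entries shaped like documents in `config.changelog`, where `moveChunk.from` events carry per-step timings in milliseconds under `details` keys such as `step 1 of 6`. That field layout is an assumption and may differ between server versions; `slowMigrations` is a hypothetical helper name.

```javascript
// Assumption: each entry resembles a config.changelog document, e.g.
// { what: "moveChunk.from", ns: "mydb.mycoll", details: { "step 1 of 6": 12, ... } }
// Returns the migrations whose summed step timings exceed thresholdMs.
function slowMigrations(entries, thresholdMs = 30000) {
  return entries
    .filter((e) => e.what === "moveChunk.from" && e.details)
    .map((e) => {
      // Sum every numeric "step N of M" timing recorded for this migration.
      const totalMs = Object.entries(e.details)
        .filter(([key, value]) => key.startsWith("step ") && typeof value === "number")
        .reduce((sum, [, value]) => sum + value, 0);
      return { ns: e.ns, totalMs };
    })
    .filter((m) => m.totalMs > thresholdMs);
}
```

In mongosh, the input could come from `db.getSiblingDB("config").changelog.find({ what: "moveChunk.from" }).sort({ time: -1 }).limit(50).toArray()`. Consistently long totals for one source shard point toward that shard being overloaded, while long totals across all shards point toward the network.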

generic

Official Documentation

https://www.mongodb.com/docs/manual/reference/command/moveChunk/

Workarounds

  1. 75% success: Temporarily stop the balancer so that in-progress migrations can complete: `sh.stopBalancer(10000)`. Monitor network latency between shards, then re-enable it with `sh.startBalancer()`.
  2. 82% success: Increase the chunk size to reduce migration frequency: `db.getSiblingDB("config").settings.updateOne({ _id: "chunksize" }, { $set: { value: 256 } }, { upsert: true })` (value in MB; 128 is already the default on MongoDB 6.0 and later), and ensure the source shard has sufficient I/O capacity.
  3. 85% success: After verifying network health, migrate chunks manually with a longer timeout: `db.adminCommand({ moveChunk: 'mydb.mycoll', find: { shardKey: value }, to: 'targetShard', maxTimeMS: 60000 })`.
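
The first and third workarounds can be combined into one step: pause the balancer, run a manual migration, and restore the balancer afterwards even if the migration fails. The following is a minimal sketch under stated assumptions, not a definitive procedure. `migrateWithBalancerPaused` is a hypothetical helper name, `sh` and `db` are the mongosh globals passed in as arguments so the function can be exercised outside the shell, and the `maxTimeMS` value is the one quoted in the workaround above.

```javascript
// Sketch: run one manual chunk migration while the balancer is paused.
// sh and db are expected to be the mongosh globals (or stand-ins with the same methods).
function migrateWithBalancerPaused(sh, db, { ns, find, to, maxTimeMS = 60000 }) {
  // Remember the current state so a balancer that was already off stays off.
  const wasEnabled = sh.getBalancerState();
  if (wasEnabled) {
    // Wait up to 10 seconds for the balancer to stop.
    sh.stopBalancer(10000);
  }
  try {
    // Manual migration of the chunk that contains the document matched by `find`.
    return db.adminCommand({ moveChunk: ns, find, to, maxTimeMS });
  } finally {
    if (wasEnabled) {
      sh.startBalancer();
    }
  }
}
```

In mongosh this might be called as `migrateWithBalancerPaused(sh, db, { ns: 'mydb.mycoll', find: { shardKey: value }, to: 'targetShard' })`. The `finally` block is the main design choice here: it keeps a failed migration from leaving the balancer disabled.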

Dead Ends

Common approaches that don't work:

  1. 90% fail: Restarting does not help. The timeout is caused by underlying network or resource issues, not stale balancer state; a restart only delays recovery and may lose migration progress.
  2. 95% fail: Adjusting maxTimeMS does not help. The round-trip timeout is a hard-coded balancer limit, not a client-side query timeout, so the setting has no effect on internal balancer logic.
  3. 80% fail: Stopping all chunk migrations leads to data imbalance, reduced query performance, and eventual cluster instability.