CHROMA-ERR-0042 llm data_error ai_generated true

chromadb.errors.InternalError: Index corruption detected. Rebuild required.

ID: llm/embedding-vector-index-corruption-after-reindex


Fix Rate: 82%

Confidence: 85%

Evidence: 1

First Seen: 2024-06-15

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
chromadb==0.4.22	active	—	—	—
chromadb==0.5.0	active	—	—	—
langchain-chroma==0.1.0	active	—	—	—

Root Cause

ChromaDB index files become corrupted when a reindex operation is interrupted by a crash or network disconnect, leaving the HNSW graph in an inconsistent state.

generic


Official Documentation

https://docs.trychroma.com/troubleshooting#index-corruption

Workarounds

95% success: Identify the corrupted collection, delete it, and re-ingest the source documents: call client.delete_collection('my_collection'), then client.create_collection('my_collection'), then re-embed all documents. For production, keep a backup of the source documents in separate storage (e.g., S3) along with a script to re-embed them.
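The delete-and-re-ingest workaround can be sketched as follows. This is a minimal illustration, not an official recovery procedure: `rebuild_collection` and `batched` are hypothetical helper names, the batch size is an arbitrary assumption, and the client is expected to expose the `delete_collection`, `create_collection`, and `Collection.add` calls named above.

```python
from typing import Iterator, Sequence, Tuple


def batched(
    ids: Sequence[str], documents: Sequence[str], batch_size: int
) -> Iterator[Tuple[Sequence[str], Sequence[str]]]:
    """Yield aligned (ids, documents) slices of at most batch_size items."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], documents[start:end]


def rebuild_collection(client, name, ids, documents, batch_size=500):
    """Drop the named collection and re-ingest the source documents.

    Hypothetical helper: `client` is any object with ChromaDB-style
    delete_collection / create_collection methods. The new collection
    re-embeds the documents with its configured embedding function.
    """
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")

    client.delete_collection(name)               # remove the corrupted collection
    collection = client.create_collection(name)  # start from an empty index

    # Add in batches (batch size is an assumption) to keep each request small.
    for id_batch, doc_batch in batched(ids, documents, batch_size):
        collection.add(ids=list(id_batch), documents=list(doc_batch))
    return collection
```

With a real store, the client would typically be created with something like `chromadb.PersistentClient(path="/path/to/persist")`, and `ids` and `documents` would be loaded from the backed-up source data.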
80% success: Check persistence integrity by running 'chroma run --path /path/to/persist --debug' and looking for 'HNSW index integrity check failed' in the output. Then repair from the Python client with collection._client._admin_client.reset_collection('my_collection') (requires admin access). This call goes through private attributes (leading underscores), is not part of ChromaDB's documented public API, and may not exist in the versions listed above; confirm that it and the --debug flag are available in your installed version before relying on them.
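To narrow down which collection is affected from the client side, one option is to run a small query against every collection and record the ones that raise. The sketch below assumes that index corruption surfaces as an exception when the index is queried, which the source does not confirm. `find_failing_collections` is a hypothetical helper, and any exception is recorded, so the messages still need to be read by hand.

```python
def find_failing_collections(client):
    """Return {collection_name: error_message} for collections whose query raises.

    Hypothetical helper: `client` is a ChromaDB-style client. Depending on the
    version, list_collections() yields Collection objects or plain names, so
    both shapes are handled.
    """
    failures = {}
    for item in client.list_collections():
        name = item if isinstance(item, str) else item.name
        try:
            collection = client.get_collection(name)
            # Fetch one stored embedding to use as a query vector of the right size.
            sample = collection.get(limit=1, include=["embeddings"])
            embeddings = sample.get("embeddings")
            if embeddings is None or len(embeddings) == 0:
                continue  # empty collection, nothing to probe
            # A nearest-neighbour query forces the vector index to be read.
            collection.query(query_embeddings=[embeddings[0]], n_results=1)
        except Exception as exc:  # broad on purpose: report, do not crash
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

An empty result only means that no probe query raised; it does not prove that every index is intact.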
75% success: Set up a cron job to periodically validate the store, and take a snapshot of the persistence directory before any reindex operation. The source names chromadb.api.types.validate_metadata for the check, but that function validates metadata dictionaries rather than the HNSW index, so it cannot by itself confirm that the index is intact.
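The snapshot step can be sketched with the standard library alone. This is a minimal illustration: `snapshot_persist_dir` is a hypothetical helper, the timestamped naming scheme is an assumption, and the copy is only safe if nothing is writing to the persistence directory while it runs (for example, with the server stopped).

```python
import shutil
import time
from pathlib import Path


def snapshot_persist_dir(persist_dir, backup_root):
    """Copy the persistence directory to a timestamped folder under backup_root.

    Hypothetical helper. Assumes no process is writing to persist_dir
    during the copy; otherwise the snapshot itself may be inconsistent.
    """
    src = Path(persist_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"persistence directory not found: {src}")

    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    shutil.copytree(src, dest)  # creates backup_root if needed; fails if dest exists
    return dest
```

Calling this immediately before a reindex gives a directory that can be copied back into place if the operation is interrupted.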


Dead Ends

Common approaches that don't work:

95% fail: Restarting the server. The corrupted HNSW graph persists on disk, so a restart does not repair the structural damage; the same corrupted files are loaded again.
98% fail: Calling reset(). It wipes all data, not just the corrupted index, so unrelated collections lose their embeddings too.
70% fail: Re-embedding without a backup of the source data. If the original source data is lost or was never backed up, the index cannot be recreated; this only works if you still have the raw documents and can re-embed them.