llama_index.core.ingestion.pipeline.IngestionCacheMiss: Cache miss for node 'node_abc123'. Re-processing.
ID: llm/llama-index-pipeline-cache-miss
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| llama-index==0.10.43 | active | — | — | — |
| llama-index-core==0.11.0 | active | — | — | — |
Root cause analysis
The LlamaIndex ingestion pipeline cache is keyed on a hash of each document's content. When that hash changes (e.g., due to metadata updates or text normalization), the cache lookup misses for nodes that were already processed, and the pipeline re-runs the expensive chunking and embedding steps for them.
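The effect described above can be reproduced with a simplified stand-in for a content hash. This is a minimal sketch, not LlamaIndex's actual hashing code: the `content_hash` helper and the sample metadata fields are assumptions made for illustration. It shows that a changed timestamp or stray whitespace is enough to produce a different key.

```python
import hashlib
import json


def content_hash(text: str, metadata: dict) -> str:
    """Hash text together with metadata, as a content-keyed cache might (illustrative only)."""
    payload = json.dumps({"text": text, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = content_hash(
    "Quarterly report", {"source": "report.pdf", "last_modified": "2024-05-01"}
)
# Same text, but the file was touched: only the timestamp differs.
touched = content_hash(
    "Quarterly report", {"source": "report.pdf", "last_modified": "2024-05-02"}
)
# Same metadata, but the loader emitted different whitespace.
respaced = content_hash(
    "Quarterly  report ", {"source": "report.pdf", "last_modified": "2024-05-01"}
)

print(base == touched)   # False
print(base == respaced)  # False
```

Either difference would cause a cache keyed this way to treat the document as new.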
Official documentation
https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline.html#caching
Solutions
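- Keep document hashes stable by normalizing text before ingestion: use `pipeline.add_documents(documents, hash_ids=True)` and make sure the document text is normalized (for example, lowercased and stripped of extra whitespace) before it is added to the pipeline. Confirm that `add_documents` and `hash_ids` exist in your installed version; `pipeline.run` is the documented entry point. Example: `from llama_index.core.node_parser import SimpleNodeParser; parser = SimpleNodeParser.from_defaults(); nodes = parser.get_nodes_from_documents(docs); pipeline.run(nodes=nodes, in_place=True)`.
- Use a persistent cache directory outside the project folder so the cache is not wiped during deployment: `pipeline = IngestionPipeline(cache=IngestionCache(persist_path='/data/cache/ingestion_cache'))`. If the constructor in your release does not accept `persist_path`, `IngestionCache.from_persist_path(...)` and `pipeline.persist(...)` serve the same purpose.
- Implement a custom cache-key function by subclassing IngestionCache and overriding the `_get_cache_key` method so that metadata fields such as 'last_modified' or 'version' are ignored. Treat `_get_cache_key` as an internal, version-dependent hook and verify that it exists in your installed release.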
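The three remedies above share one idea: derive the cache key from content that does not drift, and keep the cache somewhere that survives a deployment. The following is a library-independent sketch of that idea, not LlamaIndex code; `VOLATILE_KEYS`, `normalize_text`, `stable_key`, and `FileCache` are hypothetical names introduced for illustration, and a temporary directory stands in for a path outside the project folder.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical set of metadata fields that change without the content changing.
VOLATILE_KEYS = {"last_modified", "version"}


def normalize_text(text: str) -> str:
    """Lowercase and collapse runs of whitespace to a single space."""
    return " ".join(text.lower().split())


def stable_key(text: str, metadata: dict) -> str:
    """Hash normalized text plus only the non-volatile metadata fields."""
    kept = {k: v for k, v in metadata.items() if k not in VOLATILE_KEYS}
    payload = json.dumps({"text": normalize_text(text), "metadata": kept}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FileCache:
    """Tiny JSON-file cache; stands in for a persisted ingestion cache."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.data = json.loads(path.read_text()) if path.exists() else {}

    def get(self, key: str):
        return self.data.get(key)

    def put(self, key: str, value) -> None:
        self.data[key] = value
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data))


# A temporary directory stands in for a location outside the project folder.
cache_path = Path(tempfile.mkdtemp()) / "ingestion_cache.json"

first = stable_key("Quarterly  Report ", {"source": "report.pdf", "last_modified": "2024-05-01"})
FileCache(cache_path).put(first, {"chunks": 3})

# A later run: same content, different whitespace and timestamp, fresh cache object.
second = stable_key("quarterly report", {"source": "report.pdf", "last_modified": "2024-05-02"})
hit = FileCache(cache_path).get(second)

print(first == second)  # True
print(hit)              # {'chunks': 3}
```

In this sketch the normalization affects only the key. Whether the normalized or the original text is passed on to chunking and embedding is a separate decision.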
Ineffective attempts
Common but ineffective approaches:
- 90% failure rate. This eliminates all performance benefits of caching and forces the pipeline to re-process every document on every run, which is impractical for large datasets.
- 85% failure rate. This is a temporary fix that does not address the root cause (hash changes). The cache will miss again on the next run if the document source is still being modified.
- 95% failure rate. Custom hash functions are not supported in the current LlamaIndex cache implementation; overriding the hash requires monkey-patching internal methods, which breaks on version updates.