LLAMA-ERR-0091 llm runtime_error ai_generated partial

llama_index.core.ingestion.pipeline.IngestionCacheMiss: Cache miss for node 'node_abc123'. Re-processing.

ID: llm/llama-index-pipeline-cache-miss

Fix Rate: 78%
Confidence: 82%
Evidence: 1
First Seen: 2024-09-20

Version Compatibility

Version                    Status   Introduced   Deprecated   Notes
llama-index==0.10.43       active   -            -            -
llama-index-core==0.11.0   active   -            -            -

Root Cause

The LlamaIndex ingestion pipeline cache is invalidated when a document's hash changes (e.g., due to metadata updates or text normalization). The cache lookup then fails for nodes that were already processed, and the pipeline re-runs the expensive chunking and embedding steps for them.
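
A minimal sketch of the effect described above, assuming the cache key is derived from a hash over the node text and its metadata. The `content_hash` function and the sample values are hypothetical stand-ins for illustration, not LlamaIndex's actual hashing code.

```python
import hashlib
import json


def content_hash(text: str, metadata: dict) -> str:
    """Hypothetical cache key: SHA-256 over the text plus all metadata."""
    payload = json.dumps({"text": text, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same text on both runs; only a volatile metadata field differs.
first_run = content_hash(
    "Quarterly report.",
    {"source": "report.txt", "last_modified": "2024-09-19"},
)
second_run = content_hash(
    "Quarterly report.",
    {"source": "report.txt", "last_modified": "2024-09-20"},
)

# The keys differ, so a lookup with the second key misses the first entry.
print(first_run == second_run)  # False
```

Under this assumption, any field that changes between runs (a timestamp, a version counter, trailing whitespace in the text) produces a new key even though the content that matters is unchanged.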

Official Documentation

https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline.html#caching

Workarounds

  1. 85% success: Keep the document hash stable by normalizing text before ingestion. Use `pipeline.add_documents(documents, hash_ids=True)` and normalize each document's text (e.g., lowercase it and trim whitespace) before adding it to the pipeline. Example: `from llama_index.core.node_parser import SimpleNodeParser; parser = SimpleNodeParser.from_defaults(); nodes = parser.get_nodes_from_documents(docs); pipeline.run(nodes=nodes, in_place=True)`.
  2. 80% success: Store the cache in a persistent directory outside the project folder so that deployments do not wipe it: `pipeline = IngestionPipeline(cache=IngestionCache(persist_path='/data/cache/ingestion_cache'))`.
  3. 70% success: Implement a custom cache key function by subclassing `IngestionCache` and overriding the `_get_cache_key` method so that it ignores volatile metadata fields such as 'last_modified' or 'version'.
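
The normalization idea behind workarounds 1 and 3 can be sketched in plain Python. This is an illustration under stated assumptions, not part of the LlamaIndex API: `normalize_text`, `stable_metadata`, and `VOLATILE_KEYS` are hypothetical names, and lowercasing is left optional because it changes the text that gets embedded.

```python
import unicodedata

# Hypothetical set of metadata fields that change between runs
# (the two fields named in workaround 3).
VOLATILE_KEYS = {"last_modified", "version"}


def normalize_text(text: str, lowercase: bool = False) -> str:
    """Return a canonical form of the text so repeated runs hash identically."""
    text = unicodedata.normalize("NFC", text)              # unify Unicode forms
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    lines = [line.rstrip() for line in text.split("\n")]   # drop trailing spaces
    text = "\n".join(lines).strip()                        # trim outer whitespace
    return text.lower() if lowercase else text


def stable_metadata(metadata: dict) -> dict:
    """Drop volatile fields and fix key order before the document is ingested."""
    return {k: metadata[k] for k in sorted(metadata) if k not in VOLATILE_KEYS}


raw_text = "Quarterly report.  \r\nRevenue grew. \n"
raw_meta = {"source": "report.txt", "last_modified": "2024-09-20", "version": 7}

clean_text = normalize_text(raw_text)
clean_meta = stable_metadata(raw_meta)
print(repr(clean_text))  # 'Quarterly report.\nRevenue grew.'
print(clean_meta)        # {'source': 'report.txt'}
```

The cleaned values would then be used to build each document (for example `Document(text=clean_text, metadata=clean_meta)`) before it is passed to `pipeline.run(documents=...)`, so that unchanged content yields the same hash on every run.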

Dead Ends

Common approaches that don't work:

  1. 90% fail

    Disabling the cache entirely eliminates all performance benefits of caching and causes the pipeline to re-process every document on every run, which is impractical for large datasets.

  2. 85% fail

    This is a temporary fix that doesn't address the root cause (hash changes). The cache will miss again on the next run if the document source is still being modified.

  3. 95% fail

    Custom hash functions are not supported in the current LlamaIndex cache implementation; attempting to override requires monkey-patching internal methods, which breaks on version updates.