llm data_error ai_generated partial

LLM hallucinates values for optional fields in structured output when field is missing from context

ID: llm/structured-output-hallucination-on-null-fields

Fix Rate: 82%
Confidence: 88%
Evidence: 1
First Seen: 2024-03-15

Version Compatibility

Version             Status   Introduced   Deprecated   Notes
openai==1.30.0      active   -            -            -
pydantic==2.7.0     active   -            -            -
langchain==0.2.0    active   -            -            -

Root Cause

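When response_format is used with a Pydantic schema that has Optional fields, the model tends to invent plausible but incorrect values instead of omitting the field or setting it to null. The schema accepts both null and a well-typed value, so an implicit expectation to leave the field empty is too weak a signal to override the model's autoregressive tendency to continue with a plausible completion.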
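
The root cause can be illustrated without calling any model. The sketch below uses a hand-written schema fragment that is assumed to resemble what Pydantic v2 emits for a field declared as Optional[str] = None; the field names (vendor, po_number) and the tiny validator are hypothetical and only cover the types used here. It shows that a null value and a fabricated value both pass schema validation, so validation alone cannot flag the hallucination.

```python
# Hypothetical schema fragment (assumption: similar in shape to the JSON schema
# Pydantic v2 generates for `po_number: Optional[str] = None`).
SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "po_number": {"anyOf": [{"type": "string"}, {"type": "null"}]},
    },
    "required": ["vendor"],
}


def type_ok(value, spec):
    """Minimal type check covering only 'string', 'null', and 'anyOf'."""
    if "anyOf" in spec:
        return any(type_ok(value, option) for option in spec["anyOf"])
    if spec.get("type") == "string":
        return isinstance(value, str)
    if spec.get("type") == "null":
        return value is None
    return False


def validates(payload, schema):
    """Return True if payload has all required keys and well-typed values."""
    if any(name not in payload for name in schema["required"]):
        return False
    return all(
        type_ok(payload[name], spec)
        for name, spec in schema["properties"].items()
        if name in payload
    )


# The context never mentions a purchase order, yet both outputs are schema-valid.
grounded = {"vendor": "Acme", "po_number": None}
hallucinated = {"vendor": "Acme", "po_number": "PO-12345"}
print(validates(grounded, SCHEMA), validates(hallucinated, SCHEMA))
```

Because both payloads validate, any defense has to compare the output against the input context rather than against the schema.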

Official Documentation

https://platform.openai.com/docs/guides/structured-outputs#optional-fields

Workarounds

  1. (92% success) Post-process the LLM output: after receiving the structured response, iterate over every field that is optional in the schema and, if the original input context contains no evidence for that field, set it to None. Example:

    from typing import Type

    from pydantic import BaseModel

    def clean_optional_fields(response: dict, schema: Type[BaseModel], context: str) -> dict:
        # Crude evidence test: the field name must appear verbatim in the context.
        for field_name, field in schema.model_fields.items():
            if field.is_required():
                continue  # required fields are left untouched
            if field_name not in context:
                response[field_name] = None
        return response
  2. (78% success) Modify the system prompt to explicitly instruct the model: 'If information for an optional field is not present in the provided context, set that field to null. Do not invent values.'
  3. (95% success) Use constrained decoding libraries like 'outlines' or 'lm-format-enforcer' that enforce null tokens for optional fields at generation time, rather than relying on post-hoc JSON parsing.
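
Workarounds 1 and 2 can be combined. The sketch below is a variation, not the method from this entry: instead of looking for the field name in the context, it looks for the returned value, on the assumption that a grounded value usually appears (case-insensitively) in the source text. The names SYSTEM_RULE and null_ungrounded_fields are hypothetical, and the substring test will miss paraphrased or reformatted values such as dates and amounts.

```python
# Instruction text from workaround 2, intended to be appended to the system prompt.
SYSTEM_RULE = (
    "If information for an optional field is not present in the provided "
    "context, set that field to null. Do not invent values."
)


def _normalize(text):
    """Lowercase and collapse whitespace so matching ignores layout differences."""
    return " ".join(str(text).lower().split())


def null_ungrounded_fields(response, optional_fields, context):
    """Return a copy of response with unsupported optional values set to None.

    Assumption: a value counts as supported only if its normalized string form
    occurs as a substring of the normalized context.
    """
    haystack = _normalize(context)
    cleaned = dict(response)  # shallow copy; the caller's dict is not mutated
    for name in optional_fields:
        value = cleaned.get(name)
        if value is None:
            continue
        if _normalize(value) not in haystack:
            cleaned[name] = None
    return cleaned


context = "Invoice from Acme Corp dated 2024-03-01. Total due: 120.50 EUR."
response = {"vendor": "Acme Corp", "po_number": "PO-12345", "total": "120.50"}
print(null_ungrounded_fields(response, ["po_number", "total"], context))
```

With a Pydantic v2 model, the list of optional field names could be derived from model_fields by keeping the entries whose is_required() returns False, as the example in workaround 1 does.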

Dead Ends

Common approaches that don't work:

  1. Setting temperature=0 to force determinism (75% fail)

    Setting temperature=0 reduces sampling randomness but does not change the model's autoregressive tendency to fill in plausible values, so it still hallucinates optional fields.

  2. Adding 'strict=True' to the response_format schema (90% fail)

    Strict mode only enforces JSON schema compliance (e.g., no extra fields). It does not prevent the model from generating values for optional fields when the context lacks them.

  3. Using a simpler schema with all required fields (60% fail)

    This removes the optionality and forces the model to always provide a value, which makes the hallucination problem worse, not better.
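
Dead ends 2 and 3 are related. As far as the OpenAI structured outputs guide describes it, strict mode expects every property to be listed in required and additionalProperties to be false, with optionality emulated through a union with null. The sketch below applies that transformation to a plain JSON schema dict; to_strict_schema and allows_null are hypothetical helpers, handle only top-level properties, and are not part of any SDK. The point is that the strict-compatible schema still accepts either null or a string, so the choice between them remains with the model.

```python
import copy


def allows_null(spec):
    """Return True if a property spec already permits null."""
    declared = spec.get("type")
    if declared == "null" or (isinstance(declared, list) and "null" in declared):
        return True
    return any(allows_null(option) for option in spec.get("anyOf", []))


def to_strict_schema(schema):
    """Return a strict-style copy: all properties required, optionals nullable."""
    strict = copy.deepcopy(schema)
    props = strict.get("properties", {})
    originally_required = set(strict.get("required", []))
    for name, spec in list(props.items()):
        # Previously optional fields keep an escape hatch through a null union.
        if name not in originally_required and not allows_null(spec):
            props[name] = {"anyOf": [spec, {"type": "null"}]}
    strict["required"] = list(props)
    strict["additionalProperties"] = False
    return strict


loose = {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "po_number": {"type": "string"}},
    "required": ["vendor"],
}
print(to_strict_schema(loose))
```

Dropping the null union, as in dead end 3, would leave the model with no schema-valid way to abstain.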