{
  "id": "llm/structured-output-hallucination-on-null-fields",
  "signature": "LLM hallucinates values for optional fields in structured output when field is missing from context",
  "signature_zh": "当可选字段在上下文中缺失时，LLM 会为结构化输出中的可选字段生成虚假值",
  "regex": "LLM hallucinates values for optional fields.*response_format.*Optional",
  "domain": "llm",
  "category": "data_error",
  "subcategory": null,
  "root_cause": "When using response_format with a Pydantic schema that has Optional fields, the model invents plausible but incorrect values instead of omitting or setting null, because the instruction to omit is not strong enough for the model's autoregressive generation.",
  "root_cause_type": "generic",
  "root_cause_zh": "当使用带有可选字段的 Pydantic 模式的 response_format 时，模型会虚构看似合理但不正确的值，而不是省略或设置为 null，因为省略指令对模型的自回归生成不够强烈。",
  "versions": [
    {
      "version": "openai==1.30.0",
      "introduced": null,
      "deprecated": null,
      "removed": null,
      "behavior_change": null,
      "status": "active"
    },
    {
      "version": "pydantic==2.7.0",
      "introduced": null,
      "deprecated": null,
      "removed": null,
      "behavior_change": null,
      "status": "active"
    },
    {
      "version": "langchain==0.2.0",
      "introduced": null,
      "deprecated": null,
      "removed": null,
      "behavior_change": null,
      "status": "active"
    }
  ],
  "os_specific": {},
  "dead_ends": [
    {
      "action": "Setting temperature=0 to force determinism",
      "why_fails": "Temperature=0 reduces randomness but does not change the model's autoregressive tendency to fill in plausible values. It still hallucinates optional fields.",
      "fail_rate": 0.75,
      "condition": "",
      "sources": []
    },
    {
      "action": "Adding 'strict=True' to response_format schema",
      "why_fails": "Strict mode only enforces JSON schema compliance (e.g., no extra fields), it does not prevent the model from generating values for optional fields when context is missing.",
      "fail_rate": 0.9,
      "condition": "",
      "sources": []
    },
    {
      "action": "Using a simpler schema with all required fields",
      "why_fails": "This removes the optionality, forcing the model to always provide a value, which makes the hallucination problem worse, not better.",
      "fail_rate": 0.6,
      "condition": "",
      "sources": []
    }
  ],
  "workarounds": [
    {
      "action": "Post-process the LLM output: after receiving the structured response, iterate over all fields that were Optional in the schema and if the original input context does not contain evidence for that field, set it to None. Example:\n\ndef clean_optional_fields(response: dict, schema: Type[BaseModel], context: str) -> dict:\n    for field_name, field in schema.model_fields.items():\n        if field.is_required():\n            continue\n        if field_name not in context:\n            response[field_name] = None\n    return response",
      "success_rate": 0.92,
      "how": "Post-process the LLM output: after receiving the structured response, iterate over all fields that were Optional in the schema and if the original input context does not contain evidence for that field, set it to None. Example:\n\ndef clean_optional_fields(response: dict, schema: Type[BaseModel], context: str) -> dict:\n    for field_name, field in schema.model_fields.items():\n        if field.is_required():\n            continue\n        if field_name not in context:\n            response[field_name] = None\n    return response",
      "condition": "",
      "sources": []
    },
    {
      "action": "Modify the system prompt to explicitly instruct the model: 'If information for an optional field is not present in the provided context, set that field to null. Do not invent values.'",
      "success_rate": 0.78,
      "how": "Modify the system prompt to explicitly instruct the model: 'If information for an optional field is not present in the provided context, set that field to null. Do not invent values.'",
      "condition": "",
      "sources": []
    },
    {
      "action": "Use constrained decoding libraries like 'outlines' or 'lm-format-enforcer' that enforce null tokens for optional fields at generation time, rather than relying on post-hoc JSON parsing.",
      "success_rate": 0.95,
      "how": "Use constrained decoding libraries like 'outlines' or 'lm-format-enforcer' that enforce null tokens for optional fields at generation time, rather than relying on post-hoc JSON parsing.",
      "condition": "",
      "sources": []
    }
  ],
  "workarounds_zh": [
    "后处理 LLM 输出：收到结构化响应后，遍历模式中所有 Optional 字段，如果原始输入上下文中不包含该字段的证据，则将其设置为 None。示例：\n\ndef clean_optional_fields(response: dict, schema: Type[BaseModel], context: str) -> dict:\n    for field_name, field in schema.model_fields.items():\n        if field.is_required():\n            continue\n        if field_name not in context:\n            response[field_name] = None\n    return response",
    "修改系统提示，明确指示模型：'如果提供的上下文中没有可选字段的信息，请将该字段设置为 null。不要虚构值。'",
    "使用受限解码库（如 'outlines' 或 'lm-format-enforcer'），在生成时强制可选字段为 null 标记，而不是依赖事后 JSON 解析。"
  ],
  "transition_graph": {
    "leads_to": [],
    "preceded_by": [],
    "frequently_confused_with": []
  },
  "official_doc_url": "https://platform.openai.com/docs/guides/structured-outputs#optional-fields",
  "official_doc_section": null,
  "error_code": null,
  "verification_tier": "ai_generated",
  "confidence": 0.88,
  "fix_success_rate": 0.82,
  "resolvable": "partial",
  "first_seen": "2024-03-15",
  "last_confirmed": "2024-06-01",
  "last_updated": "2024-06-01",
  "evidence_count": 1,
  "tags": [],
  "locale": "en",
  "aliases": []
}