llm data_error ai_generated partial

当可选字段在上下文中缺失时，LLM 会为结构化输出中的可选字段生成虚假值

LLM hallucinates values for optional fields in structured output when field is missing from context

ID: llm/structured-output-hallucination-on-null-fields

其他格式: JSON · Markdown 中文 · English

82%修复率

88%置信度

1证据数

2024-03-15首次发现

版本兼容性

版本	状态	引入	弃用	备注
openai==1.30.0	active	—	—	—
pydantic==2.7.0	active	—	—	—
langchain==0.2.0	active	—	—	—

根因分析

当使用带有可选字段的 Pydantic 模式的 response_format 时，模型会虚构看似合理但不正确的值，而不是省略或设置为 null，因为省略指令对模型的自回归生成不够强烈。

English

When using response_format with a Pydantic schema that has Optional fields, the model invents plausible but incorrect values instead of omitting or setting null, because the instruction to omit is not strong enough for the model's autoregressive generation.

generic

官方文档

https://platform.openai.com/docs/guides/structured-outputs#optional-fields

解决方案

后处理 LLM 输出：收到结构化响应后，遍历模式中所有 Optional 字段，如果原始输入上下文中不包含该字段的证据，则将其设置为 None。示例：

def clean_optional_fields(response: dict, schema: Type[BaseModel], context: str) -> dict:
    for field_name, field in schema.model_fields.items():
        if field.is_required():
            continue
        if field_name not in context:
            response[field_name] = None
    return response

修改系统提示，明确指示模型：'如果提供的上下文中没有可选字段的信息，请将该字段设置为 null。不要虚构值。'

使用受限解码库（如 'outlines' 或 'lm-format-enforcer'），在生成时强制可选字段为 null 标记，而不是依赖事后 JSON 解析。

无效尝试

常见但无效的做法:

Setting temperature=0 to force determinism 75% 失败
Temperature=0 reduces randomness but does not change the model's autoregressive tendency to fill in plausible values. It still hallucinates optional fields.
Adding 'strict=True' to response_format schema 90% 失败
Strict mode only enforces JSON schema compliance (e.g., no extra fields), it does not prevent the model from generating values for optional fields when context is missing.
Using a simpler schema with all required fields 60% 失败
This removes the optionality, forcing the model to always provide a value, which makes the hallucination problem worse, not better.