当可选字段在上下文中缺失时,LLM 会为结构化输出中的可选字段生成虚假值
LLM hallucinates values for optional fields in structured output when field is missing from context
ID: llm/structured-output-hallucination-on-null-fields
版本兼容性
| 版本 | 状态 | 引入 | 弃用 | 备注 |
|---|---|---|---|---|
| openai==1.30.0 | active | — | — | — |
| pydantic==2.7.0 | active | — | — | — |
| langchain==0.2.0 | active | — | — | — |
根因分析
当使用带有可选字段的 Pydantic 模式的 response_format 时,模型会虚构看似合理但不正确的值,而不是省略或设置为 null,因为省略指令对模型的自回归生成不够强烈。
English
When using response_format with a Pydantic schema that has Optional fields, the model invents plausible but incorrect values instead of omitting or setting null, because the instruction to omit is not strong enough for the model's autoregressive generation.
官方文档
https://platform.openai.com/docs/guides/structured-outputs#optional-fields解决方案
-
后处理 LLM 输出:收到结构化响应后,遍历模式中所有 Optional 字段,如果原始输入上下文中不包含该字段的证据,则将其设置为 None。示例: def clean_optional_fields(response: dict, schema: Type[BaseModel], context: str) -> dict: for field_name, field in schema.model_fields.items(): if field.is_required(): continue if field_name not in context: response[field_name] = None return response -
修改系统提示,明确指示模型:'如果提供的上下文中没有可选字段的信息,请将该字段设置为 null。不要虚构值。'
-
使用受限解码库(如 'outlines' 或 'lm-format-enforcer'),在生成时强制可选字段为 null 标记,而不是依赖事后 JSON 解析。
无效尝试
常见但无效的做法:
-
Setting temperature=0 to force determinism
75% 失败
Temperature=0 reduces randomness but does not change the model's autoregressive tendency to fill in plausible values. It still hallucinates optional fields.
-
Adding 'strict=True' to response_format schema
90% 失败
Strict mode only enforces JSON schema compliance (e.g., no extra fields), it does not prevent the model from generating values for optional fields when context is missing.
-
Using a simpler schema with all required fields
60% 失败
This removes the optionality, forcing the model to always provide a value, which makes the hallucination problem worse, not better.