# LLM hallucinates values for optional fields in structured output when field is missing from context

- **ID:** `llm/structured-output-hallucination-on-null-fields`
- **Domain:** llm
- **Category:** data_error
- **Verification:** ai_generated
- **Fix Rate:** 82%

## Root Cause

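When `response_format` is used with a Pydantic schema that has `Optional` fields, the model tends to invent plausible but incorrect values instead of omitting the field or setting it to null. The schema permits null but does not require it when the context lacks the information, and an instruction to omit is often too weak to override the pull of autoregressive generation toward a plausible-looking completion.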
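
To make the failure mode concrete, the sketch below flags optional fields whose values never appear in the input context. It uses a standard-library dataclass as a stand-in for the Pydantic schema; the `Contact` schema, the sample context, and the `ungrounded_optional_fields` helper are all hypothetical, and the substring check is only a rough proxy for "evidence".

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Contact:
    """Hypothetical extraction schema; phone and email are optional."""
    name: str
    phone: Optional[str] = None
    email: Optional[str] = None


def ungrounded_optional_fields(record: Contact, context: str) -> List[str]:
    """Return optional fields whose non-null value never appears in the context."""
    flagged = []
    for f in fields(record):
        value = getattr(record, f.name)
        # Convention of this sketch only: a default of None marks a field as optional.
        is_optional = f.default is None
        if is_optional and value is not None and str(value) not in context:
            flagged.append(f.name)
    return flagged


context = "Please reach out to Dana Whitfield about the renewal."
# The kind of output this issue produces: the context mentions no phone number.
model_output = Contact(name="Dana Whitfield", phone="555-0142", email=None)
flagged = ungrounded_optional_fields(model_output, context)
print(flagged)
```

A check like this only detects the problem. It will also flag correct values that the model reformatted (for example, a normalized phone number), so treat the result as a signal for review rather than proof of hallucination.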

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|---------|--------|------------|------------|
| openai==1.30.0 | active | — | — |
| pydantic==2.7.0 | active | — | — |
| langchain==0.2.0 | active | — | — |

## Workarounds

1. **Post-process the LLM output to null out optional fields that lack evidence in the input context** (92% success)

   After receiving the structured response, iterate over every field that is optional in the schema and set it to `None` if the original input context does not contain evidence for that field. Example:

   ```python
   from typing import Type

   from pydantic import BaseModel


   def clean_optional_fields(response: dict, schema: Type[BaseModel], context: str) -> dict:
       for field_name, field in schema.model_fields.items():
           if field.is_required():
               continue  # required fields are left untouched
           # Crude evidence check: the field name must appear in the context text.
           if field_name not in context:
               response[field_name] = None
       return response
   ```
2. **Modify the system prompt to explicitly instruct the model to return null for optional fields that the context does not support** (78% success)
   ```
   If information for an optional field is not present in the provided context, set that field to null. Do not invent values.
   ```
3. **Use a constrained decoding library such as `outlines` or `lm-format-enforcer`, which enforces the schema, including null tokens for optional fields, at generation time rather than relying on post-hoc JSON parsing** (95% success)
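
The system-prompt workaround can be sketched as a small helper that appends the null instruction and names the optional fields. This is a minimal sketch that assumes the schema is available as a JSON Schema dict with `properties` and `required` keys. The helper names and the sample schema are hypothetical, and listing the optional fields by name is an extra step beyond the instruction quoted above.

```python
from typing import Dict, List

NULL_INSTRUCTION = (
    "If information for an optional field is not present in the provided "
    "context, set that field to null. Do not invent values."
)


def optional_field_names(json_schema: Dict) -> List[str]:
    """Fields listed under 'properties' but not under 'required'."""
    required = set(json_schema.get("required", []))
    return [name for name in json_schema.get("properties", {}) if name not in required]


def build_system_prompt(base_prompt: str, json_schema: Dict) -> str:
    """Append the null instruction and name the optional fields explicitly."""
    optional = optional_field_names(json_schema)
    if not optional:
        return base_prompt  # nothing optional, so no extra instruction is needed
    return f"{base_prompt}\n\n{NULL_INSTRUCTION}\nOptional fields: {', '.join(optional)}."


# Hypothetical schema for illustration only.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "phone": {"type": ["string", "null"]},
        "email": {"type": ["string", "null"]},
    },
    "required": ["name"],
}
prompt = build_system_prompt("Extract the contact described in the user message.", schema)
print(prompt)
```

Because the prompt only lowers the rate of invented values, it pairs naturally with the post-processing step as a second line of defense.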

## Dead Ends

- **Setting temperature=0 to force determinism** — Temperature=0 reduces randomness but does not change the model's autoregressive tendency to fill in plausible values. It still hallucinates optional fields. (75% fail)
- **Adding 'strict=True' to response_format schema** — Strict mode only enforces JSON schema compliance (e.g., no extra fields); it does not prevent the model from generating values for optional fields when the context lacks them. (90% fail)
- **Using a simpler schema with all required fields** — This removes the optionality, forcing the model to always provide a value, which makes the hallucination problem worse, not better. (60% fail)
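
The strict-mode dead end comes down to the fact that schema compliance says nothing about grounding. The sketch below uses a tiny hand-written checker as a stand-in for a real JSON Schema validator; the `SCHEMA` fragment, the `conforms` helper, and the sample payloads are all hypothetical. It shows that an honest null and an invented value are equally valid for a nullable field.

```python
from typing import Any, Dict

# Hypothetical schema fragment: "phone" is nullable and extra keys are rejected.
SCHEMA = {
    "properties": {"name": (str,), "phone": (str, type(None))},
}


def conforms(payload: Dict[str, Any]) -> bool:
    """Stand-in for schema validation covering only the fragment above."""
    props = SCHEMA["properties"]
    if set(payload) != set(props):
        return False  # missing or extra keys violate the schema
    return all(isinstance(payload[key], allowed) for key, allowed in props.items())


honest = {"name": "Dana Whitfield", "phone": None}
invented = {"name": "Dana Whitfield", "phone": "555-0142"}

# Both payloads pass: the schema cannot tell a grounded value from an invented one.
print(conforms(honest), conforms(invented))
```

Schema enforcement is still useful for guaranteeing parseable output. It just has to be combined with a grounding check or an explicit instruction to address this issue.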
