llm data_error ai_generated partial

Validation error: the color field of ResponseModel must be 'red', 'green', or 'blue', but 'purple' was received.

ValidationError: 1 validation error for ResponseModel color Input should be 'red', 'green', or 'blue' [type=enum, input_value='purple', input_type=str]

ID: llm/llm-structured-output-enum-violation-streaming

75% fix rate

82% confidence

1 piece of evidence

First seen: 2024-04-05

Version compatibility

Version	Status	Introduced	Deprecated	Notes
openai 1.12.0	active	—	—	—
openai 1.13.0	active	—	—	—
pydantic 2.5.0	active	—	—	—

Root cause analysis

When structured output is combined with streaming, the LLM can generate enum values outside the allowed set, because constraint enforcement is incomplete during partial token generation.

generic

Official documentation

https://platform.openai.com/docs/guides/structured-outputs

Solutions

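Post-process the output and replace any value outside the allowed set with a fixed fallback (apply this before strict validation, otherwise the ValidationError is raised first): valid_colors = {'red', 'green', 'blue'}; if output.color not in valid_colors: output.color = 'blue'  # fallback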
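
The post-processing step can be sketched with the standard library alone. This is a minimal illustration, not the entry's reference fix: the names Color, FALLBACK_COLOR and coerce_color are hypothetical, and it assumes the model output arrives as a JSON string with a top-level "color" key. It also normalizes case and whitespace, which goes slightly beyond the one-liner above.

```python
import json
from enum import Enum


class Color(str, Enum):
    """Allowed values for the color field (hypothetical mirror of ResponseModel)."""
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


# Assumption: a fixed fallback, matching the 'blue' fallback in the snippet above.
FALLBACK_COLOR = Color.BLUE


def coerce_color(raw_json: str) -> dict:
    """Parse the model output and replace an out-of-set color with the fallback."""
    data = json.loads(raw_json)
    value = str(data.get("color", "")).strip().lower()
    try:
        data["color"] = Color(value).value
    except ValueError:
        # Value such as 'purple' is not a member of Color.
        data["color"] = FALLBACK_COLOR.value
    return data


print(coerce_color('{"color": "purple"}'))  # {'color': 'blue'}
print(coerce_color('{"color": "Green"}'))   # {'color': 'green'}
```

Run the coercion on the raw payload first and only then hand the result to the strict model, so the enum check never sees the invalid value. A silent fallback hides how often the model misbehaves, so logging each replacement may be worth considering.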

Switch to non-streaming mode for structured outputs: response = client.chat.completions.create(model='gpt-4', messages=messages, response_format={'type': 'json_object'}, stream=False)
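
A hedged sketch of the non-streaming path follows. The helper fetch_color and the constant ALLOWED_COLORS are hypothetical; client is assumed to be an openai.OpenAI instance (any object exposing chat.completions.create with the same shape works), and model is left to the caller. JSON mode only guarantees syntactically valid JSON, so the enum check is still performed once on the complete response.

```python
import json

# Assumption: the allowed set mirrors the enum on ResponseModel.color.
ALLOWED_COLORS = {"red", "green", "blue"}


def fetch_color(client, model, messages):
    """Request the full JSON response in one call, then validate color once."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        response_format={"type": "json_object"},
        stream=False,  # wait for the complete message instead of partial chunks
    )
    data = json.loads(response.choices[0].message.content)
    if data.get("color") not in ALLOWED_COLORS:
        raise ValueError(f"unexpected color: {data.get('color')!r}")
    return data
```

Because the whole message is available before parsing, validation runs exactly once on complete data. The caller can decide whether to retry the request or apply a fallback when ValueError is raised.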

Ineffective attempts

Common but ineffective approaches:

Setting temperature to 0 to reduce randomness (80% failure rate)
Enum violations stem from token-level decoding constraints, not from sampling randomness.
Increasing max_tokens in the hope of getting complete output (90% failure rate)
More tokens do not fix constraint enforcement; the model still generates invalid values.