llm
runtime_error
ai_generated
true
Error: context length exceeded while processing streaming chunks (partial response returned)
ID: llm/context-window-exceeded-with-chunked-streaming
Fix rate: 80%
Confidence: 85%
Evidence count: 1
First seen: 2024-03-15
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| openai==1.12.0 | active | — | — | — |
| anthropic==0.25.0 | active | — | — | — |
| langchain==0.1.12 | active | — | — | — |
| gpt-4-turbo-2024-04-09 | active | — | — | — |
| claude-3-opus-20240229 | active | — | — | — |
Root cause analysis
During streaming, the cumulative input and output tokens exceed the model's context window. The API then cuts the response off mid-stream without raising a clear error; the only signal is usually the finish reason on the final chunk (for example `finish_reason: "length"` in the OpenAI API).
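The arithmetic behind this failure can be sketched as follows. This is a minimal illustration, not part of any SDK: the `TokenBudget` class and its field names are hypothetical, and the numbers are assumed values chosen to resemble a 128,000-token window.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical helper: tracks how much of the context window is left."""
    context_window: int  # total tokens the model can hold (input + output)
    input_tokens: int    # tokens already consumed by the prompt
    max_tokens: int      # output cap requested by the caller

    @property
    def headroom(self) -> int:
        """Tokens left for output once the prompt is counted."""
        return self.context_window - self.input_tokens

    @property
    def will_truncate(self) -> bool:
        """True when the requested output cannot fit in the remaining window."""
        return self.max_tokens > self.headroom


# Assumed numbers for illustration only.
budget = TokenBudget(context_window=128_000, input_tokens=125_000, max_tokens=4_096)
print(budget.headroom)       # 3000
print(budget.will_truncate)  # True
```

Here the prompt leaves only 3,000 tokens of headroom, so a 4,096-token output request would be cut short before the model finishes.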
Official documentation
https://platform.openai.com/docs/guides/rate-limits/error-mitigation
Solutions
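-
Before streaming, count the total tokens with tiktoken (for example: `import tiktoken; enc = tiktoken.encoding_for_model('gpt-4'); tokens = enc.encode(prompt); prompt = enc.decode(tokens[:120000])`). Truncate the input to leave room for the output.
-
Reduce the output length by lowering max_tokens, and implement a loop that resumes generation from the last complete sentence when the response is truncated.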
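Both fixes can be sketched together as below. This is a rough outline under stated assumptions: `truncate_to_budget`, `trim_to_last_sentence`, and `generate_with_continuation` are hypothetical helpers, `call_model` stands in for whatever streaming client is in use and is assumed to return the generated text plus a finish reason, and `"length"` is assumed to be the value that signals truncation. The token list would come from a tokenizer such as tiktoken's `enc.encode(prompt)`.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical signature: takes the text so far, returns (new_text, finish_reason).
ModelCall = Callable[[str], Tuple[str, str]]


def truncate_to_budget(tokens: List[int], context_window: int,
                       reserve_for_output: int) -> List[int]:
    """Keep only as many prompt tokens as fit once output space is reserved."""
    limit = max(context_window - reserve_for_output, 0)
    return tokens[:limit]


def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing partial sentence; leave text unchanged if none ends."""
    ends = list(re.finditer(r"[.!?](?=\s|$)", text))
    return text[: ends[-1].end()] if ends else text


def generate_with_continuation(call_model: ModelCall, prompt: str,
                               max_rounds: int = 5) -> str:
    """Call the model repeatedly, resuming after each truncated response."""
    answer = ""
    for _ in range(max_rounds):
        chunk, finish_reason = call_model(prompt + answer)
        answer += chunk
        if finish_reason != "length":  # assumed truncation marker
            break
        # Cut back to the last complete sentence so the next round resumes cleanly.
        answer = trim_to_last_sentence(answer)
    return answer


# Demo with a stand-in model: the first reply is cut off, the second finishes.
_replies = iter([("First point. Second po", "length"), (" Second point.", "stop")])
result = generate_with_continuation(lambda text: next(_replies), "Q: ")
print(result)  # First point. Second point.
```

Note that `truncate_to_budget` keeps the start of the prompt. For chat histories it may make more sense to keep the most recent tokens instead, and the `max_rounds` cap is there so a response that never reaches a sentence boundary cannot loop forever.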
Failed attempts
Common but ineffective approaches:
-
85% failure rate
Increasing max_tokens in the request doesn't help because the total (input + output) exceeds the model's limit, and max_tokens only caps output.
-
95% failure rate
Retrying the same request with no changes will reproduce the error since the context is still too large.
-
90% failure rate
Switching to a different streaming library (e.g., from openai to httpx) doesn't solve the underlying token limit issue.