llm runtime_error ai_generated true

Error: context length exceeded while processing streaming chunks — partial response returned

ID: llm/context-window-exceeded-with-chunked-streaming

Fix Rate: 80%
Confidence: 85%
Evidence: 1
First Seen: 2024-03-15

Version Compatibility

Version                  Status   Introduced   Deprecated   Notes
openai==1.12.0           active
anthropic==0.25.0        active
langchain==0.1.12        active
gpt-4-turbo-2024-04-09   active
claude-3-opus-20240229   active

Root Cause

During streaming, cumulative input and output tokens exceed the model's context window, causing the API to truncate the response mid-stream without a clear error.
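Because the stream ends without raising an error, one way to surface the truncation is to inspect the finish reason reported on the final chunk. The following is a minimal sketch, assuming chunks shaped like OpenAI chat-completion stream chunks, where `choices[0].finish_reason` is `"length"` when generation stopped at a token limit. The helpers `collect_stream` and `is_truncated` and the dict-based chunks are hypothetical stand-ins for illustration, not part of any SDK.

```python
from typing import Iterable, Optional, Tuple


def collect_stream(chunks: Iterable[dict]) -> Tuple[str, Optional[str]]:
    """Concatenate streamed text and return it with the last finish reason seen."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        # The delta may be empty (no "content" key) on the final chunk.
        text = choice.get("delta", {}).get("content")
        if text:
            parts.append(text)
        if choice.get("finish_reason") is not None:
            finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


def is_truncated(finish_reason: Optional[str]) -> bool:
    """Treat a token-limit stop, or a stream that never reported a stop, as truncated."""
    return finish_reason in ("length", None)


# Simulated chunks (assumed shape) standing in for a real streaming response.
simulated = [
    {"choices": [{"delta": {"content": "The answer "}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "is partly"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "length"}]},
]
text, reason = collect_stream(simulated)
if is_truncated(reason):
    print(f"Partial response ({len(text)} chars), finish_reason={reason!r}")
```

Other providers report the same condition under different field names, so the check in `is_truncated` would need adapting to the response format in use.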

generic

Official Documentation

https://platform.openai.com/docs/guides/rate-limits/error-mitigation

Workarounds

  1. 85% success
    Before streaming, count the prompt's tokens with tiktoken, e.g. `enc = tiktoken.encoding_for_model('gpt-4')` followed by `tokens = enc.encode(prompt)`. If `len(tokens)` exceeds a budget such as 120000, truncate the prompt (for instance `prompt = enc.decode(tokens[:120000])`) so that the rest of the context window is left for the output.
  2. 75% success
    Lower max_tokens to shorten each response. If a response is still truncated, loop: cut the text back to the last complete sentence and resume generation from there.
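The two workarounds can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: `CONTEXT_WINDOW` and `RESERVED_OUTPUT` are placeholder numbers, the whitespace tokenizer is a dependency-free stand-in for a real one, and `generate` is a hypothetical callable that wraps the API call and returns `(text, finish_reason)`.

```python
from typing import Callable, List, Tuple

CONTEXT_WINDOW = 128_000  # assumption: look up the real window for your model
RESERVED_OUTPUT = 8_000   # assumption: leaves a 120000-token prompt budget


def whitespace_encode(text: str) -> List[str]:
    """Stand-in tokenizer so the sketch runs without dependencies."""
    return text.split()


def whitespace_decode(tokens: List[str]) -> str:
    return " ".join(tokens)


def fit_prompt(
    prompt: str,
    encode: Callable[[str], list],
    decode: Callable[[list], str],
    context_window: int = CONTEXT_WINDOW,
    reserved_output: int = RESERVED_OUTPUT,
) -> str:
    """Truncate prompt so that reserved_output tokens of the window stay free."""
    budget = context_window - reserved_output
    if budget <= 0:
        raise ValueError("reserved_output must be smaller than context_window")
    tokens = encode(prompt)
    if len(tokens) <= budget:
        return prompt
    # Keeps the start of the prompt; keeping the tail (tokens[-budget:])
    # may suit chat histories better.
    return decode(tokens[:budget])


def last_complete_sentence(text: str) -> str:
    """Cut text back to the last sentence-ending punctuation mark."""
    end = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
    return text[: end + 1] if end != -1 else ""


def generate_with_resume(
    generate: Callable[[str], Tuple[str, str]],
    prompt: str,
    max_rounds: int = 3,
) -> str:
    """Call generate repeatedly, resuming after each token-limit truncation."""
    answer = ""
    for _ in range(max_rounds):
        text, reason = generate(prompt + answer)
        answer += text
        if reason != "length":
            break
        # Truncated: drop the dangling partial sentence and resume from there.
        answer = last_complete_sentence(answer)
    return answer
```

With tiktoken installed, `enc = tiktoken.encoding_for_model("gpt-4")` followed by `fit_prompt(prompt, enc.encode, enc.decode)` would replace the whitespace stand-in. Note that each resume round sends the prompt plus the answer so far, so the input grows; re-checking the budget before every round keeps the loop from hitting the same limit again.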

Dead Ends

Common approaches that don't work:

  1. 85% fail

    Increasing max_tokens in the request doesn't help because the total (input + output) exceeds the model's limit, and max_tokens only caps output.

  2. 95% fail

    Retrying the same request with no changes will reproduce the error since the context is still too large.

  3. 90% fail

    Switching to a different streaming library (e.g., from openai to httpx) doesn't solve the underlying token limit issue.
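The first dead end comes down to simple arithmetic, which the sketch below makes explicit. It is a simplified model of the behaviour described here, with illustrative numbers and a hypothetical helper name (`effective_output_cap`); a provider may instead reject a request whose input plus max_tokens exceeds the window.

```python
def effective_output_cap(context_window: int, input_tokens: int, max_tokens: int) -> int:
    """Return how many output tokens can actually be produced.

    max_tokens only caps output; the space left after the input is the hard limit.
    """
    remaining = max(context_window - input_tokens, 0)
    return min(max_tokens, remaining)


# Illustrative numbers: a 128000-token window with a 126500-token prompt.
for requested in (1_000, 4_000, 16_000):
    cap = effective_output_cap(128_000, 126_500, requested)
    print(f"max_tokens={requested} -> at most {cap} output tokens")
```

Once the remaining space (1500 tokens in this example) is smaller than max_tokens, raising max_tokens further changes nothing; only shrinking the input frees room.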