llm data_error ai_generated partial

Warning: Input text truncated to 8192 tokens for embedding model 'text-embedding-3-small' — embedding quality may degrade

ID: llm/embedding-truncation-mismatch

80% fix rate

85% confidence

1 evidence item

First seen 2024-02-20

Version compatibility

Version	Status	Introduced	Deprecated	Notes
openai>=1.0.0	active	—	—	—
text-embedding-3-small	active	—	—	—
text-embedding-3-large	active	—	—	—
text-embedding-ada-002	active	—	—	—

Root cause analysis

Embedding models enforce a maximum input length in tokens (e.g., 8192 for text-embedding-3-small). Longer inputs are silently truncated, so the semantic information at the end of the text is lost.

generic

Official documentation

https://platform.openai.com/docs/guides/embeddings/embedding-models

Solutions

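Before sending text to the API, pre-truncate it to the model's token limit using the same tokenizer the model uses (e.g., tiktoken for OpenAI models), and log the truncation explicitly.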
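A minimal sketch of explicit pre-truncation, assuming the 8192-token limit quoted in the warning. The names `truncate_to_limit` and `tiktoken_codec` are hypothetical helpers, not part of any library. The tokenizer is passed in as a pair of callables so the truncation logic stays independent of tiktoken. The `cl100k_base` encoding is an assumption that should be checked against the model in use.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("embedding_input")

# Limit quoted in the warning message; verify it for the model you call.
MAX_TOKENS = 8192


def truncate_to_limit(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    max_tokens: int = MAX_TOKENS,
) -> Tuple[str, int]:
    """Return (possibly truncated text, number of tokens dropped)."""
    tokens = encode(text)
    dropped = max(0, len(tokens) - max_tokens)
    if dropped == 0:
        return text, 0
    # Record the truncation so it is never silent.
    logger.warning(
        "Input truncated from %d to %d tokens (%d dropped)",
        len(tokens), max_tokens, dropped,
    )
    return decode(tokens[:max_tokens]), dropped


def tiktoken_codec():
    """Hypothetical helper: build (encode, decode) from tiktoken.

    Assumes cl100k_base is the right encoding for your embedding model.
    """
    import tiktoken  # third-party: pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    return enc.encode, enc.decode
```

With tiktoken installed, a call would look like `encode, decode = tiktoken_codec()` followed by `safe_text, dropped = truncate_to_limit(text, encode, decode)`. A non-zero `dropped` is a signal to consider chunking instead of cutting the tail.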

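Use a sliding-window or chunking strategy: split long documents into overlapping chunks of at most max_tokens tokens, embed each chunk separately, and store all embeddings together with their metadata for retrieval.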
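The overlapping-chunk idea can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `chunk_tokens` is a hypothetical helper, the 256-token overlap is an arbitrary default, and the input is a token list already produced by the model's tokenizer. Each chunk carries its token offsets as metadata so a retrieved chunk can be traced back to its position in the source document.

```python
from typing import Dict, List, Sequence


def chunk_tokens(
    tokens: Sequence[int],
    max_tokens: int = 8192,
    overlap: int = 256,
) -> List[Dict]:
    """Split a token sequence into overlapping windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[Dict] = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(
            {
                "chunk_index": len(chunks),
                "start_token": start,  # inclusive offset in the source
                "end_token": end,      # exclusive offset in the source
                "tokens": list(tokens[start:end]),
            }
        )
        if end == len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks
```

Each chunk's `tokens` can be decoded back to text and embedded in its own request. The offsets, plus a document identifier of your choosing, would then be stored next to the vector.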

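For RAG pipelines, embed the semantically most important parts of the text first (for example, the opening and key sections) rather than relying on automatic truncation of the tail.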
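One way to make that prioritization explicit is to rank sections and fill a token budget in priority order. The sketch below is hypothetical throughout: `select_sections` and the priority numbers are illustrative, and how priorities get assigned (headings, position, a summarizer) is left to the pipeline. `count_tokens` stands in for a real tokenizer count such as `len(encode(text))`.

```python
from typing import Callable, List, Tuple


def select_sections(
    sections: List[Tuple[int, str]],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> str:
    """Pick the most important sections that fit within max_tokens.

    sections: (priority, text) pairs in document order; a lower priority
    number means more important. Kept sections are returned in their
    original order, joined by blank lines.
    """
    # Visit sections from most to least important; ties keep document order.
    by_priority = sorted(range(len(sections)), key=lambda i: (sections[i][0], i))

    kept = set()
    used = 0
    for i in by_priority:
        cost = count_tokens(sections[i][1])
        if used + cost <= max_tokens:
            kept.add(i)
            used += cost

    # Note: tokens for the "\n\n" separators are not counted here, so leave
    # a little headroom below the model's hard limit.
    return "\n\n".join(sections[i][1] for i in sorted(kept))
```

Compared with cutting at the limit, this drops the least important sections wherever they occur, so a key section near the end of the document can still be embedded.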

Failed attempts

Common but ineffective approaches:

95% failure rate
The embedding API does not accept a max_tokens parameter; truncation is automatic and controlled by model limits
70% failure rate
Averaging embeddings from different chunks loses positional and semantic relationships; not equivalent to a single embedding of the full text
85% failure rate
Truncated embeddings miss critical information from the end of the text, leading to poor retrieval quality in RAG systems