llm
data_error
ai_generated
true
openai.BadRequestError: vector length must be 1 for cosine similarity
ID: llm/embedding-vector-normalization-mismatch
Fix Rate: 80%
Confidence: 85%
Evidence: 1
First Seen: 2023-11-15
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| openai==1.3.0 | active | — | — | — |
| openai==1.12.0 | active | — | — | — |
| text-embedding-ada-002 | active | — | — | — |
| text-embedding-3-small | active | — | — | — |
| text-embedding-3-large | active | — | — | — |
Root Cause
OpenAI's embedding API returns unit-normalized vectors by default, but custom embedding models or manual preprocessing may produce unnormalized vectors. A backend that expects unit-length input, for example one that computes cosine similarity as a plain dot product, can then reject the vectors or return incorrect similarity scores.
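A quick way to confirm this root cause is to measure the L2 norm of the vectors before they are inserted or queried. The following is a minimal sketch, assuming NumPy is available; the helper name `is_unit_normalized` and the tolerance `tol=1e-3` are illustrative choices, not part of any OpenAI or vector-database API.

```python
import numpy as np

def is_unit_normalized(vectors, tol=1e-3):
    """Return a boolean array: True where a row's L2 norm is within tol of 1.

    Hypothetical helper for diagnosis; tol is an assumed tolerance.
    """
    arr = np.atleast_2d(np.asarray(vectors, dtype=np.float64))
    norms = np.linalg.norm(arr, axis=1)
    return np.abs(norms - 1.0) <= tol

unit = [0.6, 0.8]   # L2 norm is 1.0
raw = [3.0, 4.0]    # L2 norm is 5.0, so this one is unnormalized
flags = is_unit_normalized([unit, raw])
print(flags)
```

Any row flagged `False` is a candidate for the normalization mismatch described above.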
Official Documentation
https://platform.openai.com/docs/guides/embeddings/embedding-models
Workarounds
- 95% success: Normalize vectors manually before insertion or query: `vector = vector / np.linalg.norm(vector)`
- 90% success: Use OpenAI's default embeddings which are already normalized; avoid custom models or manual normalization unless necessary.
- 75% success: Configure the vector database to use inner product (dot product) distance instead of cosine similarity if supported. The option name varies by database, e.g. `dotproduct` in Pinecone or `dot` in Weaviate. Inner product matches cosine similarity only for unit-length vectors; otherwise vector magnitude also affects the ranking.
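The first workaround can be sketched as follows, assuming NumPy and vectors stored as rows of an array. The helper names `l2_normalize` and `cosine_similarity` and the `eps` guard are illustrative, not taken from any library. The zero-vector check is an added assumption: dividing by a zero norm would otherwise produce NaN values.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; reject zero-length rows.

    Hypothetical helper; eps is an assumed threshold for "zero".
    """
    arr = np.atleast_2d(np.asarray(vectors, dtype=np.float64))
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    if np.any(norms < eps):
        raise ValueError("cannot normalize a zero-length vector")
    return arr / norms

def cosine_similarity(a, b):
    """Plain cosine similarity, valid for vectors of any nonzero length."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

raw = np.array([[3.0, 4.0], [1.0, 0.0]])   # first row has norm 5.0
unit = l2_normalize(raw)

# After normalization, a plain dot product equals the cosine similarity
# of the original vectors.
dot = float(np.dot(unit[0], unit[1]))
cos = cosine_similarity(raw[0], raw[1])
print(dot, cos)
```

Apply the same normalization to both stored vectors and query vectors; normalizing only one side leaves the scores inconsistent.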
Dead Ends
Common approaches that don't work:
- 65% fail: Different embedding models produce vectors with different normalization properties; the root cause is not the model but the normalization step.
- 80% fail: Dimension is unrelated to normalization; padding introduces noise and doesn't fix the length constraint.
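The point about dimension can be checked directly. The sketch below assumes zero-padding with `np.pad` as the padding strategy (the entry does not say which padding is meant): the padded vector has more dimensions but exactly the same L2 norm, so an unnormalized vector stays unnormalized until it is divided by its norm.

```python
import numpy as np

raw = np.array([3.0, 4.0])        # L2 norm is 5.0, not 1.0
padded = np.pad(raw, (0, 2))      # zero-pad to 4 dimensions: [3, 4, 0, 0]

raw_norm = float(np.linalg.norm(raw))
padded_norm = float(np.linalg.norm(padded))
print(raw_norm, padded_norm)      # both norms are identical

# Only dividing by the norm brings the vector to unit length.
fixed = padded / padded_norm
fixed_norm = float(np.linalg.norm(fixed))
print(fixed_norm)
```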