llm data_error ai_generated true

openai.BadRequestError: vector length must be 1 for cosine similarity

ID: llm/embedding-vector-normalization-mismatch

Fix Rate: 80%
Confidence: 85%
Evidence: 1
First Seen: 2023-11-15

Version Compatibility

Version                  Status   Introduced   Deprecated   Notes
openai==1.3.0            active   -            -            -
openai==1.12.0           active   -            -            -
text-embedding-ada-002   active   -            -            -
text-embedding-3-small   active   -            -            -
text-embedding-3-large   active   -            -            -

Root Cause

OpenAI's embedding API returns unit-normalized vectors (L2 norm of 1) by default, but custom embedding models or manual preprocessing may produce unnormalized vectors. A vector store or similarity routine that requires or assumes unit-length input will then either reject those vectors or return incorrect similarity scores.
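
To confirm that unnormalized vectors are actually the problem, it can help to check the norms before anything is sent to the store. The following is a minimal sketch using NumPy; the helper name `is_unit_norm` and the tolerance `tol` are illustrative assumptions, not part of any OpenAI or vector-database API.

```python
import numpy as np

def is_unit_norm(vectors, tol=1e-3):
    """Return a boolean array: True where a row's L2 norm is within tol of 1.

    Hypothetical helper for diagnosis only; tol is an assumed tolerance.
    """
    arr = np.atleast_2d(np.asarray(vectors, dtype=np.float64))
    norms = np.linalg.norm(arr, axis=1)  # one L2 norm per row
    return np.abs(norms - 1.0) <= tol

# First row has norm 5 (unnormalized); second row has norm 1.
raw = np.array([[3.0, 4.0], [0.6, 0.8]])
flags = is_unit_norm(raw)
print(flags)
```

Any row flagged `False` is a candidate for the normalization step described under Workarounds.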

generic

Official Documentation

https://platform.openai.com/docs/guides/embeddings/embedding-models

Workarounds

  1. 95% success Normalize vectors manually before insertion or query: `vector = vector / np.linalg.norm(vector)`. Apply the same step to both stored vectors and query vectors.
  2. 90% success Use OpenAI's embeddings as returned, since they are already unit-normalized; avoid custom models or manual post-processing that changes vector length unless necessary.
  3. 75% success Configure the vector database to use inner product (dot product) distance instead of cosine similarity, if supported. The setting name varies by store: for example, Pinecone names this metric `dotproduct` and Weaviate names the distance `dot`, while some other stores call it `ip`. Inner product on unnormalized vectors is sensitive to magnitude, so rankings may differ from cosine rankings.
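
The manual normalization in workaround 1 can be sketched a little more defensively. The one-liner yields NaN values for an all-zero NumPy vector, so this version raises instead. The function name `l2_normalize` and the `eps` threshold are assumptions made for illustration; the example also shows that, once both vectors are unit length, the inner product equals the cosine similarity of the originals.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale a vector (or each row of a batch) to unit L2 norm.

    Hypothetical helper; eps is an assumed threshold for "zero length".
    """
    arr = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(arr, axis=-1, keepdims=True)
    if np.any(norms < eps):
        raise ValueError("cannot normalize a zero-length vector")
    return arr / norms

x = np.array([3.0, 4.0])
y = np.array([1.0, 0.0])

# Cosine similarity computed directly on the unnormalized vectors.
cosine = float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Inner product of the normalized vectors gives the same value.
inner = float(np.dot(l2_normalize(x), l2_normalize(y)))

print(cosine, inner)
```

The same function accepts a 2-D batch, normalizing each row independently, which is convenient when preparing many vectors for insertion.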

Dead Ends

Common approaches that don't work:

  1. 65% fail

    Switching to a different embedding model does not address the problem. Different models produce vectors with different normalization properties; the root cause is not the model but the missing normalization step.

  2. 80% fail

    Padding vectors to a different dimension does not help. Dimension is unrelated to normalization; padding introduces noise and does not fix the length constraint.
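
A small check makes the padding dead end concrete. This sketch assumes zero-padding with NumPy's `np.pad`; the vector values are made up for illustration. The dimension changes, but the L2 norm does not, so a unit-length requirement is still violated.

```python
import numpy as np

v = np.array([3.0, 4.0])        # L2 norm is 5, so not unit length
padded = np.pad(v, (0, 3))      # append three zeros: dimension 2 -> 5

norm_before = float(np.linalg.norm(v))
norm_after = float(np.linalg.norm(padded))

# The shape changes; the norm stays the same.
print(padded.shape, norm_before, norm_after)
```

Padding with non-zero values would change the norm, but only by adding components that carry no information about the original text.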