Replies: 2 comments 14 replies
-
Hello, @silwork99! I'm here to help you with your issue. Let's figure this out together! To get the similarity scores for each document in your dataset using Chroma, Faiss, or Pinecone, you can use the respective methods that return documents along with their similarity scores. Here are examples for each:

**Using Chroma**

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Initialize embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma("langchain_store", embeddings)

# Define your query embedding
query_embedding = [0.1, 0.2, 0.3, 0.4]  # Example embedding

# Get similarity scores for each document
results = vectorstore.similarity_search_by_vector_with_relevance_scores(
    embedding=query_embedding,
    k=2500,  # Number of results to return
)

# Print the results
for doc, score in results:
    print(f"Document: {doc.page_content}, Score: {score}")
```

**Using Faiss**

```python
from langchain_community.embeddings.openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Initialize embeddings and vector store
embeddings = OpenAIEmbeddings()
texts = ["Document 1 content", "Document 2 content", "Document 3 content"]
faiss_store = FAISS.from_texts(texts, embeddings)

# Define your query embedding
query_embedding = [0.1, 0.2, 0.3, 0.4]  # Example embedding

# Get similarity scores for each document
results = faiss_store.similarity_search_with_score_by_vector(
    embedding=query_embedding,
    k=2500,  # Number of results to return
)

# Print the results
for doc, score in results:
    print(f"Document: {doc.page_content}, Score: {score}")
```

**Using Pinecone**

```python
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings

# Initialize embeddings and vector store
embeddings = OpenAIEmbeddings()
index_name = "my-index"
namespace = "my-namespace"
vectorstore = PineconeVectorStore(
    index_name=index_name,
    embedding=embeddings,
    namespace=namespace,
)

# Define your query embedding
query_embedding = [0.1, 0.2, 0.3, 0.4]  # Example embedding

# Get similarity scores for each document
results = vectorstore.similarity_search_by_vector_with_score(
    embedding=query_embedding,
    k=2500,  # Number of results to return
)

# Print the results
for doc, score in results:
    print(f"Document: {doc.page_content}, Score: {score}")
```

These methods will return the documents along with their similarity scores, allowing you to see the relevance of each document in your dataset [1][2][3].
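As a complement to the library-specific calls above, it may help to see what these methods compute under the hood: a similarity search scores the query embedding against every stored embedding and then truncates to the top `k`, so asking for `k` equal to the collection size yields a score for every document. A minimal, self-contained sketch (toy embeddings and document ids invented for illustration, cosine similarity as the metric, no vector store involved):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy corpus: document id -> (text, embedding). These embeddings are
# made up for illustration; a real pipeline would use an embedding model.
corpus = {
    "doc1": ("Document 1 content", [0.1, 0.2, 0.3, 0.4]),
    "doc2": ("Document 2 content", [0.4, 0.3, 0.2, 0.1]),
    "doc3": ("Document 3 content", [0.1, 0.2, 0.3, 0.5]),
}

query_embedding = [0.1, 0.2, 0.3, 0.4]

# Score *every* document, then sort by similarity (highest first).
scores = sorted(
    ((doc_id, cosine_similarity(query_embedding, emb))
     for doc_id, (_, emb) in corpus.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for doc_id, score in scores:
    print(f"{doc_id}: {score:.4f}")
```

With a real store, the collection size usually doesn't have to be hard-coded: FAISS exposes it as `faiss_store.index.ntotal`, which can be passed as `k` to cover the whole index.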
-
@dosu
-
Checked other resources
Commit to Help
Example Code
Description
Hello, I have built a vector store of more than 2500 documents and I want to calculate the similarity score of a query with all the documents in the data, not only the most relevant ones.
I have used Chroma, Faiss, and Pinecone, and all of them return only 4 or 5 scores at most, for the most relevant data.
I have tried to set k = 2500+, which is the size of the data, but nothing has changed.
What should I do to get the scores of each document in the data?
Thank you.
System Info
```shell
python -m langchain_core.sys_info
```