
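How to add scores to retriever results

Retrievers return sequences of Document objects, which by default include no information about the process that retrieved them (e.g., a similarity score against a query). Here we demonstrate how to add retrieval scores to the .metadata of documents:

  1. From vector store retrievers;
  2. From higher-order LangChain retrievers, such as SelfQueryRetriever or MultiVectorRetriever.

For (1), we will implement a short wrapper function around the corresponding vector store. For (2), we will override a method of the corresponding class.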

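Create vector store

First we populate a vector store with some data. We will use a PineconeVectorStore, but this guide is compatible with any LangChain vector store that implements a .similarity_search_with_score method.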

from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

vectorstore = PineconeVectorStore.from_documents(
    docs, index_name="sample", embedding=OpenAIEmbeddings()
)

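Retriever

To obtain scores from a vector store retriever, we wrap the underlying vector store's .similarity_search_with_score method in a short function that packages each score into the associated document's metadata.

We add a @chain decorator to the function to create a Runnable that can be used similarly to a typical retriever.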

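from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    # zip(*...) transposes the (document, score) pairs into two tuples,
    # so this assumes the search returns at least one result.
    docs, scores = zip(*vectorstore.similarity_search_with_score(query))
    for doc, score in zip(docs, scores):
        # Record the score on the document itself
        doc.metadata["score"] = score

    # docs is a tuple here, despite the List annotation
    return docs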
result = retriever.invoke("dinosaur")
result
(Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127}),
Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0, 'score': 0.792038262}),
Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979.0, 'score': 0.751571238}),
Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0, 'score': 0.747471571}))

Note that similarity scores from the retrieval step are included in the metadata of the above documents.
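One caveat of the wrapper above is that unpacking zip(*...) into two names raises a ValueError when the search returns no results. The following is a minimal sketch of a variant that iterates over the (document, score) pairs directly and always returns a list. The Doc and FakeVectorStore classes are hypothetical stand-ins so the example runs without Pinecone or OpenAI credentials; they are not LangChain classes, and the scores are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class FakeVectorStore:
    """Hypothetical stand-in exposing similarity_search_with_score."""

    def __init__(self, scored: List[Tuple[Doc, float]]):
        self._scored = scored

    def similarity_search_with_score(self, query: str) -> List[Tuple[Doc, float]]:
        # A real store would embed the query; this one returns canned pairs.
        return list(self._scored)


def retrieve_with_scores(store: FakeVectorStore, query: str) -> List[Doc]:
    """Attach each score to its document; safe when there are no results."""
    docs = []
    for doc, score in store.similarity_search_with_score(query):
        doc.metadata["score"] = score
        docs.append(doc)
    return docs


store = FakeVectorStore([(Doc("dinosaurs"), 0.84), (Doc("toys"), 0.79)])
hits = retrieve_with_scores(store, "dinosaur")
empty = retrieve_with_scores(FakeVectorStore([]), "dinosaur")
```

With a real vector store, the same loop body could replace the zip-based lines inside the @chain function.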

SelfQueryRetriever

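SelfQueryRetriever uses an LLM to generate a query that is potentially structured. For example, it can construct filters for retrieval on top of the usual semantic-similarity driven selection. See this guide for more detail.

SelfQueryRetriever includes a short (1 - 2 line) method _get_docs_with_query that executes the vector store search. We can subclass SelfQueryRetriever and override this method to propagate similarity scores.

First, following the how-to guide, we need to establish some metadata on which to filter: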

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"
llm = ChatOpenAI(temperature=0)

We then override _get_docs_with_query to use the similarity_search_with_score method of the underlying vector store:

from typing import Any, Dict


class CustomSelfQueryRetriever(SelfQueryRetriever):
    def _get_docs_with_query(
        self, query: str, search_kwargs: Dict[str, Any]
    ) -> List[Document]:
        """Get docs, adding score information."""
        docs, scores = zip(
            *self.vectorstore.similarity_search_with_score(query, **search_kwargs)
        )
        for doc, score in zip(docs, scores):
            doc.metadata["score"] = score

        return docs

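Invoking this retriever will now include similarity scores in the document metadata. Note that the underlying structured-query capabilities of SelfQueryRetriever are retained.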

retriever = CustomSelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
)


result = retriever.invoke("dinosaur movie with rating less than 8")
result
(Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127}),)
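The reason filtering survives the override is that only the search hook changes, while query construction stays in the base class. The sketch below illustrates that pattern under loud assumptions: ToyStore, ToyRetriever, and ScoredToyRetriever are hypothetical stand-ins written for this example, the max_rating keyword is an invented filter, and the scores are canned. None of this mirrors the actual SelfQueryRetriever internals beyond the method name _get_docs_with_query.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for Document: {"page_content": ..., "metadata": {...}}
Doc = Dict[str, Any]


class ToyStore:
    """Hypothetical in-memory store; scores are canned, not computed."""

    def __init__(self, scored_docs: List[Tuple[Doc, float]]):
        self._scored_docs = scored_docs

    def similarity_search_with_score(
        self, query: str, max_rating: float = 10.0
    ) -> List[Tuple[Doc, float]]:
        # Apply the (invented) rating filter, keeping the paired scores.
        return [
            (doc, score)
            for doc, score in self._scored_docs
            if doc["metadata"]["rating"] <= max_rating
        ]

    def similarity_search(self, query: str, **kwargs: Any) -> List[Doc]:
        return [doc for doc, _ in self.similarity_search_with_score(query, **kwargs)]


class ToyRetriever:
    """Stand-in for a retriever with a single overridable search hook."""

    def __init__(self, store: ToyStore):
        self.store = store

    def _get_docs_with_query(
        self, query: str, search_kwargs: Dict[str, Any]
    ) -> List[Doc]:
        return self.store.similarity_search(query, **search_kwargs)

    def invoke(self, query: str, max_rating: float = 10.0) -> List[Doc]:
        # Filter construction lives here and is inherited unchanged.
        return self._get_docs_with_query(query, {"max_rating": max_rating})


class ScoredToyRetriever(ToyRetriever):
    def _get_docs_with_query(
        self, query: str, search_kwargs: Dict[str, Any]
    ) -> List[Doc]:
        docs = []
        for doc, score in self.store.similarity_search_with_score(
            query, **search_kwargs
        ):
            doc["metadata"]["score"] = score
            docs.append(doc)
        return docs


store = ToyStore(
    [
        ({"page_content": "dinosaurs", "metadata": {"rating": 7.7}}, 0.84),
        ({"page_content": "the Zone", "metadata": {"rating": 9.9}}, 0.75),
    ]
)
hits = ScoredToyRetriever(store).invoke("dinosaur", max_rating=8.0)
```

The subclass touches only the search hook, so the filter passed through search_kwargs still applies and the surviving document carries its score.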

MultiVectorRetriever

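MultiVectorRetriever allows you to associate multiple vectors with a single document. This can be useful in a number of applications. For example, we can index small chunks of a larger document and run the retrieval on the chunks, but return the larger "parent" document when invoking the retriever. ParentDocumentRetriever, a subclass of MultiVectorRetriever, includes convenience methods for populating a vector store to support this. Further applications are detailed in this how-to guide.

To propagate similarity scores through this retriever, we can again subclass MultiVectorRetriever and override a method. This time we will override _get_relevant_documents.

First, we prepare some fake data. We generate fake "whole documents" and store them in a document store; here we will use a simple InMemoryStore.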

from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# The storage layer for the parent documents
docstore = InMemoryStore()
fake_whole_documents = [
    ("fake_id_1", Document(page_content="fake whole document 1")),
    ("fake_id_2", Document(page_content="fake whole document 2")),
]
docstore.mset(fake_whole_documents)

Next we add some fake "sub-documents" to our vector store. We can link these sub-documents to the parent documents by populating the "doc_id" key in their metadata.

docs = [
    Document(
        page_content="A snippet from a larger document discussing cats.",
        metadata={"doc_id": "fake_id_1"},
    ),
    Document(
        page_content="A snippet from a larger document discussing discourse.",
        metadata={"doc_id": "fake_id_1"},
    ),
    Document(
        page_content="A snippet from a larger document discussing chocolate.",
        metadata={"doc_id": "fake_id_2"},
    ),
]

vectorstore.add_documents(docs)
['62a85353-41ff-4346-bff7-be6c8ec2ed89',
'5d4a0e83-4cc5-40f1-bc73-ed9cbad0ee15',
'8c1d9a56-120f-45e4-ba70-a19cd19a38f4']
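The sub-documents above are written by hand. In practice they would usually be derived from the parent text. The following is a rough sketch of that linking step, assuming naive fixed-width chunking in place of a real text splitter and plain dicts in place of Document objects. The make_sub_docs helper and its chunk_size parameter are hypothetical names introduced only for this illustration.

```python
from typing import Any, Dict, List


def make_sub_docs(doc_id: str, text: str, chunk_size: int = 20) -> List[Dict[str, Any]]:
    """Split text into fixed-width chunks, tagging each with its parent's doc_id."""
    chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [
        # Each chunk carries the parent's id so it can be traced back later
        {"page_content": chunk, "metadata": {"doc_id": doc_id}}
        for chunk in chunks
    ]


parent_text = "Cats sleep most of the day and hunt at night."
sub_docs = make_sub_docs("fake_id_1", parent_text)
```

Every chunk shares the same "doc_id" value, which is what lets a retriever map a matching chunk back to its parent in the document store.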

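To propagate the scores, we subclass MultiVectorRetriever and override its _get_relevant_documents method. Here we make two changes:

  1. We add similarity scores to the metadata of the corresponding "sub-documents" using the similarity_search_with_score method of the underlying vector store, as above;
  2. We include a list of these sub-documents in the metadata of the retrieved parent document. This surfaces the snippets of text identified by the retrieval, together with their corresponding similarity scores.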
from collections import defaultdict

from langchain.retrievers import MultiVectorRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun


class CustomMultiVectorRetriever(MultiVectorRetriever):
    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Get documents relevant to a query.
        Args:
            query: String to find relevant documents for
            run_manager: The callbacks handler to use
        Returns:
            List of relevant documents
        """
        results = self.vectorstore.similarity_search_with_score(
            query, **self.search_kwargs
        )

        # Map doc_ids to list of sub-documents, adding scores to metadata
        id_to_doc = defaultdict(list)
        for doc, score in results:
            doc_id = doc.metadata.get("doc_id")
            if doc_id:
                doc.metadata["score"] = score
                id_to_doc[doc_id].append(doc)

        # Fetch documents corresponding to doc_ids, retaining sub_docs in metadata
        docs = []
        for _id, sub_docs in id_to_doc.items():
            docstore_docs = self.docstore.mget([_id])
            if docstore_docs:
                if doc := docstore_docs[0]:
                    doc.metadata["sub_docs"] = sub_docs
                    docs.append(doc)

        return docs

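Invoking this retriever, we can see that it identifies the correct parent document, including the relevant snippet from the sub-document along with its similarity score.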

retriever = CustomMultiVectorRetriever(vectorstore=vectorstore, docstore=docstore)

retriever.invoke("cat")
[Document(page_content='fake whole document 1', metadata={'sub_docs': [Document(page_content='A snippet from a larger document discussing cats.', metadata={'doc_id': 'fake_id_1', 'score': 0.831276655})]})]
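The grouping logic inside _get_relevant_documents can be exercised without a vector store. Below is a minimal sketch of those two steps under stated assumptions: Doc is a hypothetical stand-in for Document, a plain dict plays the role of the document store, and the (document, score) pairs and their scores are invented for illustration rather than produced by a real search.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def group_by_parent(
    results: List[Tuple[Doc, float]], parents: Dict[str, Doc]
) -> List[Doc]:
    # Step 1: bucket scored sub-documents by their parent's doc_id
    id_to_sub_docs = defaultdict(list)
    for doc, score in results:
        doc_id = doc.metadata.get("doc_id")
        if doc_id:  # results without a doc_id have no parent and are skipped
            doc.metadata["score"] = score
            id_to_sub_docs[doc_id].append(doc)

    # Step 2: look up each parent and attach its matching sub-documents
    grouped = []
    for doc_id, sub_docs in id_to_sub_docs.items():
        parent = parents.get(doc_id)
        if parent is not None:
            parent.metadata["sub_docs"] = sub_docs
            grouped.append(parent)
    return grouped


parents = {
    "fake_id_1": Doc("fake whole document 1"),
    "fake_id_2": Doc("fake whole document 2"),
}
results = [
    (Doc("snippet about cats", {"doc_id": "fake_id_1"}), 0.83),
    (Doc("snippet about chocolate", {"doc_id": "fake_id_2"}), 0.71),
    (Doc("snippet about discourse", {"doc_id": "fake_id_1"}), 0.70),
    (Doc("unrelated movie summary"), 0.65),
]
grouped = group_by_parent(results, parents)
```

Parents come back in the order their first sub-document appears in the results, and a result with no "doc_id" (such as the movie documents added earlier to the same index) is dropped rather than returned.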
