
How to retrieve using multiple vectors per document

It can often be useful to store multiple vectors per document. There are multiple use cases where this is beneficial. For example, we can embed multiple chunks of a document and associate those embeddings with the parent document, allowing retriever hits on the chunks to return the larger document.

LangChain implements a base MultiVectorRetriever, which simplifies this process. Much of the complexity lies in how to create the multiple vectors per document. This notebook covers some of the common ways to create those vectors and use the MultiVectorRetriever.

The methods to create multiple vectors per document include:

  • Smaller chunks: split a document into smaller chunks, and embed those (this is what ParentDocumentRetriever does).
  • Summary: create a summary for each document, and embed that along with (or instead of) the document.
  • Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document.

Note that this also enables another method of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control.
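For instance, once a retriever has been assembled (as in the sections below), a hand-authored query can be embedded and linked to a parent document. This is a minimal sketch, assuming a retriever and a parent-document identifier doc_id configured as in the rest of this guide; the query string is illustrative:

from langchain_core.documents import Document

# Hypothetical example: manually add a query that should route to a given parent document
manual_query = Document(
    page_content="What did the president say about Justice Breyer?",
    metadata={"doc_id": doc_id},  # ties this embedding to the parent document
)
retriever.vectorstore.add_documents([manual_query])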

Below we walk through an example. First we instantiate some documents. We will index them in an (in-memory) Chroma vector store using OpenAI embeddings, but any LangChain vector store or embeddings model will suffice.

%pip install --upgrade --quiet  langchain-chroma langchain langchain-openai > /dev/null
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [
    TextLoader("paul_graham_essay.txt"),
    TextLoader("state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

Smaller chunks

It can often be useful to retrieve larger chunks of information, but embed smaller chunks. This allows the embeddings to capture the semantic meaning as closely as possible, while passing as much context as possible downstream. Note that this is what ParentDocumentRetriever does; here we show what is going on under the hood.
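For reference, a minimal sketch of that bundled equivalent, reusing the vectorstore and docs defined above (the fresh InMemoryStore for parents is an assumption of this sketch):

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# ParentDocumentRetriever bundles splitting, embedding, and parent storage into one call
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),
)
parent_retriever.add_documents(docs)  # splits docs, indexes chunks, stores parents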

We will distinguish between the vector store, which indexes embeddings of the (sub) documents, and the document store, which houses the "parent" documents and associates them with an identifier.

import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]

Next, we generate the "child" documents by splitting the original documents. Note that we store the document identifier in the metadata of the corresponding Document object.

# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

Finally, we index the documents in our vector store and document store:

retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

The vector store alone will retrieve small chunks:

retriever.vectorstore.similarity_search("justice breyer")[0]
Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '064eca46-a4c4-4789-8e3b-583f9597e54f', 'source': 'state_of_the_union.txt'})

Whereas the retriever will return the larger parent document:

len(retriever.invoke("justice breyer")[0].page_content)
9875

The default search type the retriever performs on the vector database is a similarity search. LangChain vector stores also support searching via Max Marginal Relevance, which can be controlled via the search_type parameter of the retriever:

from langchain.retrievers.multi_vector import SearchType

retriever.search_type = SearchType.mmr

len(retriever.invoke("justice breyer")[0].page_content)
9875
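The number of sub-documents fetched from the underlying vector store can likewise be tuned through the retriever's search_kwargs; a brief sketch:

# Sketch: fetch only the single closest sub-document per query
retriever.search_kwargs = {"k": 1}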

Associating summaries with a document for retrieval

A summary may be able to distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries, and then embed those.

We construct a simple chain that will receive as input a Document object and generate a summary using an LLM.

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)

Note that we can batch the chain across documents:

summaries = chain.batch(docs, {"max_concurrency": 5})

We can then initialize the MultiVectorRetriever as before, indexing the summaries in our vector store and retaining the original documents in our document store:

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
# # We can also add the original chunks to the vectorstore if we so want
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)

Querying the vector store will return summaries:

sub_docs = retriever.vectorstore.similarity_search("justice breyer")

sub_docs[0]
Document(page_content="President Biden recently nominated Judge Ketanji Brown Jackson to serve on the United States Supreme Court, emphasizing her qualifications and broad support. The President also outlined a plan to secure the border, fix the immigration system, protect women's rights, support LGBTQ+ Americans, and advance mental health services. He highlighted the importance of bipartisan unity in passing legislation, such as the Violence Against Women Act. The President also addressed supporting veterans, particularly those impacted by exposure to burn pits, and announced plans to expand benefits for veterans with respiratory cancers. Additionally, he proposed a plan to end cancer as we know it through the Cancer Moonshot initiative. President Biden expressed optimism about the future of America and emphasized the strength of the American people in overcoming challenges.", metadata={'doc_id': '84015b1b-980e-400a-94d8-cf95d7e079bd'})

Whereas the retriever will return the larger source document:

retrieved_docs = retriever.invoke("justice breyer")

len(retrieved_docs[0].page_content)
9194

Hypothetical queries

An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document, and which might bear close semantic similarity to relevant queries in a RAG application. These questions can then be embedded and associated with the documents to improve retrieval.

Below, we use the with_structured_output method to structure the LLM output as a list of strings.

from typing import List

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class HypotheticalQuestions(BaseModel):
    """Generate hypothetical questions."""

    questions: List[str] = Field(..., description="List of questions")


chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4o").with_structured_output(
        HypotheticalQuestions
    )
    | (lambda x: x.questions)
)

Invoking the chain on a single document demonstrates that it outputs a list of questions:

chain.invoke(docs[0])
["What impact did the IBM 1401 have on the author's early programming experiences?",
"How did the transition from using the IBM 1401 to microcomputers influence the author's programming journey?",
"What role did Lisp play in shaping the author's understanding and approach to AI?"]

We can batch the chain over all documents and assemble our vector store and document store as before:

# Batch chain over documents to generate hypothetical questions
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})


# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]


# Generate Document objects from hypothetical questions
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )


retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

Note that querying the underlying vector store will retrieve hypothetical questions that are semantically similar to the input query:

sub_docs = retriever.vectorstore.similarity_search("justice breyer")

sub_docs
[Document(page_content='What might be the potential benefits of nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to the United States Supreme Court?', metadata={'doc_id': '43292b74-d1b8-4200-8a8b-ea0cb57fbcdb'}),
Document(page_content='How might the Bipartisan Infrastructure Law impact the economic competition between the U.S. and China?', metadata={'doc_id': '66174780-d00c-4166-9791-f0069846e734'}),
Document(page_content='What factors led to the creation of Y Combinator?', metadata={'doc_id': '72003c4e-4cc9-4f09-a787-0b541a65b38c'}),
Document(page_content='How did the ability to publish essays online change the landscape for writers and thinkers?', metadata={'doc_id': 'e8d2c648-f245-4bcc-b8d3-14e64a164b64'})]

And invoking the retriever will return the corresponding document:

retrieved_docs = retriever.invoke("justice breyer")
len(retrieved_docs[0].page_content)
9194
