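Build a semantic search engine
This tutorial will familiarize you with LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieval of data from (vector) databases and other sources for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation, or RAG (see our RAG tutorial).
Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.
Concepts
This guide focuses on retrieval of text data. We will cover the following concepts:
- Documents and document loaders;
- Text splitters;
- Embeddings;
- Vector stores and retrievers.
Setup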
Jupyter Notebook
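This and other tutorials are perhaps most conveniently run in a Jupyter notebook. See the Jupyter installation instructions for how to install it.
Installation
This tutorial requires the langchain-community and pypdf packages:
- Pip: pip install langchain-community pypdf
- Conda: conda install langchain-community pypdf -c conda-forge
For more details, see our installation guide.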
LangSmith
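Many of the applications you build with LangChain will contain multiple steps with multiple invocations of LLM calls. As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith.
After you sign up for LangSmith, make sure to set your environment variables to start logging traces: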
export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."
Or, if in a notebook, you can set them with:
import getpass
import os
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
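Documents and document loaders
LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:
- `page_content`: a string representing the content;
- `metadata`: a dict containing arbitrary metadata;
- `id`: (optional) a string identifier for the document.
The `metadata` attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual `Document` object often represents a chunk of a larger document.
We can generate sample documents when desired: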
from langchain_core.documents import Document
documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
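However, the LangChain ecosystem implements document loaders that integrate with hundreds of common sources. This makes it easy to incorporate data from these sources into your AI application.
Loading documents
Let's load a PDF into a sequence of `Document` objects. There is a sample PDF in the LangChain repo: a 10-K filing for Nike from 2023. We can consult the LangChain documentation for available PDF document loaders. Let's select PyPDFLoader, which is fairly lightweight.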
from langchain_community.document_loaders import PyPDFLoader
file_path = "../example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()
print(len(docs))
107
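See the PDF document loader guide for more detail.
`PyPDFLoader` loads one `Document` object per PDF page. For each, we can easily access:
- The string content of the page;
- Metadata containing the file name and page number.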
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FO
{'source': '../example_data/nke-10k-2023.pdf', 'page': 0}
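Splitting
For both information retrieval and downstream question answering, a page may be too coarse a representation. Our end goal is to retrieve `Document` objects that answer an input query, and further splitting our PDF will help ensure that the meaning of relevant portions of the document is not "washed out" by surrounding text.
We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which recursively splits the document using common separators such as new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.
We set `add_start_index=True` so that the character index at which each split `Document` starts within the initial `Document` is preserved as the metadata attribute "start_index".
See the guide on working with PDFs for more detail, including how to extract text from specific sections and images.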
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
len(all_splits)
514
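To make the roles of chunk size, overlap, and start index concrete, here is a minimal standard-library sketch. It is a deliberately naive fixed-width splitter, not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators); the function name `split_with_overlap` and the sample string are assumptions made for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Naive fixed-width splitter (illustrative only).

    Each chunk starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        # Record where the chunk begins, mirroring the idea of "start_index".
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # hypothetical input
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

With a chunk size of 10 and an overlap of 4, each chunk begins 6 characters after the previous one, and the last 4 characters of one chunk reappear at the start of the next.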
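Embeddings
Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.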
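As a rough illustration of the similarity metric mentioned above, the following sketch computes cosine similarity with the standard library and ranks a few vectors against a query. The three-dimensional vectors and the names `doc_a`, `doc_b`, and `doc_c` are made up for the example; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for embedded texts.
query = [1.0, 0.0, 1.0]
texts = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 3.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank the texts from most to least similar to the query.
ranked = sorted(texts, key=lambda name: cosine_similarity(query, texts[name]), reverse=True)
print(ranked)
```

Note that cosine similarity depends only on direction, so `doc_a` ranks first even though its magnitude differs from the query's.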
LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let's select a model:
pip install -qU langchain-openai
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)
assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])
Generated vectors of length 1536
[-0.008586574345827103, -0.03341241180896759, -0.008936782367527485, -0.0036674530711025, 0.010564599186182022, 0.009598285891115665, -0.028587326407432556, -0.015824200585484505, 0.0030416189692914486, -0.012899317778646946]
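Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.
Vector stores
LangChain VectorStore objects contain methods for adding text and `Document` objects to the store, and for querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.
LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in memory for lightweight workloads. Let's select a vector store: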
pip install -qU langchain-core
from langchain_core.vectorstores import InMemoryVectorStore
vector_store = InMemoryVectorStore(embeddings)
Having instantiated our vector store, we can now index the documents.
ids = vector_store.add_documents(documents=all_splits)
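Note that most vector store implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information. See the documentation for a specific integration for more detail.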
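The add-then-query pattern can be sketched without any external service. The toy store below is an assumption-laden illustration, not LangChain's `InMemoryVectorStore`: `ToyVectorStore`, `toy_embed`, and the tiny `VOCAB` list are hypothetical names, and the "embedding" is just a count of vocabulary prefixes rather than a learned model.

```python
import math

# Hypothetical vocabulary for a toy "embedding": one dimension per term.
VOCAB = ["dog", "cat", "loyal", "independent", "pet"]


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model: counts words starting with each term."""
    words = text.lower().split()
    return [float(sum(1 for w in words if w.startswith(term))) for term in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and searches by brute force."""

    def __init__(self, embed):
        self._embed = embed  # the store is initialized with an embedding function
        self._rows = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self._rows.append((text, self._embed(text)))

    def similarity_search(self, query: str, k: int = 1) -> list[str]:
        query_vec = self._embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vec, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
])
top = store.similarity_search("Which pet is independent?", k=1)
print(top[0])
```

A real vector store replaces the brute-force scan with an index suited to large collections, but the interface shape (embed on add, embed on query, rank by similarity) is the same idea.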
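Once we have instantiated a `VectorStore` that contains documents, we can query it. `VectorStore` includes methods for querying:
- Synchronously and asynchronously;
- By string query and by vector;
- With and without returning similarity scores;
- By similarity and by maximum marginal relevance (to balance similarity to the query against diversity in the retrieved results).
These methods will generally include a list of Document objects in their outputs.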
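To show what maximum marginal relevance trades off, here is a small greedy sketch in plain Python. It is one common formulation of the technique, written under assumptions: the function name `mmr`, the `lambda_mult` weighting parameter as used here, and the two-dimensional toy vectors are illustrative, and a particular vector store may implement the details differently.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy maximal marginal relevance; returns indices of selected documents.

    Each step picks the candidate maximizing
    lambda_mult * relevance - (1 - lambda_mult) * redundancy,
    where redundancy is the highest similarity to anything already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical vectors: documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

plain = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:2]
diverse = mmr(query, docs, k=2, lambda_mult=0.3)
print(plain, diverse)
```

Plain similarity ranking returns the two near-duplicates, while the MMR selection keeps the best match and then prefers the less redundant third document.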
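Usage
Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific keywords used in the document.
Return documents based on similarity to a string query: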
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)
print(results[0])
page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213
NIKE Brand in-line stores (including employee-only stores) 74
Converse stores (including factory stores) 82
TOTAL 369
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}
Async query:
results = await vector_store.asimilarity_search("When was Nike incorporated?")
print(results[0])
page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
Return scores:
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.
results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
Score: 0.23699893057346344
page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metadata={'page': 35, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
Return documents based on similarity to an embedded query:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")
results = vector_store.similarity_search_by_vector(embedding)
print(results[0])
page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
•Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
•Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
•Unfavorable changes in net foreign currency exchange rates, including hedges; and
•Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:' metadata={'page': 36, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
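Retrievers
LangChain `VectorStore` objects do not subclass Runnable. LangChain Retrievers are `Runnables`, so they implement a standard set of methods (e.g., synchronous and asynchronous `invoke` and `batch` operations). Although we can construct retrievers from vector stores, retrievers can also interface with non-vector-store sources of data (such as external APIs).
We can create a simple version of this ourselves, without subclassing `Retriever`. Once we choose the method we wish to use to retrieve documents, we can easily create a runnable. Below we will build one around the `similarity_search` method: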
from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import chain
# The @chain decorator turns the function into a Runnable.
@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
[Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]
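To show why a shared `invoke`/`batch` interface is convenient, here is a minimal standard-library sketch of a wrapper that gives any function those two methods. It is not LangChain's `Runnable` implementation; the class name `SimpleRunnable`, the substring-matching lookup, and the two-sentence corpus (paraphrased from the outputs above) are assumptions made for illustration.

```python
from typing import Callable, Generic, List, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


class SimpleRunnable(Generic[In, Out]):
    """Hypothetical wrapper exposing invoke() and batch() over a plain function."""

    def __init__(self, func: Callable[[In], Out]) -> None:
        self._func = func

    def invoke(self, value: In) -> Out:
        # Run the wrapped function on a single input.
        return self._func(value)

    def batch(self, values: List[In]) -> List[Out]:
        # Run the wrapped function on each input, preserving order.
        return [self.invoke(v) for v in values]


# Toy corpus and a keyword lookup standing in for a real retrieval method.
corpus = [
    "NIKE, Inc. was incorporated in 1967.",
    "NIKE has eight significant distribution centers.",
]


def keyword_lookup(query: str) -> List[str]:
    return [text for text in corpus if query.lower() in text.lower()]


lookup = SimpleRunnable(keyword_lookup)
single = lookup.invoke("incorporated")
many = lookup.batch(["distribution", "incorporated"])
print(single)
print(many)
```

Because callers only depend on `invoke` and `batch`, the lookup function could be swapped for a vector search or an external API call without changing the calling code.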
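Vector stores implement an `as_retriever` method that will generate a `Retriever`, specifically a VectorStoreRetriever. These retrievers include specific `search_type` and `search_kwargs` attributes that identify which methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following: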
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
[Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]
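`VectorStoreRetriever` supports search types of `similarity` (default), `mmr` (maximum marginal relevance, described above), and `similarity_score_threshold`. We can use the latter to threshold the documents output by the retriever by similarity score.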
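The idea behind thresholding can be sketched in a few lines of plain Python. This is a conceptual illustration only, assuming scores where higher means more similar; the function name `filter_by_threshold`, the chunk labels, and the score values are all made up for the example.

```python
def filter_by_threshold(scored, score_threshold):
    """Keep only (document, score) pairs whose score meets the threshold.

    Assumes higher scores mean greater similarity; for a distance-style
    score the comparison would need to be reversed.
    """
    return [(doc, score) for doc, score in scored if score >= score_threshold]


# Hypothetical search results, already sorted by descending score.
scored = [
    ("chunk about revenue", 0.82),
    ("chunk about margins", 0.55),
    ("chunk about store counts", 0.31),
]

kept = filter_by_threshold(scored, score_threshold=0.5)
print(kept)
```

Unlike a fixed `k`, a threshold can return fewer results (or none) when nothing in the store is sufficiently close to the query.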
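Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the RAG tutorial.
Learn more:
Retrieval strategies can be rich and complex. For example:
- We can infer hard rules and filters from a query (e.g., "using documents published after 2020");
- We can return documents that are linked to the retrieved context in some way (e.g., via some document taxonomy);
- We can generate multiple embeddings for each unit of context;
- We can ensemble results from multiple retrievers;
- We can assign weights to documents, e.g., to weigh recent documents higher.
The retrievers section of the how-to guides covers these and other built-in retrieval strategies.
It is also straightforward to extend the BaseRetriever class in order to implement custom retrievers. See our how-to guide on custom retrievers.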
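As one example of the strategies listed above, combining results from multiple retrievers can be done with reciprocal rank fusion. The sketch below is a generic version of that technique, not a description of any specific LangChain retriever; the function name, the constant `k=60`, and the document ids are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of two different retrievers for the same query.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Rank-based fusion sidesteps the problem that scores from different retrievers are usually not on comparable scales.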
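Next steps
You have now seen how to build a semantic search engine over a PDF document.
From here, you can read more about document loaders, embeddings, vector stores, and RAG in the LangChain documentation.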