How to create a custom retriever

Overview

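Many LLM applications retrieve information from external data sources using a retriever.

A retriever is responsible for retrieving a list of documents relevant to a given user query.

The retrieved documents are often formatted into a prompt that is fed into an LLM, allowing the LLM to use the information in them to generate an appropriate response (e.g., answering a user question based on a knowledge base).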
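The formatting step described above can be sketched in plain Python. This is a minimal illustration, not a LangChain API: the helper name `build_prompt`, the prompt wording, and the `SimpleNamespace` stand-ins for retrieved documents are all assumptions made for this example.

```python
from types import SimpleNamespace


def build_prompt(question, docs):
    """Join retrieved document texts into a context block ahead of the question."""
    context = "\n\n".join(
        f"[{i}] {doc.page_content}" for i, doc in enumerate(docs, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Stand-ins for retrieved documents; any object with a page_content attribute works.
retrieved = [
    SimpleNamespace(page_content="Cats are independent pets."),
    SimpleNamespace(page_content="Rabbits are social animals."),
]
prompt = build_prompt("Which pets are independent?", retrieved)
print(prompt)
```

The resulting string would then be passed to whichever LLM the application uses.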

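Interface

To create your own retriever, you need to extend the BaseRetriever class and implement the following methods:

Method	Description	Required/Optional
`_get_relevant_documents`	Get documents relevant to a query.	Required
`_aget_relevant_documents`	Implement to provide async native support.	Optional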

The logic inside _get_relevant_documents can involve arbitrary calls to a database, or to the web using requests.

Tip

By inheriting from BaseRetriever, your retriever automatically becomes a LangChain Runnable and gains the standard Runnable functionality out of the box!

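Info

You can also use a RunnableLambda or RunnableGenerator to implement a retriever.

The main benefit of implementing a retriever as a BaseRetriever rather than a RunnableLambda (a custom runnable function) is that BaseRetriever is a well-known LangChain entity, so some monitoring tools may implement specialized behavior for retrievers. Another difference is that a BaseRetriever behaves slightly differently from a RunnableLambda in some APIs; for example, the start event in the astream_events API will be on_retriever_start instead of on_chain_start.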

Example

Let's implement a toy retriever that returns the documents (up to a limit k) whose text contains the text of the user query.

from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class ToyRetriever(BaseRetriever):
    """A toy retriever that returns up to k documents whose text contains the user query.

    This retriever only implements the sync method _get_relevant_documents.

    If the retriever were to involve file access or network access, it could benefit
    from a native async implementation of `_aget_relevant_documents`.

    As usual, with Runnables, there's a default async implementation that's provided
    that delegates to the sync implementation running on another thread.
    """

    documents: List[Document]
    """List of documents to retrieve from."""
    k: int
    """Number of top results to return"""

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        """Sync implementations for retriever."""
        matching_documents = []
        for document in self.documents:
            # Stop as soon as k matches have been collected.
            if len(matching_documents) >= self.k:
                return matching_documents

            if query.lower() in document.page_content.lower():
                matching_documents.append(document)
        return matching_documents

    # Optional: Provide a more efficient native implementation by overriding
    # _aget_relevant_documents
    # async def _aget_relevant_documents(
    #     self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun
    # ) -> List[Document]:
    #     """Asynchronously get documents relevant to a query.

    #     Args:
    #         query: String to find relevant documents for
    #         run_manager: The callbacks handler to use

    #     Returns:
    #         List of relevant documents
    #     """

API Reference: CallbackManagerForRetrieverRun | Document | BaseRetriever

Test it 🧪

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"type": "dog", "trait": "loyalty"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"type": "cat", "trait": "independence"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"type": "fish", "trait": "low maintenance"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"type": "bird", "trait": "intelligence"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"type": "rabbit", "trait": "social"},
    ),
]
retriever = ToyRetriever(documents=documents, k=3)

retriever.invoke("that")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]

It's a runnable, so it benefits from the standard Runnable interface! 🤩

await retriever.ainvoke("that")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]

retriever.batch(["dog", "cat"])

[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],
 [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]

async for event in retriever.astream_events("bar", version="v1"):
    print(event)

{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}
{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}
{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}

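Contributing

We appreciate contributions of interesting retrievers!

Here's a checklist to help make sure your contribution gets added to LangChain:

Documentation

The retriever contains docstrings for all initialization arguments, as these will be surfaced in the API Reference.
The retriever's class docstring contains a link to any relevant APIs used by the retriever (e.g., if the retriever retrieves from Wikipedia, it is good to link to the Wikipedia API!).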

Tests

Add unit or integration tests to verify that invoke and ainvoke work.

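Optimization

If the retriever connects to an external data source (e.g., an API or a file), it will almost certainly benefit from a native async optimization!

Provide a native async implementation of _aget_relevant_documents (used by ainvoke).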
