RAGatouille
RAGatouille makes it as simple as can be to use ColBERT! ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds. See the ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction paper.
There are multiple ways that we can use RAGatouille.
Setup
The integration lives in the ragatouille package.
pip install -U ragatouille
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
[Jan 10, 10:53:28] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
/Users/harrisonchase/.pyenv/versions/3.10.1/envs/langchain/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
  warnings.warn(
Retriever
We can use RAGatouille as a retriever, as sketched below. For more information on this, see the RAGatouille retriever documentation.
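For instance, a minimal sketch of that usage, assuming we first build an index over some documents (the collection contents and index name below are placeholders, not part of this tutorial):

# Minimal sketch: index a placeholder document, then expose the index
# as a LangChain retriever. collection and index_name are illustrative.
RAG.index(
    collection=["Miyazaki co-founded Studio Ghibli in June 1985."],
    index_name="miyazaki-demo",
)
colbert_retriever = RAG.as_langchain_retriever(k=3)
colbert_retriever.invoke("What animation studio did Miyazaki found?")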
Document Compressor
We can also use RAGatouille off-the-shelf as a reranker. This allows us to use ColBERT to rerank results retrieved from any generic retriever. The benefit of this is that we can do it on top of any existing index, so we don't need to create a new one. We can do this by using the document compressor abstraction in LangChain.
Setup Vanilla Retriever
First, let's set up a vanilla retriever as an example.
import requests
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None
text = get_wikipedia_page("Hayao_Miyazaki")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.create_documents([text])
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 10}
)
docs = retriever.invoke("What animation studio did Miyazaki found")
docs[0]
Document(page_content='collaborative projects. In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.')
We can see that the result isn't very relevant to the question asked.
Using ColBERT as a reranker
from langchain.retrievers import ContextualCompressionRetriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=RAG.as_langchain_document_compressor(), base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
    "What animation studio did Miyazaki found"
)
API Reference: ContextualCompressionRetriever
/Users/harrisonchase/.pyenv/versions/3.10.1/envs/langchain/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
compressed_docs[0]
Document(page_content='In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates". Some of the architecture in the film was also inspired by a Welsh mining town; Miyazaki witnessed the mining strike upon his first', metadata={'relevance_score': 26.5194149017334})
This answer is much more relevant!
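Note that the compressor also attaches a relevance_score to each document's metadata (visible in the output above), so we can filter the reranked results on it. A minimal sketch, where the cutoff of 20 is an arbitrary value chosen for illustration:

# Keep only documents whose ColBERT relevance score clears a threshold.
# The threshold value here is arbitrary and for illustration only.
relevant_docs = [
    doc for doc in compressed_docs if doc.metadata["relevance_score"] > 20
]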