Activeloop Deep Memory
Activeloop Deep Memory is a suite of tools that lets you optimize your vector store for your use case and achieve higher accuracy in your LLM applications.
Retrieval-Augmented Generation (RAG) has recently gained significant attention. As advanced RAG techniques and agents emerge, they expand the potential of what RAG can accomplish. However, several challenges may limit the integration of RAG into production. The primary factors to consider when implementing RAG in production environments are accuracy (recall), cost, and latency. For basic use cases, OpenAI's Ada model paired with a naive similarity search can produce satisfactory results. Yet, for higher accuracy or recall during search, one might need to employ advanced retrieval techniques. These methods might involve varying data chunk sizes, rewriting queries multiple times, and more, potentially increasing latency and costs. Activeloop's Deep Memory, a feature available to Activeloop Deep Lake users, addresses these issues by introducing a tiny neural network layer trained to match user queries with relevant data from a corpus. While this addition incurs minimal latency during search, it can boost retrieval accuracy by up to 27% and remains cost-effective and simple to use, without requiring any additional advanced RAG techniques.
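To give a sense of how lightweight this is in practice, the sketch below shows Deep Memory enabled at query time in LangChain. It is a preview only; the full setup is built step by step below, db stands for the DeepLake vector store created in section 1, and the query string is purely illustrative:
# Minimal sketch (preview): `db` is the DeepLake vector store constructed below.
retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = True  # a single flag enables the trained layer
relevant_docs = retriever.get_relevant_documents("How do I create a Deep Lake dataset?")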
In this tutorial, we will parse the DeepLake documentation and build a RAG system that can answer questions based on the docs.
1. Dataset Creation
For this tutorial we will parse activeloop's docs using the BeautifulSoup library and LangChain's document parsers like Html2TextTransformer and AsyncHtmlLoader. So we will need to install the following libraries:
%pip install --upgrade --quiet tiktoken langchain-openai python-dotenv datasets langchain deeplake beautifulsoup4 html2text ragas
Also you'll need to create an Activeloop account.
ORG_ID = "..."
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import DeepLake
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API token: ")
# # activeloop token is needed if you are not signed in using CLI: `activeloop login -u <USERNAME> -p <PASSWORD>`
if "ACTIVELOOP_TOKEN" not in os.environ:
os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass(
"Enter your ActiveLoop API token: "
) # Get your API token from https://app.activeloop.ai, click on your profile picture in the top right corner, and select "API Tokens"
token = os.getenv("ACTIVELOOP_TOKEN")
openai_embeddings = OpenAIEmbeddings()
db = DeepLake(
    dataset_path=f"hub://{ORG_ID}/deeplake-docs-deepmemory",  # org_id stands for your username or organization from activeloop
    embedding=openai_embeddings,
    runtime={"tensor_db": True},
    token=token,
    # overwrite=True,  # use the overwrite flag if you want to overwrite the full dataset
    read_only=False,
)
Parsing all links in the webpage using BeautifulSoup:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
def get_all_links(url):
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {url}")
        return []

    soup = BeautifulSoup(response.content, "html.parser")

    # Finding all 'a' tags which typically contain href attribute for links
    links = [
        urljoin(url, a["href"]) for a in soup.find_all("a", href=True) if a["href"]
    ]

    return links
base_url = "https://docs.deeplake.ai/en/latest/"
all_links = get_all_links(base_url)
Loading the data:
from langchain_community.document_loaders.async_html import AsyncHtmlLoader
loader = AsyncHtmlLoader(all_links)
docs = loader.load()
Converting the data into a user-readable format:
from langchain_community.document_transformers import Html2TextTransformer
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)
Now we will chunk the documents further, as some of them contain too much text:
from langchain_text_splitters import RecursiveCharacterTextSplitter
chunk_size = 4096
docs_new = []
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
)

for doc in docs_transformed:
    if len(doc.page_content) < chunk_size:
        docs_new.append(doc)
    else:
        docs = text_splitter.create_documents([doc.page_content])
        docs_new.extend(docs)
Populating the vector store:
docs = db.add_documents(docs_new)
2. Generating synthetic queries and training Deep Memory
The next step would be to train a deep_memory model that will align your user queries with the dataset that you already have. If you don't have any user queries yet, no worries, we will generate them using an LLM!
TODO: Add image
Above we showed how the overall schema of deep_memory works. So as you can see, in order to train it, you need relevance and queries together with corpus data (the data that we want to query). The corpus data was already populated in the previous section; here we will be generating questions and relevance.
questions - is a list of strings, where each string represents a query.
relevance - contains links to the ground truth for each question. There might be several documents that contain the answer to a given question; because of this, relevance is List[List[tuple[str, float]]], where the outer list represents queries and the inner list relevant documents. Each tuple is a str, float pair: the string is the id of the source document (corresponding to the id tensor in the dataset), and the float indicates how relevant that document is to the question.
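To make the expected shapes concrete, here is a small hand-written illustration (the ids and scores below are hypothetical placeholders, not real dataset values):
# Hypothetical illustration of the expected structures (ids are made up):
questions = [
    "How do I create a Deep Lake dataset?",
    "What is a tensor htype?",
]
relevance = [
    [("doc_id_42", 1.0)],                     # one relevant doc for query 1
    [("doc_id_7", 1.0), ("doc_id_13", 0.5)],  # two relevant docs for query 2
]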
Now let's generate synthetic questions and relevance:
from typing import List
from langchain.chains.openai_functions import (
    create_structured_output_chain,
)
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
# fetch dataset docs and ids if they exist (optionally, you can also ingest)
docs = db.vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)["value"]
ids = db.vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)["value"]
# If we pass in a model explicitly, we need to make sure it supports the OpenAI function-calling API.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
class Questions(BaseModel):
    """Questions generated about the provided text."""

    question: str = Field(..., description="Questions about text")
prompt_msgs = [
    SystemMessage(
        content="You are a world class expert for generating questions based on provided context. \
                You make sure the question can be answered by the text."
    ),
    HumanMessagePromptTemplate.from_template(
        "Use the given text to generate a question from the following input: {input}"
    ),
    HumanMessage(content="Tips: Make sure to answer in the correct format"),
]
prompt = ChatPromptTemplate(messages=prompt_msgs)
chain = create_structured_output_chain(Questions, llm, prompt, verbose=True)
text = "# Understanding Hallucinations and Bias ## **Introduction** In this lesson, we'll cover the concept of **hallucinations** in LLMs, highlighting their influence on AI applications and demonstrating how to mitigate them using techniques like the retriever's architectures. We'll also explore **bias** within LLMs with examples."
questions = chain.run(input=text)
print(questions)
import random
from langchain_openai import OpenAIEmbeddings
from tqdm import tqdm
def generate_queries(docs: List[str], ids: List[str], n: int = 100):
    questions = []
    relevances = []
    pbar = tqdm(total=n)
    while len(questions) < n:
        # 1. randomly draw a piece of text and its relevance id
        r = random.randint(0, len(docs) - 1)
        text, label = docs[r], ids[r]

        # 2. generate queries and assign the relevance id
        generated_qs = [chain.run(input=text).question]
        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
        pbar.update(len(generated_qs))

        if len(questions) % 10 == 0:
            print(f"q: {len(questions)}")
    return questions[:n], relevances[:n]
chain = create_structured_output_chain(Questions, llm, prompt, verbose=False)
questions, relevances = generate_queries(docs, ids, n=200)
train_questions, train_relevances = questions[:100], relevances[:100]
test_questions, test_relevances = questions[100:], relevances[100:]
Now we have created 100 training queries as well as 100 queries for testing. Now let's train deep_memory:
job_id = db.vectorstore.deep_memory.train(
    queries=train_questions,
    relevance=train_relevances,
)
Let's keep track of the training progress:
db.vectorstore.deep_memory.status(job_id)
--------------------------------------------------------------
|                  6538e02ecda4691033a51c5b                  |
--------------------------------------------------------------
| status                     | completed                     |
--------------------------------------------------------------
| progress                   | eta: 1.4 seconds              |
|                            | recall@10: 79.00% (+34.00%)   |
--------------------------------------------------------------
| results                    | recall@10: 79.00% (+34.00%)   |
--------------------------------------------------------------
3. Evaluating Deep Memory performance
Great, we've trained the model! It's showing a substantial improvement in recall, but how can we use it now and evaluate it on unseen new data? In this section we will delve into model evaluation and inference and see how it can be used with LangChain in order to increase retrieval accuracy.
3.1 Deep Memory evaluation
For a start we can use deep_memory's built-in evaluation method. It calculates several recall metrics, and it can be done easily in a few lines of code:
recall = db.vectorstore.deep_memory.evaluate(
    queries=test_questions,
    relevance=test_relevances,
)
Embedding queries took 0.81 seconds
---- Evaluating without model ----
Recall@1: 9.0%
Recall@3: 19.0%
Recall@5: 24.0%
Recall@10: 42.0%
Recall@50: 93.0%
Recall@100: 98.0%
---- Evaluating with model ----
Recall@1: 19.0%
Recall@3: 42.0%
Recall@5: 49.0%
Recall@10: 69.0%
Recall@50: 97.0%
Recall@100: 97.0%
It is showing quite a substantial improvement on the unseen test dataset as well!
3.2 Deep Memory + RAGas
from ragas.langchain import RagasEvaluatorChain
from ragas.metrics import (
    context_recall,
)
Let's convert the relevance ids into ground truth documents:
def convert_relevance_to_ground_truth(docs, ids, relevance):
    # relevance stores string ids, so map each dataset id to its document text
    id_to_doc = dict(zip(ids, docs))

    ground_truths = []
    for rel in relevance:
        ground_truth = []
        for doc_id, _ in rel:
            ground_truth.append(id_to_doc[doc_id])
        ground_truths.append(ground_truth)
    return ground_truths

ground_truths = convert_relevance_to_ground_truth(docs, ids, test_relevances)
for deep_memory in [False, True]:
    print("\nEvaluating with deep_memory =", deep_memory)
    print("===================================")

    retriever = db.as_retriever()
    retriever.search_kwargs["deep_memory"] = deep_memory

    qa_chain = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-3.5-turbo"),
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
    )

    metrics = {
        "context_recall_score": 0,
    }

    eval_chains = {m.name: RagasEvaluatorChain(metric=m) for m in [context_recall]}

    for question, ground_truth in zip(test_questions, ground_truths):
        result = qa_chain({"query": question})
        result["ground_truths"] = ground_truth
        for name, eval_chain in eval_chains.items():
            score_name = f"{name}_score"
            metrics[score_name] += eval_chain(result)[score_name]

    for metric in metrics:
        metrics[metric] /= len(test_questions)
        print(f"{metric}: {metrics[metric]}")

    print("===================================")
Evaluating with deep_memory = False
===================================
context_recall_score = 0.3763423145
===================================
Evaluating with deep_memory = True
===================================
context_recall_score = 0.5634545323
===================================
3.3 Deep Memory Inference
TODO: Add image
With deep_memory:
retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = True
retriever.search_kwargs["k"] = 10
query = "Deamination of cytidine to uridine on the minus strand of viral DNA results in catastrophic G-to-A mutations in the viral genome."
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"), chain_type="stuff", retriever=retriever
)
print(qa.run(query))
The base htype of the 'video_seq' tensor is 'video'.
Without deep_memory:
retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = False
retriever.search_kwargs["k"] = 10
query = "Deamination of cytidine to uridine on the minus strand of viral DNA results in catastrophic G-to-A mutations in the viral genome."
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"), chain_type="stuff", retriever=retriever
)
qa.run(query)
The text does not provide information on the base htype of the 'video_seq' tensor.
3.4 Deep Memory cost savings
Deep Memory increases retrieval accuracy without altering your existing workflow. Additionally, by reducing the top_k input into the LLM, you can significantly cut inference costs via lower token usage.
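As a rough sketch of what this can look like, assuming the trained model reaches comparable recall at a smaller k (the k values below are illustrative assumptions, not measured results):
# Hypothetical cost-saving setup: retrieve fewer chunks with deep_memory enabled.
# k=3 vs. a baseline of k=10 is an illustrative assumption, not a benchmark.
retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = True
retriever.search_kwargs["k"] = 3  # fewer chunks stuffed into the prompt -> fewer tokens

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"), chain_type="stuff", retriever=retriever
)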