
Activeloop Deep Lake

Activeloop Deep Lake is a multimodal vector store that stores embeddings and their metadata, including text, JSON, images, audio, video, and more. It saves the data locally, in your cloud, or on Activeloop storage. It performs hybrid search, combining embeddings with their attributes.

This notebook showcases basic functionality related to Activeloop Deep Lake. While Deep Lake can store embeddings, it is capable of storing any type of data. It is a serverless data lake with version control, a query engine, and streaming dataloaders to deep learning frameworks.

For more information, please see the Deep Lake documentation.

Setup

%pip install --upgrade --quiet  langchain-openai langchain-deeplake tiktoken

Examples provided by Activeloop

Integration with LangChain.

Deep Lake locally

from langchain_deeplake.vectorstores import DeeplakeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("activeloop token:")
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
API Reference: TextLoader

Create a local dataset

Create a dataset locally at ./my_deeplake/, then run a similarity search. The Deep Lake + LangChain integration uses Deep Lake datasets under the hood, so dataset and vector store can be used interchangeably. To create a dataset in your own cloud, or in Deep Lake storage, adjust the path accordingly.

db = DeeplakeVectorStore(
    dataset_path="./my_deeplake/", embedding_function=embeddings, overwrite=True
)
db.add_documents(docs)
# or shorter
# db = DeeplakeVectorStore.from_documents(docs, dataset_path="./my_deeplake/", embedding_function=embeddings, overwrite=True)

Query dataset

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)

Later, you can reload the dataset without recomputing the embeddings.

db = DeeplakeVectorStore(
    dataset_path="./my_deeplake/", embedding_function=embeddings, read_only=True
)
docs = db.similarity_search(query)

Setting read_only=True prevents accidental modification of the vector store when updates are not needed. This ensures that the data remains unchanged unless explicitly intended. It is generally a good practice to specify this argument to avoid unintended updates.

Retrieval Question/Answering

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=db.as_retriever(),
)
API Reference: RetrievalQA | ChatOpenAI
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

Attribute based filtering in metadata

Let's create another vector store containing metadata with the year the documents were created.

import random

for d in docs:
    d.metadata["year"] = random.randint(2012, 2014)

db = DeeplakeVectorStore.from_documents(
    docs, embeddings, dataset_path="./my_deeplake/", overwrite=True
)
db.similarity_search(
    "What did the president say about Ketanji Brown Jackson",
    filter={"metadata": {"year": 2013}},
)
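To illustrate how such a nested filter dict selects documents, here is a minimal pure-Python sketch of the matching logic (a hypothetical helper for illustration only, not Deep Lake's actual filter implementation):

```python
def matches_filter(metadata: dict, filter_spec: dict) -> bool:
    """Return True if every key/value pair under filter_spec["metadata"]
    matches the document's metadata exactly."""
    wanted = filter_spec.get("metadata", {})
    return all(metadata.get(k) == v for k, v in wanted.items())

docs_meta = [
    {"year": 2012, "source": "state_of_the_union.txt"},
    {"year": 2013, "source": "state_of_the_union.txt"},
    {"year": 2014, "source": "state_of_the_union.txt"},
]
# Keep only documents created in 2013
selected = [m for m in docs_meta if matches_filter(m, {"metadata": {"year": 2013}})]
print(selected)  # [{'year': 2013, 'source': 'state_of_the_union.txt'}]
```

Keys absent from the filter are ignored, so a document only needs to match the fields you specify.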

Choosing distance function

Available distance functions are L2 for Euclidean distance and cos for cosine similarity.

db.similarity_search(
    "What did the president say about Ketanji Brown Jackson?", distance_metric="l2"
)
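To see what the two metrics measure, here is a small self-contained sketch on toy vectors (plain Python, no Deep Lake involved):

```python
import math

def l2_distance(a, b):
    # Euclidean distance: smaller means closer
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine similarity: larger means closer; ignores vector magnitude
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a, b = [1.0, 0.0], [2.0, 0.0]
print(l2_distance(a, b))        # 1.0 -- the magnitudes differ
print(cosine_similarity(a, b))  # 1.0 -- same direction, so maximally similar
```

Cosine similarity is often preferred for text embeddings because it compares direction rather than magnitude.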

Maximal Marginal Relevance

Using maximal marginal relevance.

db.max_marginal_relevance_search(
    "What did the president say about Ketanji Brown Jackson?"
)
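Maximal marginal relevance trades off relevance to the query against diversity among the results already selected. Below is a minimal greedy sketch of the idea on toy vectors (a simplification for illustration, not LangChain's actual implementation; the lambda_mult value is chosen arbitrarily):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, scoring each as
    lambda_mult * relevance - (1 - lambda_mult) * redundancy."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        best, best_score = None, -float("inf")
        for i in remaining:
            relevance = cos_sim(query, candidates[i])
            # Redundancy: similarity to the closest already-selected result
            redundancy = max(
                (cos_sim(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
cands = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # index 1 is a near-duplicate of index 0
print(mmr(query, cands, k=2, lambda_mult=0.3))  # [0, 2]
```

With diversity weighted heavily (low lambda_mult), the near-duplicate is skipped in favour of the more distinct candidate; pure relevance ranking would return the duplicate instead.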

Delete dataset

db.delete_dataset()

Deep Lake datasets on cloud (Activeloop, AWS, GCS, etc.) or in memory

By default, Deep Lake datasets are stored locally. To store them in memory, in the Deep Lake Managed DB, or in any object storage, you can provide the corresponding path and credentials when creating the vector store. Some paths require registration with Activeloop and the creation of an API token, which can be retrieved here.

# The ACTIVELOOP_TOKEN environment variable was already set via getpass above
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing_python" # could be also ./local/path (much faster locally), s3://bucket/path/to/dataset, gcs://path/to/dataset, etc.

docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore(
    dataset_path=dataset_path, embedding_function=embeddings, overwrite=True
)
ids = db.add_documents(docs)
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing"

docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore(
    dataset_path=dataset_path,
    embedding_function=embeddings,
    overwrite=True,
)
ids = db.add_documents(docs)

Additionally, the execution of queries is supported within the similarity_search method, where the query can be specified using Deep Lake's Tensor Query Language (TQL).

search_id = db.dataset["ids"][0]
docs = db.similarity_search(
    query=None,
    tql=f"SELECT * WHERE ids == '{search_id}'",
)
db.dataset.summary()

Creating vector stores on AWS S3

dataset_path = "s3://BUCKET/langchain_test"  # could be also ./local/path (much faster locally), hub://bucket/path/to/dataset, gcs://path/to/dataset, etc.

embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore.from_documents(
    docs,
    dataset_path=dataset_path,
    embedding=embeddings,
    overwrite=True,
    creds={
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "aws_session_token": os.environ["AWS_SESSION_TOKEN"],  # Optional
    },
)

Deep Lake API

You can access the underlying Deep Lake dataset via db.dataset.

# get structure of the dataset
db.dataset.summary()
# get embeddings numpy array
embeds = db.dataset["embeddings"][:]

Transfer local dataset to cloud

Copy an already created dataset to the cloud. You can also transfer from cloud to local.

import deeplake

username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
source = f"hub://{username}/langchain_testing" # could be local, s3, gcs, etc.
destination = f"hub://{username}/langchain_test_copy" # could be local, s3, gcs, etc.


deeplake.copy(src=source, dst=destination)
db = DeeplakeVectorStore(dataset_path=destination, embedding_function=embeddings)
db.add_documents(docs)
