Activeloop Deep Lake
Activeloop Deep Lake is a multimodal vector store that holds embeddings and their metadata, including text, JSON, images, audio, video, and more. It stores the data locally, in your cloud, or on Activeloop storage, and performs hybrid search over the embeddings and their attributes.
This notebook showcases basic functionality related to Activeloop Deep Lake. While Deep Lake can store embeddings, it is capable of storing any type of data. It is a serverless data lake with version control, a query engine, and streaming dataloaders to deep learning frameworks.
For more information, please see the Deep Lake documentation
Setup
%pip install --upgrade --quiet langchain-openai langchain-deeplake tiktoken
Example provided by Activeloop
Deep Lake locally
from langchain_deeplake.vectorstores import DeeplakeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("activeloop token:")
from langchain_community.document_loaders import TextLoader
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
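CharacterTextSplitter splits the text on a separator and packs the pieces into chunks of at most chunk_size characters. A rough fixed-width sketch of the idea (illustrative only; the real splitter respects separators and chunk_overlap):

```python
def naive_chunks(text, chunk_size=1000):
    """Greedy fixed-size chunking -- a rough sketch of what a text splitter does."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = naive_chunks("a" * 2500, chunk_size=1000)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```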
embeddings = OpenAIEmbeddings()
Create a local dataset
Create a dataset locally at ./my_deeplake/, then run similarity search. The Deep Lake + LangChain integration uses Deep Lake datasets under the hood, so dataset and vector store can be used interchangeably. To create a dataset in your own cloud, or in Deep Lake storage, adjust the path accordingly.
db = DeeplakeVectorStore(
    dataset_path="./my_deeplake/", embedding_function=embeddings, overwrite=True
)
db.add_documents(docs)
# or shorter
# db = DeeplakeVectorStore.from_documents(docs, dataset_path="./my_deeplake/", embedding_function=embeddings, overwrite=True)
Query dataset
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
Later, you can reload the dataset without recomputing the embeddings
db = DeeplakeVectorStore(
    dataset_path="./my_deeplake/", embedding_function=embeddings, read_only=True
)
docs = db.similarity_search(query)
Setting read_only=True prevents accidental modification of the vector store when updates are not needed. This ensures the data remains unchanged unless explicitly intended. It is generally a good practice to specify this argument to avoid unintended updates.
Retrieval question/answering
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=db.as_retriever(),
)
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)
Attribute-based filtering in metadata
Let's create another vector store containing metadata with the year the documents were created.
import random
for d in docs:
    d.metadata["year"] = random.randint(2012, 2014)
db = DeeplakeVectorStore.from_documents(
    docs, embeddings, dataset_path="./my_deeplake/", overwrite=True
)
db.similarity_search(
    "What did the president say about Ketanji Brown Jackson",
    filter={"metadata": {"year": 2013}},
)
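The nested filter dict above matches documents whose metadata contains the given key/value pairs. A minimal plain-Python sketch of the same idea over in-memory documents (`match_metadata` is a hypothetical helper, not part of langchain-deeplake):

```python
def match_metadata(doc_metadata, filter_spec):
    """Return True if every key/value in filter_spec matches doc_metadata."""
    return all(doc_metadata.get(k) == v for k, v in filter_spec.items())

records = [
    {"page_content": "speech excerpt A", "metadata": {"year": 2013}},
    {"page_content": "speech excerpt B", "metadata": {"year": 2014}},
]

filtered = [r for r in records if match_metadata(r["metadata"], {"year": 2013})]
print([r["page_content"] for r in filtered])  # ['speech excerpt A']
```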
Choosing a distance function
Distance function L2 for Euclidean distance, cos for cosine similarity
db.similarity_search(
    "What did the president say about Ketanji Brown Jackson?", distance_metric="l2"
)
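To make the difference concrete, here is a quick pure-Python illustration of what the two metrics measure (toy vectors, not library code):

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means closer, sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means same direction, regardless of magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

a, b = [1.0, 0.0], [2.0, 0.0]
print(l2_distance(a, b))        # 1.0 -- the vectors differ in magnitude
print(cosine_similarity(a, b))  # 1.0 -- but point in the same direction
```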
Maximal marginal relevance
Using maximal marginal relevance
db.max_marginal_relevance_search(
    "What did the president say about Ketanji Brown Jackson?"
)
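MMR greedily balances relevance to the query against redundancy with results already selected. A minimal pure-Python sketch of the algorithm (illustrative only; the library implements this internally, and `lambda_mult` here mirrors the usual trade-off parameter):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmr(query_vec, vectors, k=2, lambda_mult=0.5):
    """Greedily pick k indices maximizing relevance minus redundancy."""
    selected = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos_sim(query_vec, vectors[i])
            redundancy = max(
                (cos_sim(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.2]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # indices 0 and 1 are near-duplicates
print(mmr(query, vectors, k=2))  # [1, 2]: the duplicate of the top hit is skipped
```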
Delete dataset
db.delete_dataset()
Deep Lake datasets in the cloud (Activeloop, AWS, GCS, etc.) or in memory
By default, Deep Lake datasets are stored locally. To store them in memory, in the Deep Lake Managed DB, or in any object storage, you can provide the corresponding path and credentials when creating the vector store. Some paths require registration with Activeloop and creation of an API token that can be retrieved here
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token  # assumes `activeloop_token` holds your API token
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing_python" # could be also ./local/path (much faster locally), s3://bucket/path/to/dataset, gcs://path/to/dataset, etc.
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore(
    dataset_path=dataset_path, embedding_function=embeddings, overwrite=True
)
ids = db.add_documents(docs)
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing"
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore(
    dataset_path=dataset_path,
    embedding_function=embeddings,
    overwrite=True,
)
ids = db.add_documents(docs)
TQL search
Additionally, executing queries is supported within the similarity_search method, where the query can be specified using Deep Lake's Tensor Query Language (TQL).
search_id = db.dataset["ids"][0]
docs = db.similarity_search(
    query=None,
    tql=f"SELECT * WHERE ids == '{search_id}'",
)
db.dataset.summary()
Creating vector stores on AWS S3
dataset_path = "s3://BUCKET/langchain_test" # could be also ./local/path (much faster locally), hub://bucket/path/to/dataset, gcs://path/to/dataset, etc.
embeddings = OpenAIEmbeddings()
db = DeeplakeVectorStore.from_documents(
    docs,
    dataset_path=dataset_path,
    embedding=embeddings,
    overwrite=True,
    creds={
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "aws_session_token": os.environ["AWS_SESSION_TOKEN"],  # Optional
    },
)
Deep Lake API
You can access the underlying Deep Lake dataset at db.dataset
# get structure of the dataset
db.dataset.summary()
# get embeddings numpy array
embeds = db.dataset["embeddings"][:]
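With the raw embedding matrix exposed, you can run computations outside the query engine. A hedged sketch of brute-force cosine search over an in-memory matrix (plain-Python toy vectors stand in for `db.dataset["embeddings"][:]`):

```python
import math

def top_k_cosine(query_vec, matrix, k=2):
    """Return indices of the k rows of `matrix` most similar to `query_vec`."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (
            math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        )
    scores = [(cos(query_vec, row), i) for i, row in enumerate(matrix)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

embeds = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # toy stand-in for the real embeddings
print(top_k_cosine([1.0, 0.0], embeds))  # [0, 2]
```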
Transferring a local dataset to the cloud
Copy an already created dataset to the cloud. You can also transfer from the cloud to local.
import deeplake
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
source = f"hub://{username}/langchain_testing" # could be local, s3, gcs, etc.
destination = f"hub://{username}/langchain_test_copy" # could be local, s3, gcs, etc.
deeplake.copy(src=source, dst=destination)
db = DeeplakeVectorStore(dataset_path=destination, embedding_function=embeddings)
db.add_documents(docs)