Activeloop Deep Lake
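Activeloop Deep Lake is a multi-modal vector store that stores embeddings together with their metadata, including text, JSON, images, audio, video, and more. It saves the data locally, in your own cloud, or on Activeloop storage. It performs hybrid search over both embeddings and their attributes.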
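This notebook showcases basic functionality related to Activeloop Deep Lake. Although Deep Lake can store embeddings, it is capable of storing any type of data. It is a serverless data lake with version control, a query engine, and streaming dataloaders for deep learning frameworks.
For more information, see the Deep Lake documentation or API reference.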
Setup
%pip install --upgrade --quiet langchain-openai langchain-community 'deeplake[enterprise]' tiktoken
Example provided by Activeloop
Deep Lake locally
from langchain_community.vectorstores import DeepLake
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
activeloop_token = getpass.getpass("activeloop token:")
embeddings = OpenAIEmbeddings()
from langchain_community.document_loaders import TextLoader
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
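Create a local dataset
Create a dataset locally at ./my_deeplake/, then run a similarity search. The Deep Lake + LangChain integration uses Deep Lake datasets under the hood, so the terms dataset and vector store are used interchangeably. To create a dataset in your own cloud or in Deep Lake storage, adjust the path accordingly.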
db = DeepLake(dataset_path="./my_deeplake/", embedding=embeddings, overwrite=True)
db.add_documents(docs)
# or shorter
# db = DeepLake.from_documents(docs, dataset_path="./my_deeplake/", embedding=embeddings, overwrite=True)
Query dataset
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
Dataset(path='./my_deeplake/', tensors=['embedding', 'id', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding embedding (42, 1536) float32 None
id text (42, 1) str None
metadata json (42, 1) str None
text text (42, 1) str None
To disable printing of the dataset summary, specify verbose=False when initializing the VectorStore.
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
Later, you can reload the dataset without recomputing embeddings:
db = DeepLake(dataset_path="./my_deeplake/", embedding=embeddings, read_only=True)
docs = db.similarity_search(query)
Deep Lake Dataset in ./my_deeplake/ already exists, loading from the storage
Deep Lake currently supports a single writer and multiple readers. Setting read_only=True avoids acquiring the writer lock.
Retrieval Question/Answering
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=db.as_retriever(),
)
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)
'The president said that Ketanji Brown Jackson is a former top litigator in private practice and a former federal public defender. She comes from a family of public school educators and police officers. She is a consensus builder and has received a broad range of support since being nominated.'
Attribute-based filtering in metadata
Let's create another vector store whose metadata records the year each document was created.
import random
for d in docs:
    d.metadata["year"] = random.randint(2012, 2014)
db = DeepLake.from_documents(
    docs, embeddings, dataset_path="./my_deeplake/", overwrite=True
)
Dataset(path='./my_deeplake/', tensors=['embedding', 'id', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding embedding (4, 1536) float32 None
id text (4, 1) str None
metadata json (4, 1) str None
text text (4, 1) str None
db.similarity_search(
    "What did the president say about Ketanji Brown Jackson",
    filter={"metadata": {"year": 2013}},
)
100%|██████████| 4/4 [00:00<00:00, 2936.16it/s]
[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2013}),
Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n\nWe’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n\nWe’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2013}),
Document(page_content='Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers. \n\nAnd as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up. \n\nThat ends on my watch. \n\nMedicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect. \n\nWe’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees. \n\nLet’s pass the Paycheck Fairness Act and paid leave. \n\nRaise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty. \n\nLet’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2013})]
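To make the meaning of the filter argument concrete, here is a minimal sketch in plain Python that does not use Deep Lake at all. It assumes that filter={"metadata": {"year": 2013}} keeps only documents whose metadata holds every listed key with an equal value, which is consistent with the results above. The helper name matches_metadata and the sample records are hypothetical and are not part of the Deep Lake or LangChain API.

```python
from typing import Any, Dict


def matches_metadata(metadata: Dict[str, Any], wanted: Dict[str, Any]) -> bool:
    """Return True if every key in `wanted` appears in `metadata` with an equal value."""
    return all(key in metadata and metadata[key] == value for key, value in wanted.items())


# Hypothetical records standing in for stored document chunks.
records = [
    {"text": "chunk A", "metadata": {"source": "speech.txt", "year": 2013}},
    {"text": "chunk B", "metadata": {"source": "speech.txt", "year": 2012}},
    {"text": "chunk C", "metadata": {"source": "speech.txt", "year": 2013}},
]

# Keep only the chunks whose metadata matches the requested attributes.
selected = [r["text"] for r in records if matches_metadata(r["metadata"], {"year": 2013})]
print(selected)
```

In this sketch the attribute filter only narrows the candidate set; ranking the surviving chunks by embedding similarity would be a separate step.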
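Choosing a distance function
The distance function can be L2 for Euclidean distance, L1 for the L1 (Manhattan) distance, Max for the l-infinity distance, cos for cosine similarity, or dot for the dot product.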
db.similarity_search(
    "What did the president say about Ketanji Brown Jackson?", distance_metric="cos"
)
[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2013}),
Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n\nWe’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n\nWe’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2013}),
Document(page_content='Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers. \n\nAnd as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up. \n\nThat ends on my watch. \n\nMedicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect. \n\nWe’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees. \n\nLet’s pass the Paycheck Fairness Act and paid leave. \n\nRaise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty. \n\nLet’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2013}),
Document(page_content='And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. \n\nAs I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n\nWhile it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. \n\nAnd soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. \n\nSo tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together. \n\nFirst, beat the opioid epidemic.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2012})]
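For readers who want to see what the five metric names refer to, the following sketch computes each one for a pair of small vectors using only the standard library. These are the textbook definitions, not Deep Lake's internal implementation. L1 is taken here to mean the sum of absolute differences, which is an assumption about the intended meaning, and the function names and sample vectors are illustrative only.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1(a: Vector, b: Vector) -> float:
    """L1 distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def linf(a: Vector, b: Vector) -> float:
    """l-infinity distance: largest absolute difference in any coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))


def dot(a: Vector, b: Vector) -> float:
    """Dot product: larger means more similar for vectors of comparable length."""
    return sum(x * y for x, y in zip(a, b))


def cos(a: Vector, b: Vector) -> float:
    """Cosine similarity: dot product of the two vectors after normalizing their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Map the metric names used in the text to the sketch functions.
metrics: Dict[str, Callable[[Vector, Vector], float]] = {
    "L2": l2, "L1": l1, "Max": linf, "cos": cos, "dot": dot,
}

a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 2.0]
for name, fn in metrics.items():
    print(f"{name}: {fn(a, b):.4f}")
```

Note that L2, L1, and Max are distances (smaller is closer), while cos and dot are similarities (larger is closer), so the sort order of results depends on which kind of metric is chosen.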
Maximal marginal relevance
Using maximal marginal relevance:
db.max_marginal_relevance_search(
    "What did the president say about Ketanji Brown Jackson?"
)
[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2013}),
Document(page_content='Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers. \n\nAnd as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up. \n\nThat ends on my watch. \n\nMedicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect. \n\nWe’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees. \n\nLet’s pass the Paycheck Fairness Act and paid leave. \n\nRaise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty. \n\nLet’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2013}),
Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n\nWe’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n\nWe’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2013}),
Document(page_content='And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. \n\nAs I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n\nWhile it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. \n\nAnd soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. \n\nSo tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together. \n\nFirst, beat the opioid epidemic.', metadata={'source': '../../how_to/state_of_the_union.txt', 'year': 2012})]
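The reordering in the results above comes from the way maximal marginal relevance trades relevance against redundancy. The sketch below is a generic, standard-library version of the textbook greedy procedure, not the code that Deep Lake or LangChain runs. The function names, the parameter names k and lambda_mult, and the two-dimensional sample vectors are chosen for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query: Vector, candidates: List[Vector], k: int = 2, lambda_mult: float = 0.5) -> List[int]:
    """Greedily pick k candidate indices, balancing query relevance against
    similarity to the candidates that were already picked."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_index, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        remaining.remove(best_index)
    return selected


query = [1.0, 0.5]
candidates = [[1.0, 0.4], [1.0, 0.3], [0.0, 1.0]]

print(mmr(query, candidates, k=2, lambda_mult=0.5))  # diversity-aware selection
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With lambda_mult=1.0 the redundancy term vanishes and the procedure reduces to ordinary similarity ranking; lower values push near-duplicate candidates further down.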
Delete dataset
db.delete_dataset()
If deletion fails, you can also force delete:
DeepLake.force_delete_by_path("./my_deeplake")
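Deep Lake datasets in the cloud (Activeloop, AWS, GCS, etc.) or in memory
By default, Deep Lake datasets are stored locally. To store them in memory, in the Deep Lake Managed DB, or in any object storage, provide the corresponding path and credentials when creating the vector store. Some paths require registering with Activeloop and creating an API token, which you can retrieve from your Activeloop account (app.activeloop.ai).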
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing_python" # could be also ./local/path (much faster locally), s3://bucket/path/to/dataset, gcs://path/to/dataset, etc.
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = DeepLake(dataset_path=dataset_path, embedding=embeddings, overwrite=True)
ids = db.add_documents(docs)
Your Deep Lake dataset has been successfully created!
Dataset(path='hub://adilkhan/langchain_testing_python', tensors=['embedding', 'id', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding embedding (42, 1536) float32 None
id text (42, 1) str None
metadata json (42, 1) str None
text text (42, 1) str None
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
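The dataset_path comment in the cell above lists several path styles (hub://, s3://, gcs://, and a local path). As a rough illustration of how the prefix alone signals where the data will live, here is a small standard-library sketch. The helper storage_backend and the labels it returns are hypothetical and not part of the Deep Lake API; only the prefixes named in this notebook are covered.

```python
def storage_backend(dataset_path: str) -> str:
    """Guess the storage location from the dataset path prefix (illustrative only)."""
    prefixes = {
        "hub://": "Activeloop",
        "s3://": "AWS S3",
        "gcs://": "Google Cloud Storage",
    }
    for prefix, backend in prefixes.items():
        if dataset_path.startswith(prefix):
            return backend
    # Anything without a recognized scheme is treated as a local path.
    return "local filesystem"


for path in ("hub://<USERNAME_OR_ORG>/langchain_testing_python",
             "s3://BUCKET/langchain_test",
             "./my_deeplake/"):
    print(path, "->", storage_backend(path))
```

A check like this can be handy before creating a vector store, for example to decide whether an Activeloop token or cloud credentials need to be supplied.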
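tensor_db execution option
To use Deep Lake's Managed Tensor Database, specify the runtime parameter as {'tensor_db': True} when creating the vector store. With this configuration, queries run on the Managed Tensor Database rather than on the client side. Note that this option does not apply to datasets stored locally or in memory. If a vector store has already been created outside the Managed Tensor Database, it can be transferred to the Managed Tensor Database by following the prescribed steps.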
# Embed and store the texts
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
dataset_path = f"hub://{username}/langchain_testing"
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = DeepLake(
    dataset_path=dataset_path,
    embedding=embeddings,
    overwrite=True,
    runtime={"tensor_db": True},
)
ids = db.add_documents(docs)
Your Deep Lake dataset has been successfully created!
Dataset(path='hub://adilkhan/langchain_testing', tensors=['embedding', 'id', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding embedding (42, 1536) float32 None
id text (42, 1) str None
metadata json (42, 1) str None
text text (42, 1) str None
TQL search
The similarity_search method can also execute queries written in Deep Lake's Tensor Query Language (TQL).
search_id = db.vectorstore.dataset.id[0].numpy()
search_id[0]
'8a6ff326-3a85-11ee-b840-13905694aaaf'
docs = db.similarity_search(
    query=None,
    tql=f"SELECT * WHERE id == '{search_id[0]}'",
)
db.vectorstore.summary()
Dataset(path='hub://adilkhan/langchain_testing', tensors=['embedding', 'id', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding embedding (42, 1536) float32 None
id text (42, 1) str None
metadata json (42, 1) str None
text text (42, 1) str None
Creating vector stores on AWS S3
dataset_path = "s3://BUCKET/langchain_test" # could be also ./local/path (much faster locally), hub://bucket/path/to/dataset, gcs://path/to/dataset, etc.
embeddings = OpenAIEmbeddings()
db = DeepLake.from_documents(
    docs,
    dataset_path=dataset_path,
    embedding=embeddings,
    overwrite=True,
    creds={
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "aws_session_token": os.environ["AWS_SESSION_TOKEN"],  # Optional
    },
)
s3://hub-2.0-datasets-n/langchain_test loaded successfully.
Evaluating ingest: 100%|██████████| 1/1 [00:10<00:00
Dataset(path='s3://hub-2.0-datasets-n/langchain_test', tensors=['embedding', 'ids', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding generic (4, 1536) float32 None
ids text (4, 1) str None
metadata json (4, 1) str None
text text (4, 1) str None
Deep Lake API
You can access the underlying Deep Lake dataset through db.vectorstore.
# get structure of the dataset
db.vectorstore.summary()
Dataset(path='hub://adilkhan/langchain_testing', tensors=['embedding', 'id', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding embedding (42, 1536) float32 None
id text (42, 1) str None
metadata json (42, 1) str None
text text (42, 1) str None
# get embeddings numpy array
embeds = db.vectorstore.dataset.embedding.numpy()
Transfer local dataset to cloud
Copy an already created dataset to the cloud. You can also transfer from the cloud to local storage.
import deeplake
username = "<USERNAME_OR_ORG>" # your username on app.activeloop.ai
source = f"hub://{username}/langchain_testing" # could be local, s3, gcs, etc.
destination = f"hub://{username}/langchain_test_copy" # could be local, s3, gcs, etc.
deeplake.deepcopy(src=source, dest=destination, overwrite=True)
Copying dataset: 100%|██████████| 56/56 [00:38<00:00
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/langchain_test_copy
Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!
Dataset(path='hub://davitbun/langchain_test_copy', tensors=['embedding', 'ids', 'metadata', 'text'])
db = DeepLake(dataset_path=destination, embedding=embeddings)
db.add_documents(docs)
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/langchain_test_copy
hub://davitbun/langchain_test_copy loaded successfully.
Deep Lake Dataset in hub://davitbun/langchain_test_copy already exists, loading from the storage
Dataset(path='hub://davitbun/langchain_test_copy', tensors=['embedding', 'ids', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding generic (4, 1536) float32 None
ids text (4, 1) str None
metadata json (4, 1) str None
text text (4, 1) str None
Evaluating ingest: 100%|██████████| 1/1 [00:31<00:00
Dataset(path='hub://davitbun/langchain_test_copy', tensors=['embedding', 'ids', 'metadata', 'text'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding generic (8, 1536) float32 None
ids text (8, 1) str None
metadata json (8, 1) str None
text text (8, 1) str None
['ad42f3fe-e188-11ed-b66d-41c5f7b85421',
'ad42f3ff-e188-11ed-b66d-41c5f7b85421',
'ad42f400-e188-11ed-b66d-41c5f7b85421',
'ad42f401-e188-11ed-b66d-41c5f7b85421']