Weaviate
本筆記本涵蓋如何開始在 LangChain 中使用 Weaviate 向量儲存庫,使用 langchain-weaviate
套件。
Weaviate 是一個開放原始碼的向量資料庫。它讓您可以儲存來自您最愛的 ML 模型的資料物件和向量嵌入,並無縫擴展到數十億個資料物件。
若要使用此整合,您需要有一個正在運行的 Weaviate 資料庫實例。
最低版本
此模組需要 Weaviate 1.23.7
或更高版本。但是,我們建議您使用最新版本的 Weaviate。
連線到 Weaviate
在本筆記本中,我們假設您在本機執行 Weaviate 實例於 https://127.0.0.1:8080
,並且 port 50051 為 gRPC 流量 開放。因此,我們將使用以下方式連線到 Weaviate
weaviate_client = weaviate.connect_to_local()
其他部署選項
Weaviate 可以透過多種不同的方式部署,例如使用 Weaviate Cloud Services (WCS)、Docker 或 Kubernetes。
如果您的 Weaviate 實例以其他方式部署,請在此處閱讀更多資訊,了解連線到 Weaviate 的不同方式。您可以使用不同的輔助函式或建立自訂實例。
請注意,您需要
v4
用戶端 API,這將建立一個weaviate.WeaviateClient
物件。
身份驗證
某些 Weaviate 實例(例如在 WCS 上運行的實例)已啟用身份驗證,例如 API 金鑰和/或使用者名稱+密碼身份驗證。
請閱讀用戶端身份驗證指南以取得更多資訊,以及深入的身份驗證配置頁面。
安裝
# install package
# %pip install -Uqq langchain-weaviate
# %pip install openai tiktoken langchain
環境設定
本筆記本透過 OpenAIEmbeddings
使用 OpenAI API。我們建議取得 OpenAI API 金鑰,並將其匯出為名為 OPENAI_API_KEY
的環境變數。
完成後,您的 OpenAI API 金鑰將會自動讀取。如果您不熟悉環境變數,請在此處或本指南中閱讀更多相關資訊。
用法
依相似性尋找物件
以下範例說明如何依據與查詢的相似性來尋找物件,從資料匯入到查詢 Weaviate 實例。
步驟 1:資料匯入
首先,我們將建立要新增到 Weaviate
的資料,方法是載入長文字檔的內容並進行分塊。
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.embeddings.openai.OpenAIEmbeddings` was deprecated in langchain-community 0.1.0 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAIEmbeddings`.
warn_deprecated(
現在,我們可以匯入資料。
若要執行此操作,請連線到 Weaviate 實例並使用產生的 weaviate_client
物件。例如,我們可以匯入文件,如下所示
import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore
weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
步驟 2:執行搜尋
我們現在可以執行相似性搜尋。這將根據儲存在 Weaviate 中的嵌入以及從查詢文字產生的等效嵌入,傳回與查詢文字最相似的文件。
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
# Print the first 100 characters of each result
for i, doc in enumerate(docs):
print(f"\nDocument {i+1}:")
print(doc.page_content[:100] + "...")
Document 1:
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...
Document 2:
And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of ...
Document 3:
Vice President Harris and I ran for office with a new economic vision for America.
Invest in Ameri...
Document 4:
A former top litigator in private practice. A former federal public defender. And from a family of p...
您也可以新增篩選器,根據篩選條件包含或排除結果。(請參閱更多篩選器範例。)
from weaviate.classes.query import Filter
for filter_str in ["blah.txt", "state_of_the_union.txt"]:
search_filter = Filter.by_property("source").equal(filter_str)
filtered_search_results = db.similarity_search(query, filters=search_filter)
print(len(filtered_search_results))
if filter_str == "state_of_the_union.txt":
assert len(filtered_search_results) > 0 # There should be at least one result
else:
assert len(filtered_search_results) == 0 # There should be no results
0
4
也可以提供 k
,這是要傳回的結果數上限。
search_filter = Filter.by_property("source").equal("state_of_the_union.txt")
filtered_search_results = db.similarity_search(query, filters=search_filter, k=3)
assert len(filtered_search_results) <= 3
量化結果相似性
您可以選擇性地檢索相關性「分數」。這是一個相對分數,表示特定搜尋結果在搜尋結果池中的品質。
請注意,這是相對分數,表示不應用於判斷相關性的閾值。但是,它可用於比較整個搜尋結果集中不同搜尋結果的相關性。
docs = db.similarity_search_with_score("country", k=5)
for doc in docs:
print(f"{doc[1]:.3f}", ":", doc[0].page_content[:100] + "...")
0.935 : For that purpose we’ve mobilized American ground forces, air squadrons, and ship deployments to prot...
0.500 : And built the strongest, freest, and most prosperous nation the world has ever known.
Now is the h...
0.462 : If you travel 20 miles east of Columbus, Ohio, you’ll find 1,000 empty acres of land.
It won’t loo...
0.450 : And my report is this: the State of the Union is strong—because you, the American people, are strong...
0.442 : Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...
搜尋機制
similarity_search
使用 Weaviate 的混合搜尋。
混合搜尋結合了向量和關鍵字搜尋,其中 alpha
作為向量搜尋的權重。similarity_search
函式可讓您將額外引數作為 kwargs 傳遞。請參閱此參考文件,了解可用的引數。
因此,您可以透過新增 alpha=0
來執行純關鍵字搜尋,如下所示
docs = db.similarity_search(query, alpha=0)
docs[0]
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
持久性
透過 langchain-weaviate
新增的任何資料都將根據其組態在 Weaviate 中持續存在。
例如,WCS 實例設定為無限期地保存資料,而 Docker 實例可以設定為將資料保存在磁碟區中。請閱讀更多關於 Weaviate 的持久性。
多租戶
多租戶可讓您在單一 Weaviate 實例中擁有大量隔離的資料集合,且具有相同的集合組態。這對於多使用者環境非常有用,例如建立 SaaS 應用程式,每個終端使用者都將擁有自己隔離的資料集合。
若要使用多租戶,向量儲存庫需要知道 tenant
參數。
因此,在新增任何資料時,請提供 tenant
參數,如下所示。
db_with_mt = WeaviateVectorStore.from_documents(
docs, embeddings, client=weaviate_client, tenant="Foo"
)
2024-Mar-26 03:40 PM - langchain_weaviate.vectorstores - INFO - Tenant Foo does not exist in index LangChain_30b9273d43b3492db4fb2aba2e0d6871. Creating tenant.
並且在執行查詢時,也請提供 tenant
參數。
db_with_mt.similarity_search(query, tenant="Foo")
[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of food, gas, housing, and so much more. \n\nI understand. \n\nI remember when my Dad had to leave our home in Scranton, Pennsylvania to find work. I grew up in a family where if the price of food went up, you felt it. \n\nThat’s why one of the first things I did as President was fight to pass the American Rescue Plan. \n\nBecause people were hurting. We needed to act, and we did. \n\nFew pieces of legislation have done more in a critical moment in our history to lift us out of crisis. \n\nIt fueled our efforts to vaccinate the nation and combat COVID-19. It delivered immediate economic relief for tens of millions of Americans. \n\nHelped put food on their table, keep a roof over their heads, and cut the cost of health insurance. \n\nAnd as my Dad used to say, it gave people a little breathing room.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='He and his Dad both have Type 1 diabetes, which means they need insulin every day. Insulin costs about $10 a vial to make. \n\nBut drug companies charge families like Joshua and his Dad up to 30 times more. I spoke with Joshua’s mom. \n\nImagine what it’s like to look at your child who needs insulin and have no idea how you’re going to pay for it. \n\nWhat it does to your dignity, your ability to look your child in the eye, to be the parent you expect to be. \n\nJoshua is here with us tonight. Yesterday was his birthday. Happy birthday, buddy. \n\nFor Joshua, and for the 200,000 other young people with Type 1 diabetes, let’s cap the cost of insulin at $35 a month so everyone can afford it. \n\nDrug companies will still do very well. And while we’re at it let Medicare negotiate lower prices for prescription drugs, like the VA already does.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='Putin’s latest attack on Ukraine was premeditated and unprovoked. \n\nHe rejected repeated efforts at diplomacy. \n\nHe thought the West and NATO wouldn’t respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. \n\nWe prepared extensively and carefully. \n\nWe spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. \n\nI spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. \n\nWe countered Russia’s lies with truth. \n\nAnd now that he has acted the free world is holding him accountable. \n\nAlong with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland.', metadata={'source': 'state_of_the_union.txt'})]
檢索器選項
Weaviate 也可以用作檢索器
最大邊際相關性搜尋 (MMR)
除了在檢索器物件中使用 similaritysearch 之外,您也可以使用 mmr
。
retriever = db.as_retriever(search_type="mmr")
retriever.invoke(query)[0]
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
與 LangChain 搭配使用
大型語言模型 (LLM) 的一個已知限制是其訓練資料可能已過時,或未包含您需要的特定領域知識。
請看以下範例
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm.predict("What did the president say about Justice Breyer")
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.chat_models.openai.ChatOpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.
warn_deprecated(
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `predict` was deprecated in LangChain 0.1.7 and will be removed in 0.2.0. Use invoke instead.
warn_deprecated(
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
"I'm sorry, I cannot provide real-time information as my responses are generated based on a mixture of licensed data, data created by human trainers, and publicly available data. The last update was in October 2021."
向量儲存庫透過提供儲存和檢索相關資訊的方式來補充 LLM。這讓您可以結合 LLM 和向量儲存庫的優勢,方法是將 LLM 的推理和語言能力與向量儲存庫檢索相關資訊的能力結合使用。
結合 LLM 和向量儲存庫的兩個著名應用是
- 問答
- 檢索增強生成 (RAG)
帶來源的問答
在 langchain 中的問答可以透過使用向量儲存庫來增強。讓我們看看如何做到這一點。
本節使用 RetrievalQAWithSourcesChain
,它會從索引中查詢文件。
首先,我們將再次對文字進行分塊,並將其匯入到 Weaviate 向量儲存庫中。
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import OpenAI
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
docsearch = WeaviateVectorStore.from_texts(
texts,
embeddings,
client=weaviate_client,
metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)
現在我們可以建構鏈,並指定檢索器
chain = RetrievalQAWithSourcesChain.from_chain_type(
OpenAI(temperature=0), chain_type="stuff", retriever=docsearch.as_retriever()
)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.llms.openai.OpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAI`.
warn_deprecated(
並執行鏈,以提出問題
chain(
{"question": "What did the president say about Justice Breyer"},
return_only_outputs=True,
)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.
warn_deprecated(
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
{'answer': ' The president thanked Justice Stephen Breyer for his service and announced his nomination of Judge Ketanji Brown Jackson to the Supreme Court.\n',
'sources': '31-pl'}
檢索增強生成
結合 LLM 和向量儲存庫的另一個非常流行的應用是檢索增強生成 (RAG)。這是一種使用檢索器從向量儲存庫中尋找相關資訊的技術,然後使用 LLM 根據檢索到的資料和提示提供輸出。
我們從類似的設定開始
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
docsearch = WeaviateVectorStore.from_texts(
texts,
embeddings,
client=weaviate_client,
metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)
retriever = docsearch.as_retriever()
我們需要為 RAG 模型建構一個範本,以便檢索到的資訊將填入範本中。
from langchain_core.prompts import ChatPromptTemplate
template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)
print(prompt)
input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question}\nContext: {context}\nAnswer:\n"))]
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
執行儲存格,我們得到非常相似的輸出。
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
rag_chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
rag_chain.invoke("What did the president say about Justice Breyer")
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
"The president honored Justice Stephen Breyer for his service to the country as an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. The president also mentioned nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to continue Justice Breyer's legacy of excellence. The president expressed gratitude towards Justice Breyer and highlighted the importance of nominating someone to serve on the United States Supreme Court."
但請注意,由於範本由您自行建構,您可以根據您的需求自訂它。
總結與資源
Weaviate 是一個可擴展、可投入生產環境的向量資料庫。
此整合讓 Weaviate 可以與 LangChain 搭配使用,透過穩健的資料儲存來增強大型語言模型的功能。其可擴展性和可投入生產環境的特性,使其成為 LangChain 應用程式向量資料庫的絕佳選擇,並能縮短您的產品上市時間。