
Weaviate

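This notebook covers how to get started with the Weaviate vector store in LangChain, using the langchain-weaviate package.

Weaviate is an open-source vector database. It allows you to store data objects and vector embeddings from your favorite ML models, and to scale seamlessly into billions of data objects.

To use this integration, you need to have a running Weaviate database instance.

Minimum versions

This module requires Weaviate 1.23.7 or higher. However, we recommend that you use the latest version of Weaviate.

Connecting to Weaviate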

In this notebook, we assume that you have a local instance of Weaviate running on http://127.0.0.1:8080, with port 50051 open for gRPC traffic. So, we will connect to Weaviate with:

import weaviate

weaviate_client = weaviate.connect_to_local()

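Other deployment options

Weaviate can be deployed in many different ways, such as using Weaviate Cloud Services (WCS), Docker, or Kubernetes.

If your Weaviate instance is deployed in another way, read more about the different ways to connect to Weaviate in the Weaviate documentation. You can use different helper functions, or create a custom instance.

Note that you need a v4 client API, which will create a weaviate.WeaviateClient object.

Authentication

Some Weaviate instances, such as those running on WCS, have authentication enabled, for example API key and/or username + password authentication.

Read the client authentication guide for more information, as well as the in-depth authentication configuration page.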

Installation

# install package
# %pip install -Uqq langchain-weaviate
# %pip install openai tiktoken langchain

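Environment setup

This notebook uses the OpenAI API through OpenAIEmbeddings. We suggest obtaining an OpenAI API key and exporting it as an environment variable named OPENAI_API_KEY.

Once this is done, your OpenAI API key will be read automatically. If you are new to environment variables, read an introductory guide to them first.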
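As a minimal sketch of the environment check described above, the snippet below looks up OPENAI_API_KEY and fails early with a clear message if it is missing. The helper name `get_openai_api_key` is hypothetical and not part of LangChain or the OpenAI SDK; the key value in the demo is a placeholder.

```python
import os


def get_openai_api_key(environ=os.environ):
    """Return the OpenAI API key from the given environment mapping, or raise."""
    key = environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the notebook."
        )
    return key


# Demonstrate with a stand-in mapping so no real key is needed here.
demo_key = get_openai_api_key({"OPENAI_API_KEY": "sk-placeholder"})
print("key loaded:", bool(demo_key))
```

Calling `get_openai_api_key()` with no arguments reads the real process environment instead of the stand-in mapping.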

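Usage

Find objects by similarity

Here is an example of how to find objects by similarity to a query, from data import to querying the Weaviate instance.

Step 1: Data import

First, we will create data to add to Weaviate by loading and chunking the contents of a long text file.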

from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
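To make the effect of chunk_size with chunk_overlap=0 more concrete, here is a simplified pure-Python sketch of separator-based chunking. It assumes paragraphs are separated by blank lines and greedily packs them into chunks of at most chunk_size characters. The function `split_into_chunks` is hypothetical and is not the CharacterTextSplitter implementation.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is a single oversized piece kept whole.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = split_into_chunks(sample, chunk_size=40)
print(chunks)
```

With a 40-character limit, the first two paragraphs fit together and the third starts a new chunk; no text is shared between chunks because there is no overlap.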

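Now, we can import the data.

To do so, connect to the Weaviate instance and use the resulting weaviate_client object. For example, we can import the documents as shown below: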

import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore
weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)

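We can now perform a similarity search. This returns the documents most similar to the query text, based on the embeddings stored in Weaviate and an equivalent embedding generated from the query text.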

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

# Print the first 100 characters of each result
for i, doc in enumerate(docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:100] + "...")

Document 1:
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...

Document 2:
And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of ...

Document 3:
Vice President Harris and I ran for office with a new economic vision for America.

Invest in Ameri...

Document 4:
A former top litigator in private practice. A former federal public defender. And from a family of p...
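The vector side of a similarity search can be pictured as ranking stored embeddings by their closeness to the query embedding. The sketch below uses cosine similarity over tiny hand-made vectors; the vectors, labels, and the `top_k` helper are illustrative assumptions, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy three-dimensional "embeddings" for three chunks.
toy_vectors = {
    "courts": [0.9, 0.1, 0.0],
    "economy": [0.1, 0.9, 0.1],
    "defense": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
```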

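You can also add filters, which include or exclude results based on the filter conditions. (See the Weaviate documentation for more filter examples.)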

from weaviate.classes.query import Filter

for filter_str in ["blah.txt", "state_of_the_union.txt"]:
    search_filter = Filter.by_property("source").equal(filter_str)
    filtered_search_results = db.similarity_search(query, filters=search_filter)
    print(len(filtered_search_results))
    if filter_str == "state_of_the_union.txt":
        assert len(filtered_search_results) > 0  # There should be at least one result
    else:
        assert len(filtered_search_results) == 0  # There should be no results
0
4

It is also possible to provide k, which is the upper limit on the number of results to return.

search_filter = Filter.by_property("source").equal("state_of_the_union.txt")
filtered_search_results = db.similarity_search(query, filters=search_filter, k=3)
assert len(filtered_search_results) <= 3
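The interaction between a filter and k can be sketched without a database: the filter narrows the candidate set, the remaining candidates are ranked, and k caps how many are returned. The records, scores, and the `search` helper below are made up for illustration and do not reflect how Weaviate executes filters internally.

```python
def search(candidates, source, k):
    """Filter by source, rank by score (highest first), and return at most k items."""
    allowed = [c for c in candidates if c["source"] == source]
    allowed.sort(key=lambda c: c["score"], reverse=True)
    return allowed[:k]


candidates = [
    {"text": "chunk A", "source": "state_of_the_union.txt", "score": 0.91},
    {"text": "chunk B", "source": "state_of_the_union.txt", "score": 0.55},
    {"text": "chunk C", "source": "other.txt", "score": 0.99},
]

print(len(search(candidates, "blah.txt", k=3)))
print([c["text"] for c in search(candidates, "state_of_the_union.txt", k=3)])
```

Because k is only an upper limit, a filter that matches two records returns two results even when k=3.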

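Quantify result similarity

You can optionally retrieve a relevance "score". This is a relative score that indicates how good a particular search result is among the pool of search results.

Note that because this is a relative score, it should not be used to determine thresholds for relevance. However, it can be used to compare the relevance of different search results within the entire search result set.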

docs = db.similarity_search_with_score("country", k=5)

for doc in docs:
    print(f"{doc[1]:.3f}", ":", doc[0].page_content[:100] + "...")
0.935 : For that purpose we’ve mobilized American ground forces, air squadrons, and ship deployments to prot...
0.500 : And built the strongest, freest, and most prosperous nation the world has ever known.

Now is the h...
0.462 : If you travel 20 miles east of Columbus, Ohio, you’ll find 1,000 empty acres of land.

It won’t loo...
0.450 : And my report is this: the State of the Union is strong—because you, the American people, are strong...
0.442 : Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...
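One way to see why a relative score makes a poor threshold is to normalize the same raw score against two different result pools. The sketch below uses min-max normalization purely as a toy illustration; it is an assumption for demonstration and not Weaviate's actual scoring formula.

```python
def min_max_normalize(scores):
    """Rescale scores to the range 0..1 relative to the pool they belong to."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


pool_a = [8, 6, 4]  # raw scores from one query
pool_b = [8, 6, 0]  # raw scores from another query

print(min_max_normalize(pool_a))
print(min_max_normalize(pool_b))
```

The raw score 6 becomes 0.5 in the first pool and 0.75 in the second, so a fixed cut-off such as 0.6 would treat the same raw match differently depending on what else was returned.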

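Search mechanism

similarity_search uses Weaviate's hybrid search.

A hybrid search combines a vector search and a keyword search, with alpha as the weight of the vector search. The similarity_search function allows you to pass additional arguments as kwargs. See the Weaviate hybrid search reference documentation for the available arguments.

So, you can perform a pure keyword search by adding alpha=0, as shown below: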

docs = db.similarity_search(query, alpha=0)
docs[0]
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
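The role of alpha can be sketched as a simple weighted blend of a vector score and a keyword score. This is only an illustration of the weighting idea under the assumption that both scores are already on a comparable 0..1 scale; Weaviate's real fusion step is more involved, and `hybrid_score` is a hypothetical helper.

```python
def hybrid_score(vector_score, keyword_score, alpha):
    """Blend two scores: alpha weights the vector side, (1 - alpha) the keyword side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1.0 - alpha) * keyword_score


for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_score(vector_score=0.8, keyword_score=0.2, alpha=alpha))
```

At alpha=0 the vector score is ignored entirely, which corresponds to the pure keyword search described above, and at alpha=1 only the vector score counts.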

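Persistence

Any data added through langchain-weaviate will persist in Weaviate according to its configuration.

WCS instances, for example, are configured to persist data indefinitely, and Docker instances can be set up to persist data in a volume. Read more about persistence in the Weaviate documentation.

Multi-tenancy

Multi-tenancy allows you to have a high number of isolated collections of data, with the same collection configuration, in a single Weaviate instance. This is great for multi-user environments such as a SaaS app, where each end user has their own isolated data collection.

To use multi-tenancy, the vector store needs to be aware of the tenant parameter.

So when adding any data, provide the tenant parameter as shown below.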

db_with_mt = WeaviateVectorStore.from_documents(
    docs, embeddings, client=weaviate_client, tenant="Foo"
)
2024-Mar-26 03:40 PM - langchain_weaviate.vectorstores - INFO - Tenant Foo does not exist in index LangChain_30b9273d43b3492db4fb2aba2e0d6871. Creating tenant.

And when performing queries, provide the tenant parameter as well.

db_with_mt.similarity_search(query, tenant="Foo")
[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of food, gas, housing, and so much more. \n\nI understand. \n\nI remember when my Dad had to leave our home in Scranton, Pennsylvania to find work. I grew up in a family where if the price of food went up, you felt it. \n\nThat’s why one of the first things I did as President was fight to pass the American Rescue Plan. \n\nBecause people were hurting. We needed to act, and we did. \n\nFew pieces of legislation have done more in a critical moment in our history to lift us out of crisis. \n\nIt fueled our efforts to vaccinate the nation and combat COVID-19. It delivered immediate economic relief for tens of millions of Americans. \n\nHelped put food on their table, keep a roof over their heads, and cut the cost of health insurance. \n\nAnd as my Dad used to say, it gave people a little breathing room.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='He and his Dad both have Type 1 diabetes, which means they need insulin every day. Insulin costs about $10 a vial to make. \n\nBut drug companies charge families like Joshua and his Dad up to 30 times more. I spoke with Joshua’s mom. \n\nImagine what it’s like to look at your child who needs insulin and have no idea how you’re going to pay for it. \n\nWhat it does to your dignity, your ability to look your child in the eye, to be the parent you expect to be. \n\nJoshua is here with us tonight. Yesterday was his birthday. Happy birthday, buddy. \n\nFor Joshua, and for the 200,000 other young people with Type 1 diabetes, let’s cap the cost of insulin at $35 a month so everyone can afford it. \n\nDrug companies will still do very well. And while we’re at it let Medicare negotiate lower prices for prescription drugs, like the VA already does.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='Putin’s latest attack on Ukraine was premeditated and unprovoked. \n\nHe rejected repeated efforts at diplomacy. \n\nHe thought the West and NATO wouldn’t respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. \n\nWe prepared extensively and carefully. \n\nWe spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. \n\nI spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. \n\nWe countered Russia’s lies with truth. \n\nAnd now that he has acted the free world is holding him accountable. \n\nAlong with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland.', metadata={'source': 'state_of_the_union.txt'})]

Retriever options

Weaviate can also be used as a retriever.

Maximal marginal relevance search (MMR)

In addition to using similarity_search in the retriever object, you can also use mmr.

retriever = db.as_retriever(search_type="mmr")
retriever.invoke(query)[0]
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
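For intuition about what MMR optimizes, here is a small pure-Python sketch of the selection loop: each step picks the document that is relevant to the query while being least similar to the documents already chosen. The two-dimensional vectors are made up, the parameter name `lambda_mult` is borrowed as an assumption, and this is not the code that runs inside the retriever.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [0.9848, 0.1736],   # close to the query
    [0.9781, 0.2079],   # near-duplicate of the first document
    [0.7660, -0.6428],  # less relevant, but different from the first
]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))
```

With lambda_mult=0.5 the near-duplicate is skipped in favor of the more diverse third document, while lambda_mult=1.0 reduces the selection to plain relevance ranking.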

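Use with LangChain

A known limitation of large language models (LLMs) is that their training data can be outdated, or may not include the specific domain knowledge that you require.

Take a look at the example below: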

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm.invoke("What did the president say about Justice Breyer").content
"I'm sorry, I cannot provide real-time information as my responses are generated based on a mixture of licensed data, data created by human trainers, and publicly available data. The last update was in October 2021."

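Vector stores complement LLMs by providing a way to store and retrieve relevant information. This allows you to combine the strengths of both, pairing the LLM's reasoning and linguistic capabilities with the vector store's ability to retrieve relevant information.

Two well-known applications for combining LLMs and vector stores are:

  • Question answering
  • Retrieval-augmented generation (RAG)

Question answering with sources

Question answering in langchain can be enhanced by the use of vector stores. Let's see how this can be done.

This section uses the RetrievalQAWithSourcesChain, which looks up the documents from an index.

First, we will chunk the text again and import the chunks into the Weaviate vector store.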

from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import OpenAI

with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)

docsearch = WeaviateVectorStore.from_texts(
    texts,
    embeddings,
    client=weaviate_client,
    metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)

Now we can construct the chain, with the retriever specified:

chain = RetrievalQAWithSourcesChain.from_chain_type(
    OpenAI(temperature=0), chain_type="stuff", retriever=docsearch.as_retriever()
)

And run the chain to ask the question:

chain.invoke(
    {"question": "What did the president say about Justice Breyer"},
    return_only_outputs=True,
)
{'answer': ' The president thanked Justice Stephen Breyer for his service and announced his nomination of Judge Ketanji Brown Jackson to the Supreme Court.\n',
'sources': '31-pl'}
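Since each chunk was stored with a source label of the form "{i}-pl", the returned sources string can be mapped back to chunk positions. The sketch below assumes that multiple sources, when present, arrive as a comma-separated string; that format and the `source_indices` helper are assumptions for illustration.

```python
def source_indices(sources, suffix="-pl"):
    """Extract chunk indices from a sources string such as '31-pl, 4-pl'."""
    indices = []
    for token in sources.split(","):
        token = token.strip()
        number = token[: -len(suffix)]
        if token.endswith(suffix) and number.isdigit():
            indices.append(int(number))
    return indices


print(source_indices("31-pl"))
```

The resulting indices can then be used to look up the original chunks, for example `texts[31]`, to check what the answer was based on.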

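Retrieval-augmented generation

Another very popular application of combining LLMs and vector stores is retrieval-augmented generation (RAG). This technique uses a retriever to find relevant information in a vector store, and then uses an LLM to produce an output based on the retrieved data and a prompt.

We begin with a similar setup: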

with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)

docsearch = WeaviateVectorStore.from_texts(
    texts,
    embeddings,
    client=weaviate_client,
    metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)

retriever = docsearch.as_retriever()

We need to construct a template for the RAG model so that the retrieved information can be populated into it.

from langchain_core.prompts import ChatPromptTemplate

template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

print(prompt)
input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question}\nContext: {context}\nAnswer:\n"))]

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

Running the cell, we get a very similar output.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What did the president say about Justice Breyer")
"The president honored Justice Stephen Breyer for his service to the country as an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. The president also mentioned nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to continue Justice Breyer's legacy of excellence. The president expressed gratitude towards Justice Breyer and highlighted the importance of nominating someone to serve on the United States Supreme Court."

But note that since the template is yours to construct, you can customize it as needed.
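As a rough sketch of such customization, the snippet below fills a differently worded template using plain string formatting, so the mechanics are visible without calling an LLM. The template wording, the chunk separator, and the `render_prompt` helper are all illustrative assumptions.

```python
custom_template = (
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer:"
)


def render_prompt(question, retrieved_chunks):
    """Join retrieved chunks into one context string and fill the template."""
    context = "\n---\n".join(retrieved_chunks)
    return custom_template.format(context=context, question=question)


rendered = render_prompt(
    "What did the president say about Justice Breyer",
    ["Justice Breyer, thank you for your service.", "I nominated Judge Ketanji Brown Jackson."],
)
print(rendered)
```

The same template string could be passed to ChatPromptTemplate.from_template, since it uses the same {context} and {question} placeholders as the original.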

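Wrap-up & resources

Weaviate is a scalable, production-ready vector store.

This integration allows Weaviate to be used with LangChain to enhance the capabilities of large language models with a robust data store. Its scalability and production-readiness make it a great choice of vector store for your LangChain applications, and it will reduce your time to production.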

