Vectara
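Vectara is the trusted AI assistant and agent platform, focused on enterprise readiness for mission-critical applications.
Vectara serverless RAG-as-a-service provides all the components of RAG behind an easy-to-use API, including:
- A way to extract text from files (PDF, PPT, DOCX, etc.)
- ML-based chunking that provides state-of-the-art performance.
- The Boomerang embeddings model.
- Its own internal vector database, where text chunks and embedding vectors are stored.
- A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments (including support for hybrid search as well as multiple reranking options such as the multilingual relevance reranker, MMR, and the UDF reranker).
- An LLM that creates a generative summary based on the retrieved documents (context), including citations.
For more information on how to use the API, see the Vectara API documentation.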
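This notebook shows how to use the basic retrieval functionality when Vectara is used only as a vector store (without summarization), including similarity_search and similarity_search_with_score, as well as the LangChain as_retriever functionality.
To use this integration, install langchain-community with pip install -qU langchain-community.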
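Getting Started
To get started, follow these steps:
- If you don't already have an account, sign up for a free Vectara trial. Once you have completed the sign-up, you will have a Vectara customer ID. You can find your customer ID by clicking on your name at the top right of the Vectara console window.
- Within your account you can create one or more corpora. Each corpus represents an area that stores text data extracted from input documents. To create a corpus, use the "Create Corpus" button, then provide a name and a description for your corpus. Optionally, you can define filtering attributes and apply some advanced options. If you click on the corpus you created, you can see its name and corpus ID at the top.
- Next, create an API key to access the corpus. Click the "Access Control" tab in the corpus view and then the "Create API Key" button. Give your key a name, and choose whether you want a query-only key or a query+index key. Click "Create" and you now have an active API key. Keep this key confidential.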
To use LangChain with Vectara, you need three values: customer ID, corpus ID, and api_key. You can provide them to LangChain in two ways:
- Include these three variables in your environment: VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID, and VECTARA_API_KEY. For example, you can set them using os.environ and getpass as follows:
import os
import getpass
os.environ["VECTARA_CUSTOMER_ID"] = getpass.getpass("Vectara Customer ID:")
os.environ["VECTARA_CORPUS_ID"] = getpass.getpass("Vectara Corpus ID:")
os.environ["VECTARA_API_KEY"] = getpass.getpass("Vectara API Key:")
- Add them to the Vectara vectorstore constructor:
vectara = Vectara(
    vectara_customer_id=vectara_customer_id,
    vectara_corpus_id=vectara_corpus_id,
    vectara_api_key=vectara_api_key
)
In this notebook we assume they are provided in the environment.
import os
os.environ["VECTARA_API_KEY"] = "<YOUR_VECTARA_API_KEY>"
os.environ["VECTARA_CORPUS_ID"] = "<YOUR_VECTARA_CORPUS_ID>"
os.environ["VECTARA_CUSTOMER_ID"] = "<YOUR_VECTARA_CUSTOMER_ID>"
from langchain_community.vectorstores import Vectara
from langchain_community.vectorstores.vectara import (
    RerankConfig,
    SummaryConfig,
    VectaraQueryConfig,
)
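Because everything that follows depends on these three environment variables, a quick check before constructing the vector store can catch a missing value early. The following is a minimal sketch: `missing_vectara_env` is a hypothetical helper, not part of LangChain or Vectara, and treating values that start with `<YOUR_` as missing is an assumption based on the placeholder strings used above.

```python
import os

# Variable names taken from the setup steps above.
REQUIRED_VARS = ("VECTARA_CUSTOMER_ID", "VECTARA_CORPUS_ID", "VECTARA_API_KEY")


def missing_vectara_env(env=None):
    """Return the required variable names that are unset, empty, or placeholders.

    Hypothetical helper for illustration only; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        # Assumption: a value such as "<YOUR_VECTARA_API_KEY>" was never filled in.
        if not value or value.startswith("<YOUR_"):
            missing.append(name)
    return missing


problems = missing_vectara_env()
if problems:
    print("Missing Vectara settings:", ", ".join(problems))
```

Passing a plain dictionary instead of relying on `os.environ` makes the helper easy to exercise in isolation.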
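First we load the State of the Union text into Vectara.
Note that we use the from_files interface, which does not require any local processing or chunking: Vectara receives the file content and performs all the necessary pre-processing, chunking, and embedding of the file into its knowledge store.
In this case it uses a .txt file, but the same works for many other file types.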
vectara = Vectara.from_files(["state_of_the_union.txt"])
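Basic Vectara RAG (retrieval-augmented generation)
We now create a VectaraQueryConfig object to control the retrieval and summarization options:
- We enable summarization, specifying that we would like the LLM to pick the top 7 matching chunks and respond in English
- We apply MMR (maximal marginal relevance) in the retrieval process, with a diversity bias factor of 0.2
- We want the top 10 results, with hybrid search configured with a value of 0.005 (the lambda_val setting)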
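To give some intuition for the diversity bias factor used with MMR, here is a minimal pure-Python sketch of generic maximal marginal relevance. It is not Vectara's implementation: the `mmr_rerank` function, the toy 2-D vectors, and the exact scoring formula (relevance weighted by 1 - bias, redundancy weighted by bias) are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(query_vec, doc_vecs, k, diversity_bias):
    """Greedily pick k document indices, trading relevance against redundancy."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        # Redundancy: highest similarity to any document already selected.
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return (1 - diversity_bias) * relevance[i] - diversity_bias * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D vectors: documents 0 and 1 are near-duplicates, document 2 differs.
query = [0.8, 0.6]
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

no_diversity = mmr_rerank(query, docs, k=2, diversity_bias=0.0)
with_diversity = mmr_rerank(query, docs, k=2, diversity_bias=0.2)
print(no_diversity)    # [1, 0]
print(with_diversity)  # [1, 2]
```

With a bias of 0.0 the two near-duplicate documents are both returned; with 0.2 the second slot goes to the less relevant but distinct document.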
Using the configuration described above, let's create a LangChain Runnable object that encapsulates the full Vectara RAG pipeline, using the as_rag method:
summary_config = SummaryConfig(is_enabled=True, max_results=7, response_lang="eng")
rerank_config = RerankConfig(reranker="mmr", rerank_k=50, mmr_diversity_bias=0.2)
config = VectaraQueryConfig(
    k=10, lambda_val=0.005, rerank_config=rerank_config, summary_config=summary_config
)
query_str = "what did Biden say?"
rag = vectara.as_rag(config)
rag.invoke(query_str)["answer"]
"Biden addressed various topics in his statements. He highlighted the need to confront Putin by building a coalition of nations[1]. He also expressed commitment to investigating the impact of burn pits on soldiers' health, including his son's case[2]. Additionally, Biden outlined a plan to fight inflation by cutting prescription drug costs[3]. He emphasized the importance of continuing to combat COVID-19 and not just accepting living with it[4]. Furthermore, he discussed measures to weaken Russia economically and target Russian oligarchs[6]. Biden also advocated for passing the Equality Act to support LGBTQ+ Americans and condemned state laws targeting transgender individuals[7]."
We can also use the streaming interface like this:
output = {}
curr_key = None
for chunk in rag.stream(query_str):
    for key in chunk:
        if key not in output:
            output[key] = chunk[key]
        else:
            output[key] += chunk[key]
        if key == "answer":
            print(chunk[key], end="", flush=True)
        curr_key = key
Biden addressed various topics in his statements. He highlighted the importance of building coalitions to confront global challenges [1]. He also expressed commitment to investigating the impact of burn pits on soldiers' health, including his son's case [2, 4]. Additionally, Biden outlined his plan to combat inflation by cutting prescription drug costs and reducing the deficit, with support from Nobel laureates and business leaders [3]. He emphasized the ongoing fight against COVID-19 and the need to continue combating the virus [5]. Furthermore, Biden discussed measures taken to weaken Russia's economic and military strength, targeting Russian oligarchs and corrupt leaders [6]. He also advocated for passing the Equality Act to support LGBTQ+ Americans and address discriminatory state laws [7].
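The accumulation logic in the streaming loop can be tried without a Vectara account. The sketch below replaces `rag.stream` with a hypothetical `fake_stream` generator. The chunk shapes (dictionaries whose string values arrive piece by piece under keys such as `answer`) and the `question` key are assumptions based on how the loop above consumes the stream.

```python
def fake_stream():
    """Hypothetical stand-in for rag.stream(query_str); yields partial dicts."""
    yield {"question": "what did Biden say?"}
    yield {"answer": "Biden addressed "}
    yield {"answer": "various topics."}


def collect(stream):
    """Merge streamed chunks into one dict by concatenating values per key."""
    output = {}
    for chunk in stream:
        for key, value in chunk.items():
            if key in output:
                output[key] += value
            else:
                output[key] = value
    return output


result = collect(fake_stream())
print(result["answer"])  # Biden addressed various topics.
```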
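Hallucination detection and Factual Consistency Score
Vectara created HHEM, an open-source model that can be used to evaluate RAG responses for factual consistency.
As part of Vectara RAG, the "Factual Consistency Score" (FCS), an improved version of the open-source HHEM, is made available via the API. It is automatically included in the output of the RAG pipeline: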
summary_config = SummaryConfig(is_enabled=True, max_results=5, response_lang="eng")
rerank_config = RerankConfig(reranker="mmr", rerank_k=50, mmr_diversity_bias=0.1)
config = VectaraQueryConfig(
    k=10, lambda_val=0.005, rerank_config=rerank_config, summary_config=summary_config
)
rag = vectara.as_rag(config)
resp = rag.invoke(query_str)
print(resp["answer"])
print(f"Vectara FCS = {resp['fcs']}")
Biden addressed various topics in his statements. He highlighted the need to confront Putin by building a coalition of nations[1]. He also expressed his commitment to investigating the impact of burn pits on soldiers' health, referencing his son's experience[2]. Additionally, Biden discussed his plan to fight inflation by cutting prescription drug costs and garnering support from Nobel laureates and business leaders[4]. Furthermore, he emphasized the importance of continuing to combat COVID-19 and not merely accepting living with the virus[5]. Biden's remarks encompassed international relations, healthcare challenges faced by soldiers, economic strategies, and the ongoing battle against the pandemic.
Vectara FCS = 0.41796625
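One way to act on the score is to flag answers that fall below a chosen cutoff. The sketch below is illustrative only: `flag_low_fcs` and the 0.5 threshold are assumptions, not Vectara recommendations. It assumes the response is a dict with `answer` and `fcs` keys, as in the output above, and that higher scores mean better factual consistency.

```python
def flag_low_fcs(resp, threshold=0.5):
    """Return (answer, trusted); trusted is False when fcs is missing or low.

    Hypothetical helper; the default threshold is an arbitrary assumption.
    """
    fcs = resp.get("fcs")
    trusted = fcs is not None and fcs >= threshold
    return resp["answer"], trusted


# Sample response shaped like the output above.
resp = {
    "answer": "Biden addressed various topics in his statements.",
    "fcs": 0.41796625,
}
answer, trusted = flag_low_fcs(resp)
print(trusted)  # False
```

A suitable threshold depends on the application, so it is worth calibrating against answers you have reviewed by hand.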
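Vectara as a LangChain retriever
The Vectara component can also be used just as a retriever.
In this case, it behaves just like any other LangChain retriever. The main use of this mode is semantic search, and in this case we disable summarization: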
config.summary_config.is_enabled = False
config.k = 3
retriever = vectara.as_retriever(config=config)
retriever.invoke(query_str)
[Document(page_content='He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready. Here is what we did. We prepared extensively and carefully. We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin.', metadata={'lang': 'eng', 'section': '1', 'offset': '2160', 'len': '36', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='When they came home, many of the world’s fittest and best trained warriors were never the same. Dizziness. \n\nA cancer that would put them in a flag-draped coffin. I know. \n\nOne of those soldiers was my son Major Beau Biden. We don’t know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. But I’m committed to finding out everything we can.', metadata={'lang': 'eng', 'section': '1', 'offset': '34652', 'len': '60', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='But cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body. Danielle says Heath was a fighter to the very end. He didn’t know how to stop fighting, and neither did she. Through her pain she found purpose to demand we do better. Tonight, Danielle—we are.', metadata={'lang': 'eng', 'section': '1', 'offset': '35442', 'len': '57', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'})]
For backwards compatibility, you can also enable summarization with a retriever, in which case the summary is added as an additional Document object:
config.summary_config.is_enabled = True
config.k = 3
retriever = vectara.as_retriever(config=config)
retriever.invoke(query_str)
[Document(page_content='He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready. Here is what we did. We prepared extensively and carefully. We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin.', metadata={'lang': 'eng', 'section': '1', 'offset': '2160', 'len': '36', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='When they came home, many of the world’s fittest and best trained warriors were never the same. Dizziness. \n\nA cancer that would put them in a flag-draped coffin. I know. \n\nOne of those soldiers was my son Major Beau Biden. We don’t know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. But I’m committed to finding out everything we can.', metadata={'lang': 'eng', 'section': '1', 'offset': '34652', 'len': '60', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='But cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body. Danielle says Heath was a fighter to the very end. He didn’t know how to stop fighting, and neither did she. Through her pain she found purpose to demand we do better. Tonight, Danielle—we are.', metadata={'lang': 'eng', 'section': '1', 'offset': '35442', 'len': '57', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content="Biden discussed various topics in his statements. He highlighted the importance of unity and preparation to confront challenges, such as building coalitions to address global issues [1]. Additionally, he shared personal stories about the impact of health issues on soldiers, including his son's experience with brain cancer possibly linked to burn pits [2]. Biden also outlined his plans to combat inflation by cutting prescription drug costs and emphasized the ongoing efforts to combat COVID-19, rejecting the idea of merely living with the virus [4, 5]. Overall, Biden's messages revolved around unity, healthcare challenges faced by soldiers, economic plans, and the ongoing fight against COVID-19.", metadata={'summary': True, 'fcs': 0.54751414})]
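Since the summary arrives as one more Document, downstream code usually needs to tell it apart from the retrieved passages. Here is a minimal sketch that relies on the `summary` metadata flag visible in the output above. `Doc` is a hypothetical stand-in for LangChain's `Document` so the example runs without dependencies, `split_summary` is a helper name invented here, and the sample texts are shortened toy data.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document (same two fields)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_summary(documents):
    """Separate retrieved passages from the summary Document, if present."""
    passages = [d for d in documents if not d.metadata.get("summary")]
    summaries = [d for d in documents if d.metadata.get("summary")]
    return passages, (summaries[0] if summaries else None)


docs = [
    Doc("We spent months building a coalition.", {"source": "vectara"}),
    Doc("Biden discussed various topics.", {"summary": True, "fcs": 0.54751414}),
]
passages, summary = split_summary(docs)
print(len(passages), summary.metadata["fcs"])  # 1 0.54751414
```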
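Advanced LangChain query pre-processing with Vectara
Vectara's "RAG as a service" does a lot of the heavy lifting in creating question-answering or chatbot chains. The integration with LangChain provides the option to use additional capabilities such as query pre-processing, for example SelfQueryRetriever or MultiQueryRetriever. Let's look at an example using MultiQueryRetriever.
Since MQR uses an LLM, we have to set one up; here we choose ChatOpenAI: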
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0)
mqr = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)
def get_summary(documents):
    return documents[-1].page_content
(mqr | get_summary).invoke(query_str)
"Biden's statement highlighted his efforts to unite freedom-loving nations against Putin's aggression, sharing information in advance to counter Russian lies and hold Putin accountable[1]. Additionally, he emphasized his commitment to military families, like Danielle Robinson, and outlined plans for more affordable housing, Pre-K for 3- and 4-year-olds, and ensuring no additional taxes for those earning less than $400,000 a year[2][3]. The statement also touched on the readiness of the West and NATO to respond to Putin's actions, showcasing extensive preparation and coalition-building efforts[4]. Heath Robinson's story, a combat medic who succumbed to cancer from burn pits, was used to illustrate the resilience and fight for better conditions[5]."