
Vectara

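Vectara is the trusted AI Assistant and Agent platform, focused on enterprise readiness for mission-critical applications.

Vectara's serverless RAG-as-a-service provides all the components of RAG behind an easy-to-use API, including:

  1. A way to extract text from files (PDF, PPT, DOCX, etc.).
  2. ML-based chunking that provides state-of-the-art performance.
  3. The Boomerang embeddings model.
  4. Its own internal vector database, where text chunks and embedding vectors are stored.
  5. A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments (including support for hybrid search), along with multiple reranking options such as the multilingual relevance reranker, MMR, and the UDF reranker.
  6. An LLM for creating a generative summary, with citations, based on the retrieved documents (context).

See the Vectara API documentation for more information on how to use the API.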

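This notebook shows how to use the basic retrieval functionality when Vectara is used only as a vector store (without summarization), including similarity_search and similarity_search_with_score, as well as the LangChain as_retriever functionality.

To use this integration, install langchain-community with pip install -qU langchain-community.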

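Getting started

To get started, follow these steps:

  1. If you don't already have an account, sign up for a free Vectara trial. Once you have completed your sign-up you will have a Vectara customer ID. You can find your customer ID by clicking on your name at the top right of the Vectara console window.
  2. Within your account you can create one or more corpora. Each corpus represents an area that stores text data ingested from input documents. To create a corpus, use the "Create Corpus" button, then provide a name and a description for your corpus. Optionally, you can define filtering attributes and apply some advanced options. If you click on the corpus you created, you can see its name and corpus ID at the top.
  3. Next, create an API key to access the corpus. Click on the "Access Control" tab in the corpus view and then the "Create API Key" button. Give your key a name, and choose whether the key should be query-only or query+index. Click "Create" and you now have an active API key. Keep this key confidential.

To use LangChain with Vectara, you need three values: customer ID, corpus ID, and api_key. You can provide them to LangChain in two ways:

  1. Include these three variables in your environment: VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID, and VECTARA_API_KEY.

    For example, you can set these variables using os.environ and getpass as follows: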

import os
import getpass

os.environ["VECTARA_CUSTOMER_ID"] = getpass.getpass("Vectara Customer ID:")
os.environ["VECTARA_CORPUS_ID"] = getpass.getpass("Vectara Corpus ID:")
os.environ["VECTARA_API_KEY"] = getpass.getpass("Vectara API Key:")
  2. Add them to the Vectara vectorstore constructor:
vectara = Vectara(
    vectara_customer_id=vectara_customer_id,
    vectara_corpus_id=vectara_corpus_id,
    vectara_api_key=vectara_api_key
)

In this notebook we assume they are provided in the environment.
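Because the rest of the notebook relies on those environment variables, it can help to check for them up front. The following is a minimal sketch: the three variable names come from the text above, while the helper `missing_vectara_vars` is a hypothetical convenience and not part of LangChain or Vectara.

```python
import os

# Variable names taken from the text above.
REQUIRED_VARS = ("VECTARA_CUSTOMER_ID", "VECTARA_CORPUS_ID", "VECTARA_API_KEY")


def missing_vectara_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"VECTARA_CUSTOMER_ID": "123", "VECTARA_API_KEY": "key"}
print(missing_vectara_vars(example_env))
```

Calling `missing_vectara_vars()` with no argument inspects the real environment, so an empty list means all three values are available.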

import os

os.environ["VECTARA_API_KEY"] = "<YOUR_VECTARA_API_KEY>"
os.environ["VECTARA_CORPUS_ID"] = "<YOUR_VECTARA_CORPUS_ID>"
os.environ["VECTARA_CUSTOMER_ID"] = "<YOUR_VECTARA_CUSTOMER_ID>"

from langchain_community.vectorstores import Vectara
from langchain_community.vectorstores.vectara import (
    RerankConfig,
    SummaryConfig,
    VectaraQueryConfig,
)

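First, we load the State of the Union text into Vectara.

Note that we use the from_files interface, which does not require any local processing or chunking: Vectara receives the file content and performs all the necessary pre-processing, chunking, and embedding of the file into its knowledge store.

This example uses a .txt file, but the same works for many other file types.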

vectara = Vectara.from_files(["state_of_the_union.txt"])

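Basic Vectara RAG (retrieval augmented generation)

We now create a VectaraQueryConfig object to control the retrieval and summarization options:

  • We enable summarization, specifying that the LLM should use the top 7 matching chunks and respond in English
  • We enable MMR (max marginal relevance) in the retrieval process, with a diversity bias factor of 0.2
  • We want the top 10 results, with hybrid search configured with a value of 0.005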
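To build some intuition for the two retrieval settings in the bullets above, here is a small self-contained sketch. It is not Vectara's implementation: it assumes hybrid search is a linear interpolation between a neural and a lexical score (0 meaning purely neural), and it uses the textbook greedy MMR formulation with the diversity bias as the weight on redundancy. The function names and toy vectors are hypothetical.

```python
import math


def hybrid_score(neural, lexical, lambda_val):
    """Assumed interpolation: lambda_val=0 is purely neural, 1 purely lexical."""
    return (1.0 - lambda_val) * neural + lambda_val * lexical


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_rerank(query_vec, doc_vecs, k, diversity_bias):
    """Greedy MMR: return the indices of up to k documents.

    diversity_bias=0 ranks by relevance alone; larger values penalize
    candidates that resemble documents already selected.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def mmr(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return (1.0 - diversity_bias) * relevance[i] - diversity_bias * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]
print(mmr_rerank(query, docs, k=2, diversity_bias=0.0))
print(mmr_rerank(query, docs, k=2, diversity_bias=0.5))
```

With no diversity bias the two near-duplicates are returned together; with a higher bias the second slot goes to the dissimilar document instead. A small hybrid value, under the assumed interpolation, keeps the ranking almost entirely neural while letting exact keyword matches break ties.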

Using this configuration, let's create a LangChain Runnable object that encapsulates the full Vectara RAG pipeline, using the as_rag method:

summary_config = SummaryConfig(is_enabled=True, max_results=7, response_lang="eng")
rerank_config = RerankConfig(reranker="mmr", rerank_k=50, mmr_diversity_bias=0.2)
config = VectaraQueryConfig(
    k=10, lambda_val=0.005, rerank_config=rerank_config, summary_config=summary_config
)

query_str = "what did Biden say?"

rag = vectara.as_rag(config)
rag.invoke(query_str)["answer"]
"Biden addressed various topics in his statements. He highlighted the need to confront Putin by building a coalition of nations[1]. He also expressed commitment to investigating the impact of burn pits on soldiers' health, including his son's case[2]. Additionally, Biden outlined a plan to fight inflation by cutting prescription drug costs[3]. He emphasized the importance of continuing to combat COVID-19 and not just accepting living with it[4]. Furthermore, he discussed measures to weaken Russia economically and target Russian oligarchs[6]. Biden also advocated for passing the Equality Act to support LGBTQ+ Americans and condemned state laws targeting transgender individuals[7]."

We can also use the streaming interface like this:

output = {}
curr_key = None
for chunk in rag.stream(query_str):
    for key in chunk:
        if key not in output:
            output[key] = chunk[key]
        else:
            output[key] += chunk[key]
        if key == "answer":
            print(chunk[key], end="", flush=True)
        curr_key = key
Biden addressed various topics in his statements. He highlighted the importance of building coalitions to confront global challenges [1]. He also expressed commitment to investigating the impact of burn pits on soldiers' health, including his son's case [2, 4]. Additionally, Biden outlined his plan to combat inflation by cutting prescription drug costs and reducing the deficit, with support from Nobel laureates and business leaders [3]. He emphasized the ongoing fight against COVID-19 and the need to continue combating the virus [5]. Furthermore, Biden discussed measures taken to weaken Russia's economic and military strength, targeting Russian oligarchs and corrupt leaders [6]. He also advocated for passing the Equality Act to support LGBTQ+ Americans and address discriminatory state laws [7].
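To see what the accumulation loop does without calling Vectara, the sketch below replays it against a hypothetical stand-in stream. The chunk shapes (single-key dictionaries, with the "answer" text arriving in pieces) are an assumption for illustration, and `fake_stream` and `collect` are not part of any library.

```python
def fake_stream():
    """Hypothetical stand-in for rag.stream(query_str); chunk shapes are assumed."""
    yield {"answer": "Biden addressed "}
    yield {"answer": "various topics."}
    yield {"fcs": 0.42}


def collect(stream):
    """Merge streamed chunks key by key, concatenating values for repeated keys."""
    output = {}
    for chunk in stream:
        for key, value in chunk.items():
            if key not in output:
                output[key] = value
            else:
                output[key] += value
    return output


result = collect(fake_stream())
print(result["answer"])
```

Separating the merge from the printing makes it easy to reuse the collected dictionary afterwards, for example to read the score once the stream has finished.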

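Hallucination detection and Factual Consistency Score

Vectara created HHEM, an open source model that can be used to evaluate RAG responses for factual consistency.

As part of Vectara RAG, the "Factual Consistency Score" (or FCS), an improved version of the open source HHEM, is made available via the API. It is automatically included in the output of the RAG pipeline: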

summary_config = SummaryConfig(is_enabled=True, max_results=5, response_lang="eng")
rerank_config = RerankConfig(reranker="mmr", rerank_k=50, mmr_diversity_bias=0.1)
config = VectaraQueryConfig(
    k=10, lambda_val=0.005, rerank_config=rerank_config, summary_config=summary_config
)

rag = vectara.as_rag(config)
resp = rag.invoke(query_str)
print(resp["answer"])
print(f"Vectara FCS = {resp['fcs']}")
Biden addressed various topics in his statements. He highlighted the need to confront Putin by building a coalition of nations[1]. He also expressed his commitment to investigating the impact of burn pits on soldiers' health, referencing his son's experience[2]. Additionally, Biden discussed his plan to fight inflation by cutting prescription drug costs and garnering support from Nobel laureates and business leaders[4]. Furthermore, he emphasized the importance of continuing to combat COVID-19 and not merely accepting living with the virus[5]. Biden's remarks encompassed international relations, healthcare challenges faced by soldiers, economic strategies, and the ongoing battle against the pandemic.
Vectara FCS = 0.41796625
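One way to act on the score is to gate answers on a minimum value. The sketch below assumes a response dictionary shaped like the one printed above (with "answer" and "fcs" keys); the helper `is_grounded` and the 0.5 threshold are hypothetical choices for illustration, not a Vectara recommendation.

```python
def is_grounded(resp, threshold=0.5):
    """Return True when the response carries an FCS at or above the threshold.

    The threshold is an arbitrary illustrative value; tune it for your use case.
    """
    fcs = resp.get("fcs")
    return fcs is not None and fcs >= threshold


# Example using the score shown in the output above.
resp_example = {"answer": "Biden addressed various topics...", "fcs": 0.41796625}
if not is_grounded(resp_example):
    print("Low factual consistency; consider showing the sources instead.")
```

Treating a missing score as "not grounded" is a conservative design choice, so a response without an FCS is never passed through silently.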

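Vectara as a LangChain retriever

The Vectara component can also be used just as a retriever.

In this case, it behaves like any other LangChain retriever. The main use of this mode is semantic search, so here we disable summarization: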

config.summary_config.is_enabled = False
config.k = 3
retriever = vectara.as_retriever(config=config)
retriever.invoke(query_str)
[Document(page_content='He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready.  Here is what we did. We prepared extensively and carefully. We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin.', metadata={'lang': 'eng', 'section': '1', 'offset': '2160', 'len': '36', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='When they came home, many of the world’s fittest and best trained warriors were never the same. Dizziness. \n\nA cancer that would put them in a flag-draped coffin. I know. \n\nOne of those soldiers was my son Major Beau Biden. We don’t know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. But I’m committed to finding out everything we can.', metadata={'lang': 'eng', 'section': '1', 'offset': '34652', 'len': '60', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='But cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body. Danielle says Heath was a fighter to the very end. He didn’t know how to stop fighting, and neither did she. Through her pain she found purpose to demand we do better. Tonight, Danielle—we are.', metadata={'lang': 'eng', 'section': '1', 'offset': '35442', 'len': '57', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'})]

For backwards compatibility, you can also enable summarization with a retriever, in which case the summary is added as an additional Document object:

config.summary_config.is_enabled = True
config.k = 3
retriever = vectara.as_retriever(config=config)
retriever.invoke(query_str)
[Document(page_content='He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. We were ready.  Here is what we did. We prepared extensively and carefully. We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin.', metadata={'lang': 'eng', 'section': '1', 'offset': '2160', 'len': '36', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='When they came home, many of the world’s fittest and best trained warriors were never the same. Dizziness. \n\nA cancer that would put them in a flag-draped coffin. I know. \n\nOne of those soldiers was my son Major Beau Biden. We don’t know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. But I’m committed to finding out everything we can.', metadata={'lang': 'eng', 'section': '1', 'offset': '34652', 'len': '60', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content='But cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body. Danielle says Heath was a fighter to the very end. He didn’t know how to stop fighting, and neither did she. Through her pain she found purpose to demand we do better. Tonight, Danielle—we are.', metadata={'lang': 'eng', 'section': '1', 'offset': '35442', 'len': '57', 'X-TIKA:Parsed-By': 'org.apache.tika.parser.csv.TextAndCSVParser', 'Content-Encoding': 'UTF-8', 'Content-Type': 'text/plain; charset=UTF-8', 'source': 'vectara'}),
Document(page_content="Biden discussed various topics in his statements. He highlighted the importance of unity and preparation to confront challenges, such as building coalitions to address global issues [1]. Additionally, he shared personal stories about the impact of health issues on soldiers, including his son's experience with brain cancer possibly linked to burn pits [2]. Biden also outlined his plans to combat inflation by cutting prescription drug costs and emphasized the ongoing efforts to combat COVID-19, rejecting the idea of merely living with the virus [4, 5]. Overall, Biden's messages revolved around unity, healthcare challenges faced by soldiers, economic plans, and the ongoing fight against COVID-19.", metadata={'summary': True, 'fcs': 0.54751414})]
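Since the summary arrives in the same list as the retrieved passages, downstream code may want to separate the two. The sketch below relies on the 'summary': True metadata flag visible in the output above; the `Doc` dataclass is a hypothetical stand-in for a LangChain Document so the example runs without any dependencies, and `split_summary` is not a library function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_summary(documents):
    """Separate retrieved passages from summary documents using the metadata flag."""
    passages = [d for d in documents if not d.metadata.get("summary")]
    summaries = [d for d in documents if d.metadata.get("summary")]
    return passages, summaries


docs = [
    Doc("We spent months building a coalition...", {"source": "vectara"}),
    Doc("One of those soldiers was my son...", {"source": "vectara"}),
    Doc("Biden discussed various topics...", {"summary": True, "fcs": 0.54751414}),
]
passages, summaries = split_summary(docs)
print(len(passages), summaries[0].metadata["fcs"])
```

Filtering on the metadata flag avoids depending on the summary's position in the list.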

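Advanced LangChain query preprocessing with Vectara

Vectara's "RAG as a service" does much of the heavy lifting in creating question answering or chatbot chains. The integration with LangChain also lets you use additional capabilities such as query pre-processing with SelfQueryRetriever or MultiQueryRetriever. Let's look at an example using MultiQueryRetriever.

Since MQR uses an LLM, we have to set one up; here we choose ChatOpenAI: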

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)
mqr = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)


def get_summary(documents):
    return documents[-1].page_content


(mqr | get_summary).invoke(query_str)
"Biden's statement highlighted his efforts to unite freedom-loving nations against Putin's aggression, sharing information in advance to counter Russian lies and hold Putin accountable[1]. Additionally, he emphasized his commitment to military families, like Danielle Robinson, and outlined plans for more affordable housing, Pre-K for 3- and 4-year-olds, and ensuring no additional taxes for those earning less than $400,000 a year[2][3]. The statement also touched on the readiness of the West and NATO to respond to Putin's actions, showcasing extensive preparation and coalition-building efforts[4]. Heath Robinson's story, a combat medic who succumbed to cancer from burn pits, was used to illustrate the resilience and fight for better conditions[5]."

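Conceptually, a multi-query retriever rewrites the question into several variants, retrieves for each, and merges the unique results. The sketch below illustrates only that merge step under stated assumptions: the query variants are hard-coded rather than generated by an LLM, and `fake_retrieve` with its tiny index is a hypothetical stand-in for a real retriever.

```python
def multi_query_retrieve(queries, retrieve):
    """Run retrieve() for each query and keep the first occurrence of each result."""
    seen = set()
    merged = []
    for query in queries:
        for text in retrieve(query):
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged


# Hypothetical index mapping query variants to retrieved passages.
FAKE_INDEX = {
    "what did Biden say?": ["coalition passage", "burn pits passage"],
    "What topics did Biden address?": ["burn pits passage", "inflation passage"],
}


def fake_retrieve(query):
    """Stand-in retriever: look the query up in the toy index."""
    return FAKE_INDEX.get(query, [])


merged = multi_query_retrieve(list(FAKE_INDEX), fake_retrieve)
print(merged)
```

The order-preserving merge keeps results from earlier query variants first, which is one reasonable choice among several for combining the lists.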