Yellowbrick

Yellowbrick is an elastic, massively parallel processing (MPP) SQL database that runs in the cloud and on-premises, using Kubernetes for scale, resilience, and cloud portability. Yellowbrick is designed to address the largest and most complex business-critical data warehousing use cases. The efficiency at scale that Yellowbrick provides also enables it to be used as a high-performance and scalable vector database to store and search vectors with SQL.

Using Yellowbrick as the vector store for ChatGPT

This tutorial demonstrates how to create a simple chatbot backed by ChatGPT that uses Yellowbrick as a vector store to support Retrieval Augmented Generation (RAG). For this you will need:

  1. An account on the Yellowbrick Sandbox
  2. An API key from OpenAI

This tutorial is divided into five parts. First, we'll use langchain to create a baseline chatbot that interacts with ChatGPT without a vector store. Second, we'll create an embeddings table in Yellowbrick that will represent the vector store. Third, we'll load a series of documents (the Administration chapter of the Yellowbrick Manual). Fourth, we'll create vector representations of those documents and store them in a Yellowbrick table. Finally, we'll send the same queries to the improved chatbot to see the results.

# Install all needed libraries
%pip install --upgrade --quiet langchain
%pip install --upgrade --quiet langchain-openai langchain-community
%pip install --upgrade --quiet psycopg2-binary
%pip install --upgrade --quiet tiktoken

Setup: Enter the information used to connect to Yellowbrick and the OpenAI API

Our chatbot integrates with ChatGPT via the langchain library, so you'll first need an API key from OpenAI.

To get an API key for OpenAI:

  1. Sign up at https://platform.openai.com/
  2. Add a payment method - you are unlikely to exceed the free quota
  3. Create an API key

You'll also need your username, password, and database name from the welcome email you received when you signed up for your Yellowbrick Sandbox account.

The following should be modified to include the information for your Yellowbrick database and OpenAI API key:

# Modify these values to match your Yellowbrick Sandbox and OpenAI API Key
YBUSER = "[SANDBOX USER]"
YBPASSWORD = "[SANDBOX PASSWORD]"
YBDATABASE = "[SANDBOX_DATABASE]"
YBHOST = "trialsandbox.sandbox.aws.yellowbrickcloud.com"

OPENAI_API_KEY = "[OPENAI API KEY]"
# Import libraries and setup keys / login info
import os
import pathlib
import re
import sys
import urllib.parse as urlparse
from getpass import getpass

import psycopg2
from IPython.display import Markdown, display
from langchain.chains import LLMChain, RetrievalQAWithSourcesChain
from langchain_community.vectorstores import Yellowbrick
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Establish connection parameters to Yellowbrick. If you've signed up for Sandbox, fill in the information from your welcome mail here:
yellowbrick_connection_string = (
f"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YBDATABASE}"
)

YB_DOC_DATABASE = "sample_data"
YB_DOC_TABLE = "yellowbrick_documentation"
embedding_table = "my_embeddings"

# API Key for OpenAI. Signup at https://platform.openai.com
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

from langchain_core.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

Part 1: Create a baseline chatbot backed by ChatGPT without a vector store

We will use langchain to query ChatGPT. Without a vector store, ChatGPT will have no context with which to answer the questions.

# Set up the chat model and specific prompt
system_template = """If you don't know the answer, Make up your best guess."""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(
model_name="gpt-3.5-turbo", # Modify model_name if you have access to GPT-4
temperature=0,
max_tokens=256,
)

chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=False,
)


def print_result_simple(query):
    result = chain(query)
    output_text = f"""### Question:
{query}
### Answer:
{result['text']}
"""
    display(Markdown(output_text))


# Use the chain to query
print_result_simple("How many databases can be in a Yellowbrick Instance?")

print_result_simple("What's an easy way to add users in bulk to Yellowbrick?")

Part 2: Connect to Yellowbrick and create the embedding tables

To load the document embeddings into Yellowbrick, you should create your own table to store them. Note that the Yellowbrick database that the table is in must be UTF-8 encoded.
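
If you are unsure of the encoding, a quick check along the lines of the sketch below may help; it assumes your Yellowbrick database accepts the standard PostgreSQL SHOW server_encoding command through its PostgreSQL-compatible interface, which is not guaranteed:

# Optional sanity check (assumes the PostgreSQL-style SHOW command is supported by Yellowbrick)
check_conn = psycopg2.connect(yellowbrick_connection_string)
check_cursor = check_conn.cursor()
check_cursor.execute("SHOW server_encoding;")
print(check_cursor.fetchone()[0])  # expect something like 'UTF8'
check_cursor.close()
check_conn.close()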

Create a table in a UTF-8 database with the following schema, providing a table name of your choice:

# Establish a connection to the Yellowbrick database
try:
    conn = psycopg2.connect(yellowbrick_connection_string)
except psycopg2.Error as e:
    print(f"Error connecting to the database: {e}")
    exit(1)

# Create a cursor object using the connection
cursor = conn.cursor()

# Define the SQL statement to create a table
create_table_query = f"""
CREATE TABLE IF NOT EXISTS {embedding_table} (
doc_id uuid NOT NULL,
embedding_id smallint NOT NULL,
embedding double precision NOT NULL
)
DISTRIBUTE ON (doc_id);
truncate table {embedding_table};
"""

# Execute the SQL query to create a table
try:
    cursor.execute(create_table_query)
    print(f"Table '{embedding_table}' created successfully!")
except psycopg2.Error as e:
    print(f"Error creating table: {e}")
    conn.rollback()

# Commit changes and close the cursor and connection
conn.commit()
cursor.close()
conn.close()

Part 3: Extract the documents to index from an existing table in Yellowbrick

Extract the document paths and contents from an existing Yellowbrick table. We'll use these documents to create embeddings in the next step.

yellowbrick_doc_connection_string = (
f"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YB_DOC_DATABASE}"
)

print(yellowbrick_doc_connection_string)

# Establish a connection to the Yellowbrick database
conn = psycopg2.connect(yellowbrick_doc_connection_string)

# Create a cursor object
cursor = conn.cursor()

# Query to select all documents from the table
query = f"SELECT path, document FROM {YB_DOC_TABLE}"

# Execute the query
cursor.execute(query)

# Fetch all documents
yellowbrick_documents = cursor.fetchall()

print(f"Extracted {len(yellowbrick_documents)} documents successfully!")

# Close the cursor and connection
cursor.close()
conn.close()

Part 4: Load documents into the Yellowbrick vector store

Go through the documents, split them into digestible chunks, create the embeddings, and insert them into the Yellowbrick table. This takes around 5 minutes.

# Split documents into chunks for conversion to embeddings
DOCUMENT_BASE_URL = "https://docs.yellowbrick.com/6.7.1/" # Actual URL


separator = "\n## "  # This separator assumes the Markdown docs from the repo use ## as the logical main header most of the time
chunk_size_limit = 2000
max_chunk_overlap = 200

documents = [
    Document(
        page_content=document[1],
        metadata={"source": DOCUMENT_BASE_URL + document[0].replace(".md", ".html")},
    )
    for document in yellowbrick_documents
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size_limit,
    chunk_overlap=max_chunk_overlap,
    separators=[separator, "\n\n", "\n", ",", " ", ""],
)
split_docs = text_splitter.split_documents(documents)

docs_text = [doc.page_content for doc in split_docs]

embeddings = OpenAIEmbeddings()
vector_store = Yellowbrick.from_documents(
    documents=split_docs,
    embedding=embeddings,
    connection_string=yellowbrick_connection_string,
    table=embedding_table,
)

print(f"Created vector store with {len(documents)} documents")

Part 5: Create a chatbot that uses Yellowbrick as the vector store

Next, we add Yellowbrick as the vector store. The vector store has been populated with embeddings representing the Administration chapter of the Yellowbrick product documentation.

We will send the same queries as above to see the improved responses.

system_template = """Use the following pieces of context to answer the users question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, just say that "I don't know", don't try to make up an answer.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

vector_store = Yellowbrick(
    OpenAIEmbeddings(),
    yellowbrick_connection_string,
    embedding_table,  # Change the table name to reflect your embeddings
)

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(
model_name="gpt-3.5-turbo", # Modify model_name if you have access to GPT-4
temperature=0,
max_tokens=256,
)
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs,
)


def print_result_sources(query):
    result = chain(query)
    output_text = f"""### Question:
{query}
### Answer:
{result['answer']}
### Sources:
{result['sources']}
### All relevant sources:
{', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
"""
    display(Markdown(output_text))


# Use the chain to query

print_result_sources("How many databases can be in a Yellowbrick Instance?")

print_result_sources("Whats an easy way to add users in bulk to Yellowbrick?")

Part 6: Introduce an index to increase performance

Yellowbrick also supports indexing using the Locality-Sensitive Hashing (LSH) approach. This is an approximate nearest-neighbor search technique that trades accuracy for faster similarity search times. The index introduces two new tunable parameters:

  • The number of hyperplanes, which is provided as an argument to create_lsh_index(num_hyperplanes). The more documents you have, the more hyperplanes are needed. LSH is a form of dimensionality reduction: the original embeddings are transformed into lower-dimensional vectors, where the number of components is the same as the number of hyperplanes.
  • The Hamming distance, an integer representing the breadth of the search. Smaller Hamming distances result in faster retrieval but lower accuracy. (A toy sketch of both ideas follows this list.)
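
To build intuition for these two knobs, here is a toy, illustrative sketch of random-hyperplane LSH in plain numpy. It is not Yellowbrick's internal implementation; it only shows how hyperplanes turn an embedding into a short bit signature and how the Hamming distance between signatures controls how broadly candidates are matched:

# Toy illustration of random-hyperplane LSH (not Yellowbrick's internal implementation).
# Each hyperplane contributes one bit to a signature; the Hamming distance between
# signatures approximates how close two embeddings are in direction.
import numpy as np

rng = np.random.default_rng(0)
dim, num_hyperplanes = 1536, 8  # e.g. an OpenAI embedding size and 8 hyperplanes
hyperplanes = rng.normal(size=(num_hyperplanes, dim))


def signature(vec):
    # One bit per hyperplane: which side of that hyperplane the vector falls on.
    return (hyperplanes @ vec > 0).astype(int)


def hamming(a, b):
    # Number of differing bits; a search with a small Hamming distance only scans
    # signatures very close to the query's, trading accuracy for speed.
    return int(np.sum(a != b))


query = rng.normal(size=dim)
nearby = query + 0.1 * rng.normal(size=dim)  # a vector close to the query
unrelated = rng.normal(size=dim)

print(hamming(signature(query), signature(nearby)))  # usually small
print(hamming(signature(query), signature(unrelated)))  # usually larger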

Here's how to create an index on the embeddings we loaded into Yellowbrick. We'll also re-run the previous chat session, but this time retrieval will use the index. Note that with so few documents you won't see the benefit of the index in terms of performance.

system_template = """Use the following pieces of context to answer the users question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, just say that "I don't know", don't try to make up an answer.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

vector_store = Yellowbrick(
    OpenAIEmbeddings(),
    yellowbrick_connection_string,
    embedding_table,  # Change the table name to reflect your embeddings
)

lsh_params = Yellowbrick.IndexParams(
    Yellowbrick.IndexType.LSH, {"num_hyperplanes": 8, "hamming_distance": 2}
)
vector_store.create_index(lsh_params)

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(
model_name="gpt-3.5-turbo", # Modify model_name if you have access to GPT-4
temperature=0,
max_tokens=256,
)
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(
        k=5, search_kwargs={"index_params": lsh_params}
    ),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs,
)


def print_result_sources(query):
    result = chain(query)
    output_text = f"""### Question:
{query}
### Answer:
{result['answer']}
### Sources:
{result['sources']}
### All relevant sources:
{', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
"""
    display(Markdown(output_text))


# Use the chain to query

print_result_sources("How many databases can be in a Yellowbrick Instance?")

print_result_sources("Whats an easy way to add users in bulk to Yellowbrick?")

Next Steps:

This code can be modified to ask different questions. You can also load your own documents into the vector store. The langchain module is very flexible and can parse a large variety of files (including HTML, PDF, etc.).
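
As a minimal sketch of that idea (the URL below is purely a placeholder, and WebBaseLoader additionally requires the beautifulsoup4 package), you could load extra content with one of langchain's community document loaders, split it with the same text splitter, and append it to the existing vector store:

# Illustrative only: pull in your own content and add it to the same embeddings table.
# langchain_community ships many other loaders (PDF, HTML files, directories, etc.).
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://example.com/my-page.html")  # placeholder URL
my_docs = loader.load()
my_splits = text_splitter.split_documents(my_docs)
vector_store.add_documents(my_splits)  # generic VectorStore method; reuses the Yellowbrick table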

You can also modify this code to use Huggingface embedding models and Meta's Llama 2 LLM for a completely private chatbot experience.
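
A minimal sketch of that swap, assuming you have the sentence-transformers and llama-cpp-python packages installed and a Llama 2 model file downloaded locally (the model name, file path, and table name below are placeholders, not part of this tutorial's sandbox):

# Illustrative only: local, private models in place of the OpenAI services.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp

# If you change the embedding model, re-embed the documents into a fresh table:
# the stored vectors must come from the same model used at query time.
local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"  # placeholder embedding model
)
local_llm = LlamaCpp(model_path="/path/to/llama-2-7b-chat.gguf", temperature=0)  # placeholder path

private_vector_store = Yellowbrick(
    local_embeddings,
    yellowbrick_connection_string,
    "my_private_embeddings",  # hypothetical table for the re-embedded documents
)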

