Yellowbrick

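Yellowbrick is an elastic, massively parallel processing (MPP) SQL database that runs in the cloud and on-premises, using Kubernetes for scale, resilience, and cloud portability. It is designed to address the largest and most complex business-critical data warehousing use cases. The efficiency at scale that Yellowbrick provides also lets it serve as a high-performance, scalable vector database for storing and searching vectors with SQL.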

Using Yellowbrick as the vector store for ChatGPT

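This tutorial demonstrates how to create a simple chatbot backed by ChatGPT that uses Yellowbrick as a vector store to support Retrieval Augmented Generation (RAG). You will need:

  1. An account on the Yellowbrick sandbox
  2. An API key from OpenAI

The tutorial is divided into six parts. First, we use langchain to build a baseline chatbot that interacts with ChatGPT without a vector store. Second, we create an embeddings table in Yellowbrick that will serve as the vector store. Third, we load a series of documents (the Administration chapter of the Yellowbrick Manual). Fourth, we create vector representations of those documents and store them in the Yellowbrick table. Fifth, we send the same queries to the improved chatbot and compare the results. Finally, we add an index to speed up retrieval.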

# Install all needed libraries
%pip install --upgrade --quiet langchain
%pip install --upgrade --quiet langchain-openai langchain-community
%pip install --upgrade --quiet psycopg2-binary
%pip install --upgrade --quiet tiktoken

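Setup: Enter the information used to connect to Yellowbrick and the OpenAI API

Our chatbot integrates with ChatGPT through the langchain library, so you first need an API key from OpenAI.

To get an API key for OpenAI:

  1. Sign up at https://platform.openai.com/
  2. Add a payment method (you are unlikely to exceed the free quota)
  3. Create an API key

You will also need the username, password, and database name from the welcome email you received when signing up for the Yellowbrick sandbox account.

Modify the following values to include the information for your Yellowbrick database and your OpenAI API key.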

# Modify these values to match your Yellowbrick Sandbox and OpenAI API Key
YBUSER = "[SANDBOX USER]"
YBPASSWORD = "[SANDBOX PASSWORD]"
YBDATABASE = "[SANDBOX_DATABASE]"
YBHOST = "trialsandbox.sandbox.aws.yellowbrickcloud.com"

OPENAI_API_KEY = "[OPENAI API KEY]"

# Import libraries and set up keys / login info
import os
import pathlib
import re
import sys
import urllib.parse as urlparse
from getpass import getpass

import psycopg2
from IPython.display import Markdown, display
from langchain.chains import LLMChain, RetrievalQAWithSourcesChain
from langchain_community.vectorstores import Yellowbrick
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

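# Establish connection parameters to Yellowbrick. If you've signed up for Sandbox, fill in the information from your welcome mail here.
# Both the user and the password are percent-encoded so that reserved characters (such as "@" or "/") do not break the URL.
yellowbrick_connection_string = (
    f"postgres://{urlparse.quote(YBUSER)}:{urlparse.quote(YBPASSWORD, safe='')}@{YBHOST}:5432/{YBDATABASE}"
)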

YB_DOC_DATABASE = "sample_data"
YB_DOC_TABLE = "yellowbrick_documentation"
embedding_table = "my_embeddings"

# API Key for OpenAI. Signup at https://platform.openai.com
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

from langchain_core.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
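The connection string above is a URL, so credentials containing reserved characters need percent-encoding. The following is a minimal sketch of that step using only the standard library; the helper name `build_connection_string` and the example credentials are hypothetical and are not part of the tutorial or of any Yellowbrick API.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, database, port=5432):
    """Build a postgres:// URL with the user and password percent-encoded."""
    # safe="" also encodes "/" so it cannot be mistaken for a path separator.
    user_q = quote(user, safe="")
    password_q = quote(password, safe="")
    return f"postgres://{user_q}:{password_q}@{host}:{port}/{database}"


# Hypothetical credentials, chosen to contain URL-reserved characters.
url = build_connection_string(
    "demo@example.com", "p@ss/w:rd", "db.example.com", "sandbox"
)
print(url)
```

Without the encoding, the "@" and "/" in the password would be read as URL delimiters and the connection attempt would likely fail or target the wrong host.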

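Part 1: Creating a baseline chatbot backed by ChatGPT without a vector store

We will use langchain to query ChatGPT. Because there is no vector store, ChatGPT will have no context in which to answer the question.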

# Set up the chat model and specific prompt
system_template = """If you don't know the answer, Make up your best guess."""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",  # Modify model_name if you have access to GPT-4
    temperature=0,
    max_tokens=256,
)

chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=False,
)


def print_result_simple(query):
    result = chain(query)
    output_text = f"""### Question:
{query}
### Answer:
{result['text']}
"""
    display(Markdown(output_text))


# Use the chain to query
print_result_simple("How many databases can be in a Yellowbrick Instance?")

print_result_simple("What's an easy way to add users in bulk to Yellowbrick?")

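Part 2: Connect to Yellowbrick and create the embedding table

To load your document embeddings into Yellowbrick, you should create your own table to store them. Note that the Yellowbrick database containing the table must be UTF-8 encoded.

Create a table in a UTF-8 database with the following schema, providing a table name of your choice: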

# Establish a connection to the Yellowbrick database
try:
    conn = psycopg2.connect(yellowbrick_connection_string)
except psycopg2.Error as e:
    print(f"Error connecting to the database: {e}")
    exit(1)

# Create a cursor object using the connection
cursor = conn.cursor()

# Define the SQL statement to create a table
create_table_query = f"""
CREATE TABLE IF NOT EXISTS {embedding_table} (
    doc_id uuid NOT NULL,
    embedding_id smallint NOT NULL,
    embedding double precision NOT NULL
)
DISTRIBUTE ON (doc_id);
truncate table {embedding_table};
"""

# Execute the SQL query to create a table
try:
    cursor.execute(create_table_query)
    print(f"Table '{embedding_table}' created successfully!")
except psycopg2.Error as e:
    print(f"Error creating table: {e}")
    conn.rollback()

# Commit changes and close the cursor and connection
conn.commit()
cursor.close()
conn.close()
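The schema above has a scalar `embedding` column rather than an array, which suggests that each row holds a single component of a document's vector, with `embedding_id` as the component index. The sketch below illustrates that layout under this assumption; the helper names `embedding_to_rows` and `rows_to_embedding` are hypothetical and the vector values are arbitrary.

```python
import uuid


def embedding_to_rows(doc_id, vector):
    """Return one (doc_id, embedding_id, embedding) tuple per vector component."""
    return [(doc_id, i, float(x)) for i, x in enumerate(vector)]


def rows_to_embedding(rows):
    """Rebuild the vector from its rows, ordered by embedding_id."""
    return [value for _, _, value in sorted(rows, key=lambda r: r[1])]


doc_id = uuid.uuid4()
rows = embedding_to_rows(doc_id, [0.12, -0.5, 0.33])
for row in rows:
    print(row)
```

Under this layout, a single document chunk embedded into a 1,536-dimensional vector would occupy 1,536 rows sharing the same `doc_id`, which is consistent with distributing the table on `doc_id`.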

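Part 3: Extract the documents to index from an existing table in Yellowbrick

Extract document paths and contents from an existing Yellowbrick table. We will use these documents to create embeddings in the next step.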

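yellowbrick_doc_connection_string = (
    f"postgres://{urlparse.quote(YBUSER)}:{urlparse.quote(YBPASSWORD, safe='')}@{YBHOST}:5432/{YB_DOC_DATABASE}"
)

# Print only the host and database: the full connection string contains the password
print(f"Connecting to {YBHOST}/{YB_DOC_DATABASE}")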

# Establish a connection to the Yellowbrick database
conn = psycopg2.connect(yellowbrick_doc_connection_string)

# Create a cursor object
cursor = conn.cursor()

# Query to select all documents from the table
query = f"SELECT path, document FROM {YB_DOC_TABLE}"

# Execute the query
cursor.execute(query)

# Fetch all documents
yellowbrick_documents = cursor.fetchall()

print(f"Extracted {len(yellowbrick_documents)} documents successfully!")

# Close the cursor and connection
cursor.close()
conn.close()

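Part 4: Load the Yellowbrick vector store with documents

Go through the documents, split them into digestible chunks, create the embeddings, and insert them into the Yellowbrick table. This takes around 5 minutes.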
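To make the role of the chunk size and chunk overlap concrete, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not the algorithm used by RecursiveCharacterTextSplitter, which also prefers to break at the configured separators; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - overlap characters after the previous
    one, so consecutive windows share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary is more likely to appear intact in at least one chunk, at the cost of storing some text twice.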

# Split documents into chunks for conversion to embeddings
DOCUMENT_BASE_URL = "https://docs.yellowbrick.com/6.7.1/" # Actual URL


separator = "\n## "  # This separator assumes Markdown docs from the repo use ## as the logical main header most of the time
chunk_size_limit = 2000
max_chunk_overlap = 200

documents = [
    Document(
        page_content=document[1],
        metadata={"source": DOCUMENT_BASE_URL + document[0].replace(".md", ".html")},
    )
    for document in yellowbrick_documents
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size_limit,
    chunk_overlap=max_chunk_overlap,
    separators=[separator, "\n\n", "\n", ",", " ", ""],
)
split_docs = text_splitter.split_documents(documents)

docs_text = [doc.page_content for doc in split_docs]

embeddings = OpenAIEmbeddings()
vector_store = Yellowbrick.from_documents(
    documents=split_docs,
    embedding=embeddings,
    connection_string=yellowbrick_connection_string,
    table=embedding_table,
)

print(f"Created vector store with {len(documents)} documents")
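The `source` metadata built above maps each Markdown path to the URL of the rendered page. A small sketch of that mapping follows, assuming the rendered pages mirror the Markdown paths; the helper name `source_url` and the example paths are hypothetical. Unlike a plain `str.replace`, it rewrites only a trailing `.md`.

```python
DOCUMENT_BASE_URL = "https://docs.yellowbrick.com/6.7.1/"


def source_url(path, base_url=DOCUMENT_BASE_URL):
    """Map a Markdown path to the URL of its rendered HTML page."""
    if path.endswith(".md"):
        # Swap only the file extension, leaving any ".md" earlier in the path alone.
        path = path[: -len(".md")] + ".html"
    return base_url + path


# Hypothetical paths, for illustration only.
print(source_url("administration/users.md"))
print(source_url("notes.md/readme.md"))
```

These URLs are what the chatbot later reports under "Sources", so a wrong mapping here would produce broken links in the answers.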

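Part 5: Creating a chatbot that uses Yellowbrick as the vector store

Next, we add Yellowbrick as a vector store. The vector store has been populated with embeddings representing the Administration chapter of the Yellowbrick product documentation.

We will send the same queries as above to see the improved responses.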
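Conceptually, retrieval embeds the question and returns the `k` stored chunks whose vectors are most similar to it. The sketch below shows that idea in plain Python with cosine similarity; treating cosine as the similarity measure is an assumption made for illustration, and the two-dimensional vectors and chunk names are invented. In the tutorial itself this ranking is performed inside Yellowbrick, not in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k):
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in stored.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy vectors, for illustration only.
stored = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.7, 0.7],
    "chunk-c": [0.0, 1.0],
}
nearest = top_k([1.0, 0.1], stored, k=2)
print(nearest)  # ['chunk-a', 'chunk-b']
```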

system_template = """Use the following pieces of context to answer the users question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, just say that "I don't know", don't try to make up an answer.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

vector_store = Yellowbrick(
    OpenAIEmbeddings(),
    yellowbrick_connection_string,
    embedding_table,  # Change the table name to reflect your embeddings
)

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",  # Modify model_name if you have access to GPT-4
    temperature=0,
    max_tokens=256,
)
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs,
)


def print_result_sources(query):
    result = chain(query)
    output_text = f"""### Question:
{query}
### Answer:
{result['answer']}
### Sources:
{result['sources']}
### All relevant sources:
{', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
"""
    display(Markdown(output_text))


# Use the chain to query

print_result_sources("How many databases can be in a Yellowbrick Instance?")

print_result_sources("What's an easy way to add users in bulk to Yellowbrick?")
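The "stuff" chain type used above places all retrieved chunks into the `{summaries}` slot of the system prompt. The following sketch approximates that assembly in plain Python. The "Content / Source" layout and the blank-line separator are assumptions about the chain's default document formatting, the helper name `stuff_context` is hypothetical, and the chunk texts are placeholders rather than real documentation.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the users question.\n"
    "----------------\n"
    "{summaries}"
)


def stuff_context(chunks):
    """Join (text, source) pairs into a single context string."""
    return "\n\n".join(
        f"Content: {text}\nSource: {source}" for text, source in chunks
    )


# Placeholder chunks, for illustration only.
retrieved = [
    ("Placeholder text of the first chunk.", "https://example.com/a.html"),
    ("Placeholder text of the second chunk.", "https://example.com/b.html"),
]
system_message = SYSTEM_TEMPLATE.format(summaries=stuff_context(retrieved))
print(system_message)
```

Because every retrieved chunk is placed in a single prompt, the number of chunks (`k`) and the chunk size together determine how much of the model's context window the prompt consumes.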

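Part 6: Introducing an index to increase performance

Yellowbrick also supports indexing using the Locality-Sensitive Hashing (LSH) approach. This is an approximate nearest-neighbor search technique that trades accuracy for faster similarity search. The index introduces two new tunable parameters:

  • The number of hyperplanes, which is provided as an argument to create_lsh_index(num_hyperplanes). The more documents, the more hyperplanes are needed. LSH is a form of dimensionality reduction: the original embeddings are transformed into lower-dimensional vectors whose number of components equals the number of hyperplanes.
  • The Hamming distance, an integer representing the breadth of the search. Smaller Hamming distances result in faster retrieval but lower accuracy.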
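The two parameters above can be illustrated with a small random-hyperplane sketch: each hyperplane contributes one bit to a vector's signature, and the Hamming distance counts the bits on which two signatures differ. This is a generic illustration of the technique, not Yellowbrick's internal implementation; the function names, the seed, and the toy vectors are all hypothetical.

```python
import random


def make_hyperplanes(num_hyperplanes, dim, seed=0):
    """Draw random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hyperplanes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    ]


def hamming_distance(sig_a, sig_b):
    """Number of positions at which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


planes = make_hyperplanes(num_hyperplanes=8, dim=4)
query = [0.9, 0.1, 0.0, 0.2]
sig_query = lsh_signature(query, planes)
sig_same_direction = lsh_signature([2 * x for x in query], planes)
sig_opposite = lsh_signature([-x for x in query], planes)

print(hamming_distance(sig_query, sig_same_direction))  # same direction: no bits differ
print(hamming_distance(sig_query, sig_opposite))  # opposite direction: all bits differ
```

A search limited to a Hamming distance of 2 would consider only stored vectors whose signatures differ from the query's in at most two bits, which is why a smaller distance examines fewer candidates and can miss true neighbors.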

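Here is how to create the index on the embeddings we loaded into Yellowbrick. We will also rerun the previous chat session, but this time retrieval will use the index. Note that for such a small number of documents, you will not see a performance benefit from indexing.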

system_template = """Use the following pieces of context to answer the users question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, just say that "I don't know", don't try to make up an answer.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

vector_store = Yellowbrick(
    OpenAIEmbeddings(),
    yellowbrick_connection_string,
    embedding_table,  # Change the table name to reflect your embeddings
)

lsh_params = Yellowbrick.IndexParams(
    Yellowbrick.IndexType.LSH, {"num_hyperplanes": 8, "hamming_distance": 2}
)
vector_store.create_index(lsh_params)

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",  # Modify model_name if you have access to GPT-4
    temperature=0,
    max_tokens=256,
)
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # k belongs in search_kwargs, as in Part 5, so that it reaches the similarity search
    retriever=vector_store.as_retriever(
        search_kwargs={"k": 5, "index_params": lsh_params}
    ),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs,
)


def print_result_sources(query):
    result = chain(query)
    output_text = f"""### Question:
{query}
### Answer:
{result['answer']}
### Sources:
{result['sources']}
### All relevant sources:
{', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
"""
    display(Markdown(output_text))


# Use the chain to query

print_result_sources("How many databases can be in a Yellowbrick Instance?")

print_result_sources("What's an easy way to add users in bulk to Yellowbrick?")

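Next Steps:

This code can be modified to ask different questions. You can also load your own documents into the vector store. The langchain module is very flexible and can parse a wide variety of files (including HTML, PDF, etc.).

You can also modify this code to use Hugging Face embedding models and Meta's Llama 2 LLM for a completely private chatbot experience.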

