如何在進行 SQL 問答時處理大型資料庫

為了針對資料庫編寫有效的查詢，我們需要向模型提供表名稱、表結構描述和特徵值，以便模型進行查詢。當表、欄位和/或高基數欄位很多時，我們不可能在每個提示中轉儲有關資料庫的完整資訊。相反地，我們必須找到僅將最相關資訊動態插入提示的方法。

在本指南中，我們示範了識別此類相關資訊並將其饋送到查詢產生步驟的方法。我們將涵蓋

識別相關的表子集；
識別相關的欄位值子集。

設定

首先，取得所需的套件並設定環境變數

%pip install --upgrade --quiet  langchain langchain-community langchain-openai

# Uncomment the below to use LangSmith. Not required.
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
# os.environ["LANGSMITH_TRACING"] = "true"

以下範例將使用與 Chinook 資料庫的 SQLite 連線。請按照這些安裝步驟在與此筆記本相同的目錄中建立 Chinook.db

將此檔案另存為 Chinook_Sqlite.sql
執行 sqlite3 Chinook.db
執行 .read Chinook_Sqlite.sql
測試 SELECT * FROM Artist LIMIT 10;

現在，Chinook.db 在我們的目錄中，我們可以透過 SQLAlchemy 驅動的 SQLDatabase 類別與其介接

from langchain_community.utilities import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///Chinook.db")
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM Artist LIMIT 10;"))

API 參考：SQLDatabase

sqlite
['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']
[(1, 'AC/DC'), (2, 'Accept'), (3, 'Aerosmith'), (4, 'Alanis Morissette'), (5, 'Alice In Chains'), (6, 'Antônio Carlos Jobim'), (7, 'Apocalyptica'), (8, 'Audioslave'), (9, 'BackBeat'), (10, 'Billy Cobham')]

多個表

我們需要在提示中包含的主要資訊之一是相關表的結構描述。當我們有很多表時，我們無法將所有結構描述放入單個提示中。在這種情況下，我們可以先提取與使用者輸入相關的表名稱，然後僅包含其結構描述。

一種簡單且可靠的方法是使用工具調用。下面，我們展示如何使用此功能來取得符合所需格式的輸出（在本例中為表名稱列表）。我們使用聊天模型的 .bind_tools 方法以 Pydantic 格式綁定工具，並將其饋送到輸出解析器，以從模型的回應中重建物件。

選擇聊天模型

pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

from langchain_core.output_parsers.openai_tools import PydanticToolsParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field


class Table(BaseModel):
    """Table in SQL database."""

    name: str = Field(description="Name of table in SQL database.")


table_names = "\n".join(db.get_usable_table_names())
system = f"""Return the names of ALL the SQL tables that MIGHT be relevant to the user question. \
The tables are:

{table_names}

Remember to include ALL POTENTIALLY RELEVANT tables, even if you're not sure that they're needed."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{input}"),
    ]
)
llm_with_tools = llm.bind_tools([Table])
output_parser = PydanticToolsParser(tools=[Table])

table_chain = prompt | llm_with_tools | output_parser

table_chain.invoke({"input": "What are all the genres of Alanis Morisette songs"})

API 參考：PydanticToolsParser | ChatPromptTemplate

[Table(name='Genre')]

這運作得很好！除了，正如我們將在下面看到的，我們實際上還需要其他幾個表。僅根據使用者問題，模型很難知道這一點。在這種情況下，我們可能會考慮透過將表分組在一起來簡化模型的工作。我們只會要求模型在「音樂」和「商業」類別之間進行選擇，然後負責從那裡選擇所有相關的表

system = """Return the names of any SQL tables that are relevant to the user question.
The tables are:

Music
Business
"""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{input}"),
    ]
)

category_chain = prompt | llm_with_tools | output_parser
category_chain.invoke({"input": "What are all the genres of Alanis Morisette songs"})

[Table(name='Music'), Table(name='Business')]

from typing import List


def get_tables(categories: List[Table]) -> List[str]:
    tables = []
    for category in categories:
        if category.name == "Music":
            tables.extend(
                [
                    "Album",
                    "Artist",
                    "Genre",
                    "MediaType",
                    "Playlist",
                    "PlaylistTrack",
                    "Track",
                ]
            )
        elif category.name == "Business":
            tables.extend(["Customer", "Employee", "Invoice", "InvoiceLine"])
    return tables


table_chain = category_chain | get_tables
table_chain.invoke({"input": "What are all the genres of Alanis Morisette songs"})

['Album',
 'Artist',
 'Genre',
 'MediaType',
 'Playlist',
 'PlaylistTrack',
 'Track',
 'Customer',
 'Employee',
 'Invoice',
 'InvoiceLine']

現在我們已經獲得了一個可以輸出任何查詢的相關表的鏈，我們可以將其與我們的 create_sql_query_chain 結合使用，後者可以接受 table_names_to_use 列表來決定提示中包含哪些表結構描述

from operator import itemgetter

from langchain.chains import create_sql_query_chain
from langchain_core.runnables import RunnablePassthrough

query_chain = create_sql_query_chain(llm, db)
# Convert "question" key to the "input" key expected by current table_chain.
table_chain = {"input": itemgetter("question")} | table_chain
# Set table_names_to_use using table_chain.
full_chain = RunnablePassthrough.assign(table_names_to_use=table_chain) | query_chain

API 參考：create_sql_query_chain | RunnablePassthrough

query = full_chain.invoke(
    {"question": "What are all the genres of Alanis Morisette songs"}
)
print(query)

SELECT DISTINCT "g"."Name"
FROM "Genre" g
JOIN "Track" t ON "g"."GenreId" = "t"."GenreId"
JOIN "Album" a ON "t"."AlbumId" = "a"."AlbumId"
JOIN "Artist" ar ON "a"."ArtistId" = "ar"."ArtistId"
WHERE "ar"."Name" = 'Alanis Morissette'
LIMIT 5;

db.run(query)

"[('Rock',)]"

您可以在此處查看此執行的 LangSmith 追蹤。

我們已經了解如何在鏈中動態包含提示中的表結構描述子集。解決此問題的另一種可能方法是讓代理自行決定何時透過提供工具來查詢表。您可以在SQL：代理指南中看到此範例。

高基數欄位

為了篩選包含專有名詞（例如地址、歌曲名稱或藝術家）的欄位，我們首先需要仔細檢查拼字，以便正確篩選資料。

一種簡單的策略是建立一個向量儲存區，其中包含資料庫中存在的所有不同專有名詞。然後，我們可以查詢該向量儲存區的每個使用者輸入，並將最相關的專有名詞注入到提示中。

首先，我們需要每個實體的唯一值，為此我們定義一個函數，將結果解析為元素列表

import ast
import re


def query_as_list(db, query):
    res = db.run(query)
    res = [el for sub in ast.literal_eval(res) for el in sub if el]
    res = [re.sub(r"\b\d+\b", "", string).strip() for string in res]
    return res


proper_nouns = query_as_list(db, "SELECT Name FROM Artist")
proper_nouns += query_as_list(db, "SELECT Title FROM Album")
proper_nouns += query_as_list(db, "SELECT Name FROM Genre")
len(proper_nouns)
proper_nouns[:5]

['AC/DC', 'Accept', 'Aerosmith', 'Alanis Morissette', 'Alice In Chains']

現在我們可以嵌入所有值並將其儲存在向量資料庫中

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vector_db = FAISS.from_texts(proper_nouns, OpenAIEmbeddings())
retriever = vector_db.as_retriever(search_kwargs={"k": 15})

API 參考：FAISS | OpenAIEmbeddings

並將查詢建構鏈放在一起，該鏈首先從資料庫檢索值並將其插入到提示中

from operator import itemgetter

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specificed, do not return more than
{top_k} rows.

Only return the SQL query with no markup or explanation.

Here is the relevant table info: {table_info}

Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:

{proper_nouns}
"""

prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])

query_chain = create_sql_query_chain(llm, db, prompt=prompt)
retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain

API 參考：ChatPromptTemplate | RunnablePassthrough

為了測試我們的鏈，讓我們看看當我們嘗試篩選「elenis moriset」（Alanis Morissette 的拼字錯誤）時，在有和沒有檢索的情況下會發生什麼

# Without retrieval
query = query_chain.invoke(
    {"question": "What are all the genres of elenis moriset songs", "proper_nouns": ""}
)
print(query)
db.run(query)

SELECT DISTINCT g.Name 
FROM Track t
JOIN Album a ON t.AlbumId = a.AlbumId
JOIN Artist ar ON a.ArtistId = ar.ArtistId
JOIN Genre g ON t.GenreId = g.GenreId
WHERE ar.Name = 'Elenis Moriset';

''

# With retrieval
query = chain.invoke({"question": "What are all the genres of elenis moriset songs"})
print(query)
db.run(query)

SELECT DISTINCT g.Name
FROM Genre g
JOIN Track t ON g.GenreId = t.GenreId
JOIN Album a ON t.AlbumId = a.AlbumId
JOIN Artist ar ON a.ArtistId = ar.ArtistId
WHERE ar.Name = 'Alanis Morissette';

"[('Rock',)]"

我們可以看見，透過檢索，我們能夠將拼字從「Elenis Moriset」更正為「Alanis Morissette」，並取回有效的結果。

解決此問題的另一種可能方法是讓代理自行決定何時查詢專有名詞。您可以在SQL：代理指南中看到此範例。

設定​

多個表​

高基數欄位​

此頁面是否對您有幫助？

設定

多個表

高基數欄位