
How to deal with high-cardinality categoricals when doing query analysis

You may want to do query analysis to create a filter on a categorical column. One of the difficulties here is that you usually need to specify the EXACT categorical value, which means you need to make sure the LLM generates that value exactly. This can be done relatively easily with prompting when there are only a few valid values. When the number of valid values is high, it becomes more difficult: the values may not fit in the LLM context, or (if they do) there may be too many for the LLM to properly attend to.
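To make the failure mode concrete, here is a minimal, hypothetical illustration (the book list and author names are made up): an exact-match filter returns nothing when the generated value is even slightly off.

```python
# Hypothetical in-memory "library" — an exact-match filter on `author`
# only works if the value matches character for character.
books = [
    {"title": "The Martian Colony", "author": "Jesse Knight"},
    {"title": "Signals from Orion", "author": "Jesse Knight"},
]

exact = [b for b in books if b["author"] == "Jesse Knight"]
misspelled = [b for b in books if b["author"] == "Jess Knight"]

print(len(exact))       # 2
print(len(misspelled))  # 0 — one missing letter and the filter finds nothing
```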

In this notebook we take a look at how to approach this.

Setup

Install dependencies

%pip install -qU langchain langchain-community langchain-openai faker langchain-chroma
Note: you may need to restart the kernel to use updated packages.

Set environment variables

We'll use OpenAI in this example

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Optional, uncomment to trace runs with LangSmith. Sign up here: https://smith.langchain.com.
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

Set up data

We will generate a bunch of fake names

from faker import Faker

fake = Faker()

names = [fake.name() for _ in range(10000)]

Let's look at some of the names

names[0]
'Jacob Adams'
names[567]
'Eric Acevedo'

Query analysis

We can now set up a baseline query analysis

from pydantic import BaseModel, Field, model_validator


class Search(BaseModel):
    query: str
    author: str
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

system = """Generate a relevant search query for a library system"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(Search)
query_analyzer = {"question": RunnablePassthrough()} | prompt | structured_llm

We can see that if we spell the name exactly correctly, it knows how to handle it

query_analyzer.invoke("what are books about aliens by Jesse Knight")
Search(query='aliens', author='Jesse Knight')

The issue is that the values you want to filter on may NOT be spelled exactly correctly

query_analyzer.invoke("what are books about aliens by jess knight")
Search(query='aliens', author='Jess Knight')
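Note that simple string normalization does not rescue this: the model produced "Jess Knight", and no amount of case-fixing turns it into "Jesse Knight". A small illustrative snippet (the list of valid authors is assumed):

```python
valid_authors = ["Jesse Knight", "John Knight"]  # assumed valid values

generated = "jess knight"
# Title-casing fixes the capitalization, but not the missing letter.
normalized = generated.title()
print(normalized)                   # 'Jess Knight'
print(normalized in valid_authors)  # False
```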

Add in all values

One way around this is to add ALL the possible values to the prompt. That will generally guide the query in the right direction

system = """Generate a relevant search query for a library system.

`author` attribute MUST be one of:

{authors}

Do NOT hallucinate author name!"""
base_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
prompt = base_prompt.partial(authors=", ".join(names))
query_analyzer_all = {"question": RunnablePassthrough()} | prompt | structured_llm

But... if the list of categorical values is long enough, it may error!

try:
    res = query_analyzer_all.invoke("what are books about aliens by jess knight")
except Exception as e:
    print(e)

We can try to use a longer context window... but with that much information in there, it is not guaranteed to pick it up reliably

llm_long = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
structured_llm_long = llm_long.with_structured_output(Search)
query_analyzer_all = {"question": RunnablePassthrough()} | prompt | structured_llm_long
query_analyzer_all.invoke("what are books about aliens by jess knight")
Search(query='aliens', author='jess knight')
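A rough back-of-the-envelope calculation shows why 10,000 names strain the context window. The ~4 characters per token figure is a common heuristic, not an exact tokenizer count, and the names below are a synthetic stand-in for the faker-generated list:

```python
# Synthetic stand-in for the 10,000 faker names above.
names = [f"Firstname Lastname{i:05d}" for i in range(10_000)]

joined = ", ".join(names)
approx_tokens = len(joined) // 4  # rough heuristic: ~4 characters per token

print(len(joined))    # roughly a quarter of a million characters
print(approx_tokens)  # on the order of tens of thousands of tokens
```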

Find all relevant values

Instead, what we can do is create an index over the relevant values and then query that index for the N most relevant values.

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_texts(names, embeddings, collection_name="author_names")
def select_names(question):
    _docs = vectorstore.similarity_search(question, k=10)
    _names = [d.page_content for d in _docs]
    return ", ".join(_names)


create_prompt = {
    "question": RunnablePassthrough(),
    "authors": select_names,
} | base_prompt

query_analyzer_select = create_prompt | structured_llm
create_prompt.invoke("what are books by jess knight")
ChatPromptValue(messages=[SystemMessage(content='Generate a relevant search query for a library system.\n\n`author` attribute MUST be one of:\n\nJennifer Knight, Jill Knight, John Knight, Dr. Jeffrey Knight, Christopher Knight, Andrea Knight, Brandy Knight, Jennifer Keller, Becky Chambers, Sarah Knapp\n\nDo NOT hallucinate author name!'), HumanMessage(content='what are books by jess knight')])
query_analyzer_select.invoke("what are books about aliens by jess knight")
Search(query='books about aliens', author='Jennifer Knight')

Replace after selection

Another method is to let the LLM fill in whatever value, then convert that value to a valid one. This can actually be done with the Pydantic class itself!

class Search(BaseModel):
    query: str
    author: str

    @model_validator(mode="before")
    @classmethod
    def double(cls, values: dict) -> dict:
        # Replace whatever the LLM produced with the closest indexed name.
        author = values["author"]
        closest_valid_author = vectorstore.similarity_search(author, k=1)[
            0
        ].page_content
        values["author"] = closest_valid_author
        return values


system = """Generate a relevant search query for a library system"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
corrective_structure_llm = llm.with_structured_output(Search)
corrective_query_analyzer = (
    {"question": RunnablePassthrough()} | prompt | corrective_structure_llm
)
corrective_query_analyzer.invoke("what are books about aliens by jes knight")
Search(query='aliens', author='John Knight')
# TODO: show trigram similarity
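The TODO above hints at trigram similarity as an alternative to embedding search. As a self-contained sketch (not part of the original notebook), a Jaccard similarity over character trigrams — padded roughly the way PostgreSQL's pg_trgm module does — can pick the closest valid author without any embeddings; the list of valid authors here is assumed:

```python
def trigrams(s: str) -> set:
    # Lowercase and pad so word boundaries produce trigrams too.
    s = f"  {s.lower()} "
    return {s[i : i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    # Jaccard similarity of the two trigram sets.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


valid_authors = ["Jennifer Knight", "Jesse Knight", "John Knight"]  # assumed values
best = max(valid_authors, key=lambda name: trigram_similarity("jes knight", name))
print(best)  # 'Jesse Knight'
```

Unlike the embedding lookup, this is purely lexical, so it handles misspellings well but knows nothing about semantic similarity.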
