跳到主要內容
Open In ColabOpen on GitHub

如何在進行萃取時使用參考範例

通常可以透過向 LLM 提供參考範例來提高萃取品質。

資料萃取嘗試產生在文字和其他非結構化或半結構化格式中找到的資訊的結構化表示工具呼叫 LLM 功能通常在此情境中使用。本指南示範如何建置少量工具呼叫範例,以協助引導萃取和類似應用程式的行為。

提示

雖然本指南重點介紹如何將範例與工具呼叫模型搭配使用,但此技術通常適用,並且也適用於 JSON 或更多基於提示的技術。

LangChain 在來自包含工具呼叫的 LLM 的訊息上實作了 工具呼叫屬性。如需更多詳細資訊,請參閱我們的 工具呼叫操作指南。為了建置資料萃取的參考範例,我們建置了包含以下序列的聊天歷史記錄

LangChain 採用此慣例,跨 LLM 模型提供者將工具呼叫結構化為對話。

首先,我們建置一個提示範本,其中包含這些訊息的預留位置

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
# about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are an expert extraction algorithm. "
"Only extract relevant information from the text. "
"If you do not know the value of an attribute asked "
"to extract, return null for the attribute's value.",
),
# ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
MessagesPlaceholder("examples"), # <-- EXAMPLES!
# ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
("human", "{text}"),
]
)

測試範本

from langchain_core.messages import (
HumanMessage,
)

prompt.invoke(
{"text": "this is some text", "examples": [HumanMessage(content="testing 1 2 3")]}
)
API 參考:HumanMessage
ChatPromptValue(messages=[SystemMessage(content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.", additional_kwargs={}, response_metadata={}), HumanMessage(content='testing 1 2 3', additional_kwargs={}, response_metadata={}), HumanMessage(content='this is some text', additional_kwargs={}, response_metadata={})])

定義架構

讓我們重複使用萃取教學中的人員架構。

from typing import List, Optional

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class Person(BaseModel):
"""Information about a person."""

# ^ Doc-string for the entity Person.
# This doc-string is sent to the LLM as the description of the schema Person,
# and it can help to improve extraction results.

# Note that:
# 1. Each field is an `optional` -- this allows the model to decline to extract it!
# 2. Each field has a `description` -- this description is used by the LLM.
# Having a good description can help improve extraction results.
name: Optional[str] = Field(..., description="The name of the person")
hair_color: Optional[str] = Field(
..., description="The color of the person's hair if known"
)
height_in_meters: Optional[str] = Field(..., description="Height in METERs")


class Data(BaseModel):
"""Extracted data about people."""

# Creates a model so that we can extract multiple entities.
people: List[Person]
API 參考:ChatOpenAI

定義參考範例

範例可以定義為輸入-輸出配對的清單。

每個範例都包含範例 input 文字和範例 output,顯示應從文字中萃取的內容。

重要

這有點深入,所以請隨意跳過。

範例的格式需要與使用的 API 相符(例如,工具呼叫或 JSON 模式等)。

在這裡,格式化的範例將與工具呼叫 API 預期的格式相符,因為這就是我們正在使用的。

import uuid
from typing import Dict, List, TypedDict

from langchain_core.messages import (
AIMessage,
BaseMessage,
HumanMessage,
SystemMessage,
ToolMessage,
)
from pydantic import BaseModel, Field


class Example(TypedDict):
"""A representation of an example consisting of text input and expected tool calls.

For extraction, the tool calls are represented as instances of pydantic model.
"""

input: str # This is the example text
tool_calls: List[BaseModel] # Instances of pydantic model that should be extracted


def tool_example_to_messages(example: Example) -> List[BaseMessage]:
"""Convert an example into a list of messages that can be fed into an LLM.

This code is an adapter that converts our example to a list of messages
that can be fed into a chat model.

The list of messages per example corresponds to:

1) HumanMessage: contains the content from which content should be extracted.
2) AIMessage: contains the extracted information from the model
3) ToolMessage: contains confirmation to the model that the model requested a tool correctly.

The ToolMessage is required because some of the chat models are hyper-optimized for agents
rather than for an extraction use case.
"""
messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
tool_calls = []
for tool_call in example["tool_calls"]:
tool_calls.append(
{
"id": str(uuid.uuid4()),
"args": tool_call.dict(),
# The name of the function right now corresponds
# to the name of the pydantic model
# This is implicit in the API right now,
# and will be improved over time.
"name": tool_call.__class__.__name__,
},
)
messages.append(AIMessage(content="", tool_calls=tool_calls))
tool_outputs = example.get("tool_outputs") or [
"You have correctly called this tool."
] * len(tool_calls)
for output, tool_call in zip(tool_outputs, tool_calls):
messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
return messages

接下來,讓我們定義範例,然後將它們轉換為訊息格式。

examples = [
(
"The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
Data(people=[]),
),
(
"Fiona traveled far from France to Spain.",
Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
),
]


messages = []

for text, tool_call in examples:
messages.extend(
tool_example_to_messages({"input": text, "tool_calls": [tool_call]})
)

讓我們測試提示

example_prompt = prompt.invoke({"text": "this is some text", "examples": messages})

for message in example_prompt.messages:
print(f"{message.type}: {message}")
system: content="You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value." additional_kwargs={} response_metadata={}
human: content="The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it." additional_kwargs={} response_metadata={}
ai: content='' additional_kwargs={} response_metadata={} tool_calls=[{'name': 'Data', 'args': {'people': []}, 'id': '240159b1-1405-4107-a07c-3c6b91b3d5b7', 'type': 'tool_call'}]
tool: content='You have correctly called this tool.' tool_call_id='240159b1-1405-4107-a07c-3c6b91b3d5b7'
human: content='Fiona traveled far from France to Spain.' additional_kwargs={} response_metadata={}
ai: content='' additional_kwargs={} response_metadata={} tool_calls=[{'name': 'Data', 'args': {'people': [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]}, 'id': '3fc521e4-d1d2-4c20-bf40-e3d72f1068da', 'type': 'tool_call'}]
tool: content='You have correctly called this tool.' tool_call_id='3fc521e4-d1d2-4c20-bf40-e3d72f1068da'
human: content='this is some text' additional_kwargs={} response_metadata={}

建立萃取器

讓我們選取一個 LLM。由於我們正在使用工具呼叫,因此我們需要一個支援工具呼叫功能的模型。請參閱此表以取得可用的 LLM。

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4-0125-preview", model_provider="openai", temperature=0)

依照萃取教學,我們使用 .with_structured_output 方法根據所需的架構來結構化模型輸出

runnable = prompt | llm.with_structured_output(
schema=Data,
method="function_calling",
include_raw=False,
)

沒有範例 😿

請注意,即使功能強大的模型也可能在非常簡單的測試案例中失敗!

for _ in range(5):
text = "The solar system is large, but earth has only 1 moon."
print(runnable.invoke({"text": text, "examples": []}))
people=[Person(name='earth', hair_color='null', height_in_meters='null')]
``````output
people=[Person(name='earth', hair_color='null', height_in_meters='null')]
``````output
people=[]
``````output
people=[Person(name='earth', hair_color='null', height_in_meters='null')]
``````output
people=[]

有範例 😻

參考範例有助於修正失敗!

for _ in range(5):
text = "The solar system is large, but earth has only 1 moon."
print(runnable.invoke({"text": text, "examples": messages}))
people=[]
``````output
people=[]
``````output
people=[]
``````output
people=[]
``````output
people=[]

請注意,我們可以在 Langsmith 追蹤中將少量範例視為工具呼叫。

並且我們在正面範例上保留效能

runnable.invoke(
{
"text": "My name is Harrison. My hair is black.",
"examples": messages,
}
)
Data(people=[Person(name='Harrison', hair_color='black', height_in_meters=None)])

此頁面是否對您有幫助?