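Build an Extraction Chain

In this tutorial, we will use the tool-calling features of chat models to extract structured information from unstructured text. We will also demonstrate how to use few-shot prompting in this context to improve performance.

Important

This tutorial requires langchain-core>=0.3.20 and works only with models that support tool calling.

Setup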

Jupyter Notebook

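This tutorial, like the others, is probably most convenient to run in a Jupyter notebook. Working through a guide in an interactive environment is a good way to understand it better. See the Jupyter installation instructions for how to set it up.

Installation

To install LangChain, run: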

Pip

pip install --upgrade langchain-core

Conda

conda install langchain-core -c conda-forge

For more details, see our installation guide.

LangSmith

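Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications grow more complex, it becomes crucial to be able to inspect exactly what is happening inside your chain or agent. The best way to do this is with LangSmith.

After signing up for LangSmith, make sure to set your environment variables to start logging traces: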

export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."

Or, if you are in a notebook, you can set them with the following code:

import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

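The Schema

First, we need to describe what information we want to extract from the text.

We will use Pydantic to define an example schema for extracting personal information.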

from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )

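There are two best practices when defining a schema:

Document the attributes and the schema itself: this information is sent to the LLM and is used to improve the quality of information extraction.
Do not force the LLM to make up information! Above, we used Optional for the attributes, which allows the LLM to output None when it does not know the answer.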
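
Both practices can be checked locally without calling a model. The following is a minimal sketch, assuming Pydantic v2 (model_json_schema is the v2 method name). It uses a trimmed copy of the Person schema above and confirms that the docstring and field descriptions appear in the generated JSON schema, and that an instance with no extracted values is still valid.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )


# The JSON schema carries the documentation that a tool-calling
# integration can pass along to the model (assumes Pydantic v2).
schema = Person.model_json_schema()
print(schema["description"])
print(schema["properties"]["name"]["description"])

# Every field is Optional with a default of None, so "nothing found" is valid.
empty = Person()
print(empty.name, empty.hair_color)
```

If a description is missing or vague, this is a cheap place to notice it before any model call is made.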

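Important

For best performance, document the schema well and make sure the model is not forced to return results when the text contains no information to extract.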

The Extractor

Let's create an information extractor using the schema we defined above.

from typing import Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

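API Reference: ChatPromptTemplate | MessagesPlaceholder

We need to use a model that supports function/tool calling.

Please review the documentation for the full list of models that can be used with this API.

Select chat model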

pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

structured_llm = llm.with_structured_output(schema=Person)

Let's test it out:

text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')

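Important

Extraction is generative 🤯

LLMs are generative models, so they can do some pretty cool things, such as correctly extracting a person's height in meters even though it was given in feet!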
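
The conversion in the output above is easy to verify by hand: one foot is exactly 0.3048 meters, so 6 feet is 1.8288 m, which rounds to the 1.83 the model returned. A small sketch of that check follows (the helper name feet_to_meters is ours, not part of LangChain):

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot


def feet_to_meters(feet: float) -> float:
    """Convert a length in feet to meters."""
    return feet * FEET_TO_METERS


height = feet_to_meters(6)
print(f"{height:.2f}")  # 1.83
```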

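We can inspect this run in its LangSmith trace. Note that the chat model portion of the trace reveals the exact sequence of messages sent to the model, the tools invoked, and other metadata.

Multiple Entities

In most cases, you should extract a list of entities rather than a single entity.

This is easy to achieve with Pydantic by nesting models inside one another.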

from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]

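Important

Extraction results may not be perfect here. Read on to see how to use reference examples to improve extraction quality, and check out our extraction how-to guides for more detail.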

structured_llm = llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)])

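Tip

When the schema accommodates the extraction of multiple entities, it also allows the model to extract no entities at all, by returning an empty list, if the text contains no relevant information.

This is usually a good thing! It lets you specify required attributes on an entity without forcing the model to detect that entity.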
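
To make this concrete, here is a minimal sketch using Pydantic with a hypothetical Pet entity whose name is required (it is not part of this tutorial's schema). The entity cannot be created without a name, yet the wrapper model still accepts an empty list, so nothing has to be invented when the text mentions no pets.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Pet(BaseModel):
    """Hypothetical entity with a required attribute."""

    name: str  # required: no default value


class Pets(BaseModel):
    """Wrapper that holds zero or more pets."""

    pets: List[Pet]


# An empty list is a valid "nothing found" result.
nothing_found = Pets(pets=[])
print(len(nothing_found.pets))  # 0

# A pet without a name is rejected, so required attributes stay enforced.
try:
    Pet()
    missing_name_rejected = False
except ValidationError:
    missing_name_rejected = True
print(missing_name_rejected)  # True
```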

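This run can also be inspected in its LangSmith trace.

Reference examples

The behavior of LLM applications can be steered with few-shot prompting. For chat models, this can take the form of a sequence of input and response message pairs that demonstrate the desired behavior.

For example, we can convey the meaning of a symbol with alternating user and assistant messages: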

messages = [
    {"role": "user", "content": "2 🦜 2"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "2 🦜 3"},
    {"role": "assistant", "content": "5"},
    {"role": "user", "content": "3 🦜 4"},
]

response = llm.invoke(messages)
print(response.content)

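Structured output often uses tool calling under the hood. This typically involves generating AI messages that contain tool calls, as well as tool messages that contain the results of those calls. What should the message sequence look like in this case?

Different chat model providers impose different requirements on valid message sequences. Some will accept a (repeating) message sequence of the form:

User message
AI message with tool call
Tool message with result

Others require a final AI message containing some kind of response.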
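
The two shapes can be sketched with plain role-tagged dictionaries. This is only an illustration of the ordering: the helper build_example_sequence and the dictionary keys below are hypothetical, and they are not the wire format of any particular provider or of LangChain's own message classes.

```python
from typing import Any, Dict, List, Optional


def build_example_sequence(
    text: str,
    tool_name: str,
    tool_args: Dict[str, Any],
    final_response: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build one few-shot example as an ordered list of role-tagged messages."""
    call_id = "call-1"  # placeholder ID linking the tool call to its result
    sequence: List[Dict[str, Any]] = [
        {"role": "user", "content": text},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [{"id": call_id, "name": tool_name, "args": tool_args}],
        },
        {"role": "tool", "tool_call_id": call_id, "content": "ok"},
    ]
    # Some providers additionally expect a closing assistant message.
    if final_response is not None:
        sequence.append({"role": "assistant", "content": final_response})
    return sequence


short = build_example_sequence("The ocean is vast.", "Data", {"people": []})
full = build_example_sequence(
    "The ocean is vast.", "Data", {"people": []}, final_response="Detected no people."
)
print([m["role"] for m in short])  # ['user', 'assistant', 'tool']
print([m["role"] for m in full])  # ['user', 'assistant', 'tool', 'assistant']
```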

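LangChain includes a utility function, tool_example_to_messages, that generates a valid sequence for most model providers. It simplifies the creation of structured few-shot examples by requiring only Pydantic representations of the corresponding tool calls.

Let's try it out. We can convert pairs of input strings and desired Pydantic objects into a sequence of messages that can be provided to a chat model. Under the hood, LangChain formats the tool calls into the form each provider requires.

Note: this version of tool_example_to_messages requires langchain-core>=0.3.20.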

from langchain_core.utils.function_calling import tool_example_to_messages

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
]


messages = []

for txt, tool_call in examples:
    if tool_call.people:
        # This final message is optional for some providers
        ai_response = "Detected people."
    else:
        ai_response = "Detected no people."
    messages.extend(tool_example_to_messages(txt, [tool_call], ai_response=ai_response))

API Reference: tool_example_to_messages

Inspecting the result, we see that these two example pairs generated eight messages, four per example:

for message in messages:
    message.pretty_print()

================================ Human Message =================================

The ocean is vast and blue. It's more than 20,000 feet deep.
================================== Ai Message ==================================
Tool Calls:
  Data (d8f2e054-7fb9-417f-b28f-0447a775b2c3)
 Call ID: d8f2e054-7fb9-417f-b28f-0447a775b2c3
  Args:
    people: []
================================= Tool Message =================================

You have correctly called this tool.
================================== Ai Message ==================================

Detected no people.
================================ Human Message =================================

Fiona traveled far from France to Spain.
================================== Ai Message ==================================
Tool Calls:
  Data (0178939e-a4b1-4d2a-a93e-b87f665cdfd6)
 Call ID: 0178939e-a4b1-4d2a-a93e-b87f665cdfd6
  Args:
    people: [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]
================================= Tool Message =================================

You have correctly called this tool.
================================== Ai Message ==================================

Detected people.

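Let's compare performance with and without these messages. For example, let's pass a message from which we do not intend any people to be extracted: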

message_no_extraction = {
    "role": "user",
    "content": "The solar system is large, but earth has only 1 moon.",
}

structured_llm = llm.with_structured_output(schema=Data)
structured_llm.invoke([message_no_extraction])

Data(people=[Person(name='Earth', hair_color='None', height_in_meters='0.00')])

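In this example, the model is liable to generate person records erroneously.

Because our few-shot examples include "negative" examples, they encourage the model to behave correctly in this case: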

structured_llm.invoke(messages + [message_no_extraction])

Data(people=[])

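Tip

The LangSmith trace for this run reveals the exact sequence of messages sent to the chat model, the tool calls generated, latency, token counts, and other metadata.

See this guide for more detail on extraction workflows with reference examples, including how to incorporate prompt templates and customize the generation of example messages.

Next steps

Now that you understand the basics of extraction with LangChain, you are ready to proceed to the rest of the how-to guides:

Add Examples: more detail on using reference examples to improve performance.
Handle Long Text: what should you do if the text does not fit into the LLM's context window?
Use a Parsing Approach: use a prompt-based approach to extract with models that do not support tool/function calling.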
