How to summarize text through iterative refinement

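LLMs can summarize and otherwise distill desired information from text, including large volumes of text. In many cases, especially when the amount of text is large relative to the size of the model's context window, it can be helpful (or necessary) to break the summarization task into smaller components.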

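Iterative refinement is one strategy for summarizing long texts. The strategy is as follows:

Split the text into smaller documents;
Summarize the first document;
Refine or update the result based on the next document;
Repeat through the sequence of documents until finished.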
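The steps above can be sketched in plain Python before any framework is involved. This is a minimal illustration, not the guide's implementation: `iterative_refine`, `toy_summarize`, and `toy_refine` are hypothetical names, and the two toy functions stand in for LLM calls.

```python
from typing import Callable, Sequence


def iterative_refine(
    chunks: Sequence[str],
    summarize: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> str:
    """Summarize the first chunk, then fold each later chunk into the summary."""
    if not chunks:
        raise ValueError("need at least one chunk to summarize")
    summary = summarize(chunks[0])
    for chunk in chunks[1:]:
        # Each call needs the running summary from the previous call,
        # so the steps run strictly one after another.
        summary = refine(summary, chunk)
    return summary


# Toy stand-ins for LLM calls (hypothetical, for illustration only).
def toy_summarize(chunk: str) -> str:
    return chunk.strip().rstrip(".")


def toy_refine(summary: str, chunk: str) -> str:
    return f"{summary}; {toy_summarize(chunk)}"


result = iterative_refine(
    ["Apples are red.", "Blueberries are blue.", "Bananas are yellow."],
    toy_summarize,
    toy_refine,
)
print(result)
```

Passing the summarizer and refiner in as callables keeps the loop independent of any particular model, so the same skeleton works with real LLM chains.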

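Note that this strategy is not parallelized. It is especially effective when understanding a sub-document depends on prior context, for instance when summarizing a novel or a body of text with an inherent sequence.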

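LangGraph, which is built on top of langchain-core, is well suited to this problem:

LangGraph allows individual steps (such as successive summarizations) to be streamed, giving greater control over execution;
LangGraph's checkpointing supports error recovery, extension with human-in-the-loop workflows, and easier integration into conversational applications.
Because it is assembled from modular components, it is also easy to extend or modify (for example, to incorporate tool calling or other behavior).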
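To make the error-recovery point concrete, here is a standard-library sketch of the idea behind checkpointing: persist the position and running summary after every step so that a failed run can resume where it stopped. It does not use LangGraph's checkpointer API. `refine_with_checkpoints` and `flaky_refine` are hypothetical, and for brevity the first document is also handled by the refine function, starting from an empty summary.

```python
import json
import os
import tempfile
from typing import Callable, List


def refine_with_checkpoints(
    chunks: List[str],
    refine: Callable[[str, str], str],
    path: str,
) -> str:
    """Run the refine loop, saving progress to `path` after every step."""
    state = {"index": 0, "summary": ""}
    if os.path.exists(path):
        # Resume from the last saved step instead of starting over.
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
    while state["index"] < len(chunks):
        new_summary = refine(state["summary"], chunks[state["index"]])
        state = {"index": state["index"] + 1, "summary": new_summary}
        with open(path, "w", encoding="utf-8") as f:
            json.dump(state, f)
    return state["summary"]


calls: List[str] = []


def flaky_refine(summary: str, chunk: str) -> str:
    """Toy refiner that fails the first time it sees chunk 'c'."""
    calls.append(chunk)
    if chunk == "c" and calls.count("c") == 1:
        raise RuntimeError("simulated transient failure")
    return f"{summary} {chunk}".strip()


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    refine_with_checkpoints(["a", "b", "c"], flaky_refine, path)
except RuntimeError:
    pass  # progress for "a" and "b" is already on disk

final = refine_with_checkpoints(["a", "b", "c"], flaky_refine, path)
print(final)
```

On the second call only the failed step is retried; the work already done for the first two chunks is not repeated.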

Below, we demonstrate how to summarize text through iterative refinement.

Load chat model

Let's first load a chat model:


pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

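Load documents

Next, we need some documents to summarize. Below, we generate some toy documents for illustration. For other data sources, see the document loader how-to guides and integration pages. The summarization tutorial also includes an example that summarizes a blog post.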

from langchain_core.documents import Document

documents = [
    Document(page_content="Apples are red", metadata={"title": "apple_book"}),
    Document(page_content="Blueberries are blue", metadata={"title": "blueberry_book"}),
    Document(page_content="Bananas are yellow", metadata={"title": "banana_book"}),
]

API Reference: Document
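The toy documents above are already small. For a genuinely long text, the first step of the strategy is to split it. The following is a minimal sketch under simple assumptions: `split_into_chunks` is a hypothetical helper that packs paragraphs greedily by character count. A real project might instead use a dedicated text splitter, such as RecursiveCharacterTextSplitter from langchain-text-splitters.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks stay within max_chars where possible; a single paragraph longer
    than max_chars is kept whole rather than cut mid-sentence.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


text = "Apples are red.\n\nBlueberries are blue.\n\nBananas are yellow."
chunks = split_into_chunks(text, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

The resulting list of strings has the same shape as the `contents` list that the graph below consumes.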

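Create graph

Below we show a LangGraph implementation of this process:

We generate a simple chain for the initial summary that takes the first document, formats it into a prompt, and runs inference with our LLM.
We generate a second chain, refine_summary_chain, that operates on each successive document, refining the initial summary.

We will need to install langgraph: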

pip install -qU langgraph

import operator
from typing import List, Literal, TypedDict

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langgraph.constants import Send
from langgraph.graph import END, START, StateGraph

# Initial summary
summarize_prompt = ChatPromptTemplate(
    [
        ("human", "Write a concise summary of the following: {context}"),
    ]
)
initial_summary_chain = summarize_prompt | llm | StrOutputParser()

# Refining the summary with new docs
refine_template = """
Produce a final summary.

Existing summary up to this point:
{existing_answer}

New context:
------------
{context}
------------

Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])

refine_summary_chain = refine_prompt | llm | StrOutputParser()


# We will define the state of the graph to hold the document
# contents and summary. We also include an index to keep track
# of our position in the sequence of documents.
class State(TypedDict):
    contents: List[str]
    index: int
    summary: str


# We define functions for each node, including a node that generates
# the initial summary:
async def generate_initial_summary(state: State, config: RunnableConfig):
    summary = await initial_summary_chain.ainvoke(
        state["contents"][0],
        config,
    )
    return {"summary": summary, "index": 1}


# And a node that refines the summary based on the next document
async def refine_summary(state: State, config: RunnableConfig):
    content = state["contents"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content},
        config,
    )

    return {"summary": summary, "index": state["index"] + 1}


# Here we implement logic to either exit the application or refine
# the summary.
def should_refine(state: State) -> Literal["refine_summary", END]:
    if state["index"] >= len(state["contents"]):
        return END
    else:
        return "refine_summary"


graph = StateGraph(State)
graph.add_node("generate_initial_summary", generate_initial_summary)
graph.add_node("refine_summary", refine_summary)

graph.add_edge(START, "generate_initial_summary")
graph.add_conditional_edges("generate_initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()

API Reference: StrOutputParser | ChatPromptTemplate | RunnableConfig | Send | StateGraph
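To see the control flow that the edges above encode, the same routing can be simulated without LangGraph or an LLM. This is a sketch for intuition only: the node bodies are toy string operations, `END` is a local sentinel rather than the langgraph constant, and `run` is a hypothetical driver loop, not how LangGraph executes a graph internally.

```python
from typing import Callable, Dict, List, Tuple

END = "__end__"  # local sentinel for this sketch; not imported from langgraph


def generate_initial_summary(state: dict) -> dict:
    # Toy "summary": reuse the first document's text.
    return {"summary": state["contents"][0], "index": 1}


def refine_summary(state: dict) -> dict:
    content = state["contents"][state["index"]]
    return {
        "summary": f'{state["summary"]} | {content}',
        "index": state["index"] + 1,
    }


def should_refine(state: dict) -> str:
    # Same rule as the graph: stop once every document has been consumed.
    return END if state["index"] >= len(state["contents"]) else "refine_summary"


NODES: Dict[str, Callable[[dict], dict]] = {
    "generate_initial_summary": generate_initial_summary,
    "refine_summary": refine_summary,
}


def run(contents: List[str]) -> Tuple[dict, List[str]]:
    state: dict = {"contents": contents}
    current = "generate_initial_summary"  # the edge out of START
    visited: List[str] = []
    while current != END:
        # Merge the node's partial update into the state.
        state = {**state, **NODES[current](state)}
        visited.append(current)
        # The same conditional edge follows both nodes.
        current = should_refine(state)
    return state, visited


final_state, visited = run(
    ["Apples are red", "Blueberries are blue", "Bananas are yellow"]
)
print(visited)
print(final_state["summary"])
```

With three documents, the initial node runs once and the refine node runs twice before the router returns the end sentinel.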

LangGraph allows the graph structure to be plotted to help visualize its function:

from IPython.display import Image

Image(app.get_graph().draw_mermaid_png())

Invoke graph

We can step through the execution as follows, printing out the summary as it is refined:

async for step in app.astream(
    {"contents": [doc.page_content for doc in documents]},
    stream_mode="values",
):
    if summary := step.get("summary"):
        print(summary)

Apples are characterized by their red color.
Apples are characterized by their red color, while blueberries are known for their blue hue.
Apples are characterized by their red color, blueberries are known for their blue hue, and bananas are recognized for their yellow color.

The final step contains the summary synthesized from the entire set of documents.
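If only the final summary is needed, the stream can be drained and the last non-empty summary kept. The sketch below uses `fake_astream`, a hypothetical stand-in for `app.astream(..., stream_mode="values")`, so that it runs without a model. It assumes the first streamed value is the input state with no summary yet, which matches the guard in the loop above. `final_summary` is likewise a hypothetical helper.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Optional


async def fake_astream(contents: List[str]) -> AsyncIterator[Dict[str, object]]:
    """Hypothetical stand-in for app.astream(..., stream_mode="values")."""
    # Assumed here: the first value is the input state, before any summary.
    yield {"contents": contents}
    summary = ""
    for index, text in enumerate(contents, start=1):
        summary = f"{summary} {text}".strip()
        yield {"contents": contents, "index": index, "summary": summary}


async def final_summary(stream: AsyncIterator[Dict[str, object]]) -> Optional[str]:
    """Return the summary from the last step that carries one, if any."""
    last: Optional[str] = None
    async for step in stream:
        summary = step.get("summary")
        if summary:
            last = str(summary)
    return last


final = asyncio.run(final_summary(fake_astream(["a", "b", "c"])))
print(final)
```

In a notebook, where an event loop is already running, `await final_summary(...)` would replace the `asyncio.run` call.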

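Next steps

See the summarization how-to guides for other summarization strategies, including those designed for larger volumes of text.

For more detail on summarization, see the summarization tutorial.

See also the LangGraph documentation for details on building with LangGraph.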
