How to handle long text when doing extraction
When working with files, like PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:
- Change LLM: Choose a different LLM that supports a larger context window.
- Brute force: Chunk the document, and extract content from each chunk.
- RAG: Chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".
Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application that you're designing!
This guide demonstrates how to implement strategies 2 and 3.
Setup
First we'll install the dependencies needed for this guide:
%pip install -qU langchain-community lxml faiss-cpu langchain-openai
Note: you may need to restart the kernel to use updated packages.
We also need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.
import re
import requests
from langchain_community.document_loaders import BSHTMLLoader
# Download the content
response = requests.get("https://en.wikipedia.org/wiki/Car")
# Write it to a file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# Load it with an HTML parser
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
# Clean up code
# Replace consecutive new lines with a single new line
document.page_content = re.sub("\n\n+", "\n", document.page_content)
print(len(document.page_content))
78865
Define the schema
Following the extraction tutorial, we will use Pydantic to define the schema of the information we wish to extract. In this case, we will extract a list of "key developments" (e.g., important historical events) that include a year and a description.
Note that we also include an evidence key and instruct the model to provide in verbatim the relevant sentences of text from the article. This allows us to compare the extraction results to (the model's reconstruction of) text from the original document.
from typing import List, Optional
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field
class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat in verbatim the sentence(s) from which the year and description information were extracted",
    )


class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]
# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
# about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic development in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)
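As comment (1) above notes, few-shot examples can improve extraction quality. A minimal stdlib sketch of what assembling such a message sequence looks like as plain dicts (the example pair and `build_messages` helper below are hypothetical, purely for illustration; in LangChain you would add examples to the prompt template itself):

```python
def build_messages(text, examples=()):
    """Build a chat message list with optional few-shot (input, output) example pairs."""
    messages = [{
        "role": "system",
        "content": (
            "You are an expert at identifying key historic developments in text. "
            "Only extract important historic developments."
        ),
    }]
    # Each example pair is inserted as a user/assistant exchange before the real input.
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": text})
    return messages


msgs = build_messages(
    "The first car appeared in 1886.",
    examples=[("Example text", '{"key_developments": []}')],
)
```

The model then sees the examples as prior turns of the conversation, which tends to anchor the output format.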
Create an extractor
Let's select an LLM. Because we are using tool-calling, we will need a model that supports a tool-calling feature. See this table for available LLMs.
pip install -qU "langchain[openai]"
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o", model_provider="openai", temperature=0)
extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    include_raw=False,
)
Brute force approach
Split the document into chunks such that each chunk fits into the context window of the LLM.
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(
    # Controls the size of each chunk
    chunk_size=2000,
    # Controls overlap between chunks
    chunk_overlap=20,
)
texts = text_splitter.split_text(document.page_content)
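TokenTextSplitter measures chunk size in tokens; the underlying split-with-overlap idea can be sketched with a character-based toy version (illustrative only, not a tokenizer):

```python
def split_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-size chunks, where each chunk repeats the
    last `overlap` characters of the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
# Each chunk shares its first 2 characters with the tail of the previous chunk.
```

The overlap is what lets a fact straddling a chunk boundary still appear whole in at least one chunk.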
Use batch functionality to run the extraction in parallel across each chunk!
You can often use .batch() to parallelize the extractions! .batch uses a threadpool under the hood to help you parallelize workloads.
If your model is exposed via an API, this will likely speed up your extraction flow!
# Limit just to the first 3 chunks
# so the code can be re-run quickly
first_few = texts[:3]
extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max concurrency!
)
)
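Since .batch() uses a threadpool under the hood, the fan-out pattern can be sketched with the standard library alone; `extract_from_chunk` below is a hypothetical stand-in for the real extractor call:

```python
from concurrent.futures import ThreadPoolExecutor


def extract_from_chunk(chunk):
    # Stand-in for extractor.invoke({"text": chunk}); here we just
    # report the chunk length to keep the sketch self-contained.
    return {"chunk_length": len(chunk)}


chunks = ["first chunk", "second chunk", "third"]
# max_workers plays the role of max_concurrency in .batch().
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(extract_from_chunk, chunks))
```

Note that `pool.map` preserves input order, just as .batch() returns results in the order of its inputs.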
Merge results
After extracting data from across the chunks, we'll want to merge the extractions together.
key_developments = []
for extraction in extractions:
    key_developments.extend(extraction.key_developments)
key_developments[:10]
[KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769, while the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769, while the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
KeyDevelopment(year=1886, description='Carl Benz invented the modern car, a practical, marketable automobile for everyday use, and patented his Benz Patent-Motorwagen.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when the German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
KeyDevelopment(year=1901, description='The Oldsmobile Curved Dash became the first mass-produced car.', evidence='The 1901 Oldsmobile Curved Dash and the 1908 Ford Model T, both American cars, are widely considered the first mass-produced[3][4] and mass-affordable[5][6][7] cars, respectively.'),
KeyDevelopment(year=1908, description='The Ford Model T became the first mass-affordable car.', evidence='The 1901 Oldsmobile Curved Dash and the 1908 Ford Model T, both American cars, are widely considered the first mass-produced[3][4] and mass-affordable[5][6][7] cars, respectively.'),
KeyDevelopment(year=1885, description='Carl Benz built the original Benz Patent-Motorwagen, the first modern car.', evidence='The original Benz Patent-Motorwagen, the first modern car, built in 1885 and awarded the patent for the concept'),
KeyDevelopment(year=1881, description='Gustave Trouvé demonstrated a three-wheeled car powered by electricity.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity.'),
KeyDevelopment(year=1888, description="Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.", evidence="In August 1888, Bertha Benz, the wife and business partner of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention."),
KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),
KeyDevelopment(year=1897, description='The first motor car in central Europe and one of the first factory-made cars in the world was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra), the Präsident automobil.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.')]
RAG-based approach
Another simple idea is to chunk up the text, but instead of extracting information from every chunk, just focus on the most relevant chunks.
Caution: It can be difficult to identify which chunks are relevant.
For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.
We suggest experimenting with your use case and determining whether this approach works or not.
To implement the RAG-based approach:
- Chunk your document(s) and index them (e.g., in a vectorstore);
- Prepend the extractor chain with a retrieval step using the vectorstore.
Here's a simple example that relies on the FAISS vectorstore.
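Before the FAISS version, the core retrieve-then-extract idea can be sketched without any dependencies, using a toy bag-of-words cosine similarity in place of learned embeddings (purely illustrative; real retrievers use embedding models):

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    query_vec = Counter(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: cosine(query_vec, Counter(c.lower().split())),
        reverse=True,
    )
    return scored[:k]


chunks = [
    "The history of cars began with steam powered vehicles.",
    "Bananas are rich in potassium.",
]
top = retrieve("key developments associated with cars", chunks, k=1)
```

The extractor then runs only on `top`, which is exactly what the retriever-prepended chain below does with real embeddings.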
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # Only extract from first document
In this case the RAG extractor is only looking at the top document.
rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch content of top doc
} | extractor
results = rag_extractor.invoke("Key developments associated with cars")
for key_development in results.key_developments:
    print(key_development)
year=2006 description='Car-sharing services in the US experienced double-digit growth in revenue and membership.' evidence='in the US, some car-sharing services have experienced double-digit growth in revenue and membership growth between 2006 and 2007.'
year=2020 description='56 million cars were manufactured worldwide, with China producing the most.' evidence='In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year. The automotive industry in China produces by far the most (20 million in 2020).'
Common issues
Different methods have their pros and cons related to cost, speed, and accuracy.
Watch out for these issues:
- Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
- Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
- LLMs can make up data. If looking for a single fact across a large text and using a brute force approach, you may end up getting more made up data.
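A minimal de-duplication sketch for the second issue, keyed on (year, description) and assuming extractions have been converted to plain dicts (with Pydantic objects you could key on `(d.year, d.description)` instead):

```python
def dedupe(developments):
    """Drop duplicate extractions, keyed on (year, description),
    keeping the first occurrence of each."""
    seen = set()
    unique = []
    for d in developments:
        key = (d["year"], d["description"])
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


merged = [
    {"year": 1886, "description": "Benz patented the Motorwagen."},
    {"year": 1886, "description": "Benz patented the Motorwagen."},  # from an overlapping chunk
    {"year": 1908, "description": "Ford Model T introduced."},
]
unique = dedupe(merged)
```

Exact-match keys only catch verbatim duplicates; near-duplicates with slightly different wording need fuzzier matching.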