
Unstructured

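This notebook covers how to use the Unstructured document loader to load files of many types. Unstructured currently supports loading text files, PowerPoint presentations, HTML, PDFs, images, and more.

See this guide for more instructions on setting up Unstructured locally, including the required system dependencies.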

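Overview

Integration details

Class | Package | Local | Serializable | JS support
UnstructuredLoader | langchain_unstructured | | |

Loader features

Source | Document lazy loading | Native async support
UnstructuredLoader | |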

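Setup

Credentials

By default, langchain-unstructured installs with a smaller footprint that offloads the partitioning logic to the Unstructured API, which requires an API key. If you use the local installation, you do not need an API key. To get an API key, go to this site, generate a key, and then set it in the cell below.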

import getpass
import os

if "UNSTRUCTURED_API_KEY" not in os.environ:
    os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass(
        "Enter your Unstructured API key: "
    )

Installation

Normal installation

The following packages are required to run the rest of this notebook.

# Install package, compatible with API partitioning
%pip install --upgrade --quiet langchain-unstructured unstructured-client unstructured "unstructured[pdf]" python-magic

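Local installation

If you want to run the partitioning logic locally, you need to install a combination of system dependencies, as outlined in the Unstructured documentation here.

For example, on a Mac you can install the required dependencies with the following commands: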

# base dependencies
brew install libmagic poppler tesseract

# If parsing xml / html documents:
brew install libxml2 libxslt

You can install the pip dependencies required for local partitioning with the following command:

pip install "langchain-unstructured[local]"
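
Before partitioning locally, it can help to confirm that the command-line tools from the system packages above are on your PATH. The following is a minimal sketch and not part of langchain-unstructured: the helper name check_tools is hypothetical, and the executable names are assumptions (tesseract for Tesseract, pdfinfo and pdftoppm for poppler). It does not verify shared libraries such as libmagic.

```python
import shutil

# Executables assumed to be installed by the packages above:
# tesseract ships `tesseract`; poppler ships `pdfinfo` and `pdftoppm`.
REQUIRED_TOOLS = ["tesseract", "pdfinfo", "pdftoppm"]


def check_tools(tools):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


status = check_tools(REQUIRED_TOOLS)
for tool, path in status.items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

A missing entry only means the executable was not found on PATH; it may still be installed elsewhere.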

Initialization

UnstructuredLoader can load from a variety of file types. To read all about the unstructured package, refer to its documentation. This example loads from both a text file and a PDF file.

from langchain_unstructured import UnstructuredLoader

file_paths = [
    "./example_data/layout-parser-paper.pdf",
    "./example_data/state_of_the_union.txt",
]


loader = UnstructuredLoader(file_paths)
API Reference: UnstructuredLoader

Load

docs = loader.load()

docs[0]
INFO: pikepdf C++ to Python logger bridge initialized
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')
print(docs[0].metadata)
{'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}
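
Each loaded element carries a category in its metadata, so a common next step is to keep only the categories you care about. The sketch below uses plain dictionaries as stand-ins for the loaded Document objects so that it runs without any files; the helper name filter_by_category is hypothetical, and the sample records copy only a few of the metadata keys shown above.

```python
# Stand-in records that mirror a subset of the metadata keys printed above.
elements = [
    {
        "page_content": "1 2 0 2",
        "metadata": {"category": "UncategorizedText", "page_number": 1},
    },
    {
        "page_content": "LayoutParser: A Unified Toolkit for Deep Learning "
        "Based Document Image Analysis",
        "metadata": {"category": "Title", "page_number": 1},
    },
    {
        "page_content": "Abstract. Recent advances in document image analysis "
        "(DIA) have been primarily driven by the application of neural networks.",
        "metadata": {"category": "NarrativeText", "page_number": 1},
    },
]


def filter_by_category(elements, keep):
    """Return only the elements whose metadata category is in `keep`."""
    return [e for e in elements if e["metadata"].get("category") in keep]


body = filter_by_category(elements, {"Title", "NarrativeText"})
for element in body:
    print(element["metadata"]["category"], ":", element["page_content"][:40])
```

With real loader output, the same filter would read the category from doc.metadata instead of a dictionary key.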

Lazy load

pages = []
for doc in loader.lazy_load():
    pages.append(doc)

pages[0]
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')
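
Lazy loading pays off when you do not need every element, because a generator only does work for the items you actually consume. The sketch below illustrates that with a hypothetical stand-in generator (fake_lazy_load is not part of the library) and itertools.islice; the same islice call could wrap loader.lazy_load().

```python
from itertools import islice
from typing import Iterator

produced = []  # records which elements were actually generated


def fake_lazy_load(total: int) -> Iterator[dict]:
    """Hypothetical stand-in for loader.lazy_load(): yields one record at a time."""
    for i in range(total):
        produced.append(i)
        yield {"page_content": f"element {i}", "metadata": {"page_number": 1}}


# Take only the first three records; the generator never produces the rest.
first_three = list(islice(fake_lazy_load(1000), 3))
print(len(first_three), produced)
```

Collecting everything into a list, as in the loop above, gives up this benefit, so it is mainly useful when you need random access afterwards.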

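Post-processing

If you need to post-process the unstructured elements after extraction, you can pass a list of str -> str functions to the post_processors kwarg when you instantiate UnstructuredLoader. This also applies to the other Unstructured loaders. Below is an example.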

from langchain_unstructured import UnstructuredLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredLoader(
    "./example_data/layout-parser-paper.pdf",
    post_processors=[clean_extra_whitespace],
)

docs = loader.load()

docs[5:10]
API Reference: UnstructuredLoader
[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. 
To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]
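
Because a post-processor is just a str -> str function, you can also write your own. The sketch below defines two hypothetical cleaners and applies them in list order with a small helper; the helper apply_post_processors only illustrates how such functions compose and is an assumption, not the loader's internal implementation. The sample string is adapted from the abstract output above, where line-break hyphenation such as "im- portant" survives extraction.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def join_hyphenated_breaks(text: str) -> str:
    """Rejoin words split as 'im- portant' (hyphen followed by one space)."""
    return re.sub(r"(\w)- (\w)", r"\1\2", text)


def apply_post_processors(text, post_processors):
    """Apply each str -> str function in order (illustrative helper)."""
    for fn in post_processors:
        text = fn(text)
    return text


sample = "complicate the easy  reuse of im- portant   innovations"
cleaned = apply_post_processors(
    sample, [collapse_whitespace, join_hyphenated_breaks]
)
print(cleaned)
```

Functions like these could be passed as post_processors=[collapse_whitespace, join_hyphenated_breaks]. Note that the hyphen rule is a heuristic and would also merge intentional constructs such as "pre- and post-processing".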

Unstructured API

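If you want to get up and running with smaller packages and get the most up-to-date partitioning, you can pip install unstructured-client and pip install langchain-unstructured. For more information about UnstructuredLoader, refer to the Unstructured provider page.

When you pass in your api_key and set partition_via_api=True, the loader processes your document using the hosted Unstructured serverless API. You can generate a free Unstructured API key here.

If you would like to self-host the Unstructured API or run it locally, check out the instructions here.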

import os

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path="example_data/fake.docx",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    partition_via_api=True,
)

docs = loader.load()
docs[0]
API Reference: UnstructuredLoader
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
Document(metadata={'source': 'example_data/fake.docx', 'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': '56d531394823d81787d77a04462ed096'}, page_content='Lorem ipsum dolor sit amet.')
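
If the same script should work with or without an API key, one option is to derive the loader arguments from the environment. The sketch below is only an illustration: build_loader_kwargs is a hypothetical helper, and it uses just the api_key and partition_via_api parameters described above.

```python
import os


def build_loader_kwargs(env=os.environ):
    """Hypothetical helper: choose API or local partitioning from the environment."""
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if api_key:
        # A key is available, so send documents to the hosted API.
        return {"api_key": api_key, "partition_via_api": True}
    # No key: fall back to local partitioning (requires the local install).
    return {"partition_via_api": False}


print(build_loader_kwargs({"UNSTRUCTURED_API_KEY": "example-key"}))
print(build_loader_kwargs({}))
```

The result could then be unpacked into the constructor, for example UnstructuredLoader("example_data/fake.docx", **build_loader_kwargs()).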

You can also use a single UnstructuredLoader to batch multiple files through the Unstructured API.

loader = UnstructuredLoader(
    file_path=["example_data/fake.docx", "example_data/fake-email.eml"],
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    partition_via_api=True,
)

docs = loader.load()

print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
print(docs[-1].metadata["filename"], ": ", docs[-1].page_content[:100])
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
fake.docx : Lorem ipsum dolor sit amet.
fake-email.eml : Violets are blue
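
When several files go through one loader, the results come back as a single flat list, so you may want to regroup them by source file. The sketch below does this with the filename metadata key shown above. The records are plain-dictionary stand-ins for Document objects, the "Roses are red" entry is made-up sample data, and group_by_filename is a hypothetical helper.

```python
from collections import defaultdict

# Stand-in records; only the filename key from the real metadata is used.
elements = [
    {"page_content": "Lorem ipsum dolor sit amet.", "metadata": {"filename": "fake.docx"}},
    {"page_content": "Roses are red", "metadata": {"filename": "fake-email.eml"}},
    {"page_content": "Violets are blue", "metadata": {"filename": "fake-email.eml"}},
]


def group_by_filename(elements):
    """Group element texts by the file they came from, preserving order."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element["metadata"]["filename"]].append(element["page_content"])
    return dict(grouped)


grouped = group_by_filename(elements)
for filename, texts in grouped.items():
    print(filename, ":", len(texts), "element(s)")
```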

Unstructured SDK Client

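Partitioning with the Unstructured API relies on the Unstructured SDK Client.

If you want to customize the client, you must pass an UnstructuredClient instance to UnstructuredLoader. The example below shows how to customize features of the client, such as using your own requests.Session(), passing an alternative server_url, and customizing the RetryConfig object. For more information about customizing the client, or about the additional parameters the SDK client accepts, refer to the Unstructured Python SDK docs and the client section of the API Parameters docs. Note that all API parameters should be passed to UnstructuredLoader.

Warning: The example below may not use the latest version of UnstructuredClient, and there could be breaking changes in future releases. For the latest examples, refer to the Unstructured Python SDK docs.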
import os

import requests
from langchain_unstructured import UnstructuredLoader
from unstructured_client import UnstructuredClient
from unstructured_client.utils import BackoffStrategy, RetryConfig

client = UnstructuredClient(
    # Note: the client API param is "api_key_auth" instead of "api_key"
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    client=requests.Session(),  # Define your own requests session
    server_url="https://api.unstructuredapp.io/general/v0/general",  # Define your own api url
    retry_config=RetryConfig(
        strategy="backoff",
        retry_connection_errors=True,
        backoff=BackoffStrategy(
            initial_interval=500,
            max_interval=60000,
            exponent=1.5,
            max_elapsed_time=900000,
        ),
    ),  # Define your own retry config
)

loader = UnstructuredLoader(
    "./example_data/layout-parser-paper.pdf",
    partition_via_api=True,
    client=client,
    split_pdf_page=True,
    split_pdf_page_range=[1, 10],
)

docs = loader.load()

print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
API Reference: UnstructuredLoader
INFO: Preparing to split document for partition.
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 10 (10 total)
INFO: Determined optimal split size of 2 pages.
INFO: Partitioning 5 files with 2 page(s) each.
INFO: Partitioning set #1 (pages 1-2).
INFO: Partitioning set #2 (pages 3-4).
INFO: Partitioning set #3 (pages 5-6).
INFO: Partitioning set #4 (pages 7-8).
INFO: Partitioning set #5 (pages 9-10).
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
INFO: Successfully partitioned the document.
layout-parser-paper.pdf : LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
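
To get a feel for the BackoffStrategy values used above, the sketch below computes a nominal exponential backoff schedule from initial_interval, exponent, and max_interval. This is a generic illustration under stated assumptions, not the SDK's actual retry code: it assumes the intervals are in milliseconds, ignores any jitter the client may add, and does not model max_elapsed_time. The function name backoff_intervals is hypothetical.

```python
def backoff_intervals(initial_interval, max_interval, exponent, attempts):
    """Return the nominal wait before each retry, capped at max_interval."""
    return [
        min(initial_interval * exponent**n, max_interval)
        for n in range(attempts)
    ]


# Same numbers as the BackoffStrategy in the example above.
schedule = backoff_intervals(
    initial_interval=500, max_interval=60000, exponent=1.5, attempts=6
)
print(schedule)
```

Under these assumptions the wait grows by 50% per retry (500, 750, 1125, and so on) until it reaches the 60000 cap.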

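Chunking

UnstructuredLoader does not support mode as a parameter for grouping text, as the older UnstructuredFileLoader and other loaders did. It supports "chunking" instead. Chunking in Unstructured differs from other chunking mechanisms you may be familiar with, which form chunks based on plain-text features such as the character sequences "\n\n" or "\n" that might indicate a paragraph or list-item boundary. Instead, every document is partitioned into semantic units (document elements) using specific knowledge about its format, and text splitting is needed only when a single element exceeds the desired maximum chunk size. In general, chunking combines consecutive elements to form chunks that are as large as possible without exceeding the maximum chunk size. Chunking produces a sequence of CompositeElement, Table, or TableChunk elements. Each "chunk" is an instance of one of these three types.

See this page for more details about chunking options. To reproduce the same behavior as mode="single", you can set chunking_strategy="basic", max_characters=<some-really-big-number>, and include_orig_elements=False.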

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    "./example_data/layout-parser-paper.pdf",
    chunking_strategy="basic",
    max_characters=1000000,
    include_orig_elements=False,
)

docs = loader.load()

print("Number of LangChain documents:", len(docs))
print("Length of text in the document:", len(docs[0].page_content))
API Reference: UnstructuredLoader
Number of LangChain documents: 1
Length of text in the document: 42772
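
The combining behavior described in this section can be sketched in a few lines. The function below is a simplified illustration, not Unstructured's implementation: basic_chunk is a hypothetical name, elements are plain strings, the separator is an assumption, tables are not handled, and oversized elements are cut at fixed character offsets rather than at sensible text boundaries.

```python
def basic_chunk(elements, max_characters, separator="\n\n"):
    """Greedily combine consecutive element texts into chunks of limited size."""
    chunks = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # A single oversized element: flush what we have, then split it.
            if current:
                chunks.append(current)
                current = ""
            for start in range(0, len(text), max_characters):
                chunks.append(text[start : start + max_characters])
            continue
        candidate = text if not current else current + separator + text
        if len(candidate) <= max_characters:
            current = candidate  # still fits, keep growing the chunk
        else:
            chunks.append(current)  # chunk is full, start a new one
            current = text
    if current:
        chunks.append(current)
    return chunks


elements = ["Title", "Short paragraph.", "x" * 30]
small = basic_chunk(elements, max_characters=25, separator=" ")
single = basic_chunk(elements, max_characters=1_000_000, separator=" ")
print([len(c) for c in small], len(single))
```

With a very large max_characters everything lands in one chunk, which is why the settings above behave like the old mode="single".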

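Loading web pages

When run locally, UnstructuredLoader accepts a web_url kwarg that populates the url parameter of the underlying Unstructured partition. This allows parsing of remotely hosted documents, such as HTML web pages.

Example usage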

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(web_url="https://www.example.com")
docs = loader.load()

for doc in docs:
    print(f"{doc}\n")
API Reference: UnstructuredLoader
page_content='Example Domain' metadata={'category_depth': 0, 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': 'fdaa78d856f9d143aeeed85bf23f58f8'}

page_content='This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.' metadata={'languages': ['eng'], 'parent_id': 'fdaa78d856f9d143aeeed85bf23f58f8', 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'NarrativeText', 'element_id': '3652b8458b0688639f973fe36253c992'}

page_content='More information...' metadata={'category_depth': 0, 'link_texts': ['More information...'], 'link_urls': ['https://www.iana.org/domains/example'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': '793ab98565d6f6d6f3a6d614e3ace2a9'}
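
As the last element above shows, HTML elements can carry link_texts and link_urls in their metadata. The sketch below collects those into (text, url) pairs. It uses plain dictionaries copied from a subset of the output above as stand-ins for Document objects, and collect_links is a hypothetical helper name.

```python
# Stand-in records copied from a subset of the metadata printed above.
elements = [
    {
        "page_content": "Example Domain",
        "metadata": {"category": "Title", "url": "https://www.example.com"},
    },
    {
        "page_content": "More information...",
        "metadata": {
            "category": "Title",
            "link_texts": ["More information..."],
            "link_urls": ["https://www.iana.org/domains/example"],
            "url": "https://www.example.com",
        },
    },
]


def collect_links(elements):
    """Return (link text, link URL) pairs from every element that has links."""
    links = []
    for element in elements:
        meta = element["metadata"]
        # Elements without links simply lack these keys.
        links.extend(zip(meta.get("link_texts", []), meta.get("link_urls", [])))
    return links


links = collect_links(elements)
print(links)
```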

API reference

For detailed documentation of all UnstructuredLoader features and configurations, head to the API reference: https://langchain-python.dev.org.tw/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html

