跳到主要內容
Open In ColabOpen on GitHub

Unstructured

本筆記本涵蓋如何使用 Unstructured 文件載入器 來載入多種檔案類型。Unstructured 目前支援載入文字檔、PowerPoint、html、pdf、圖片等等。

請參閱本指南,以取得更多關於在本機設定 Unstructured 的說明,包括設定必要的系統相依性。

總覽

整合細節

類別套件本地可序列化JS 支援
UnstructuredLoaderlangchain_unstructured

載入器功能

來源文件延遲載入原生非同步支援
UnstructuredLoader

設定

憑證

預設情況下,langchain-unstructured 安裝的佔用空間較小,需要將分割邏輯卸載到 Unstructured API,這需要 API 金鑰。如果您使用本機安裝,則不需要 API 金鑰。要取得您的 API 金鑰,請前往 此網站 並取得 API 金鑰,然後在下面的儲存格中設定它

import getpass
import os

if "UNSTRUCTURED_API_KEY" not in os.environ:
os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass(
"Enter your Unstructured API key: "
)

安裝

一般安裝

執行此筆記本的其餘部分需要以下套件。

# Install package, compatible with API partitioning
%pip install --upgrade --quiet langchain-unstructured unstructured-client unstructured "unstructured[pdf]" python-magic

本機安裝

如果您想在本機執行分割邏輯,您需要安裝系統相依性的組合,如 此處的 Unstructured 文件 中所述。

例如,在 Mac 上,您可以使用以下命令安裝所需的相依性

# base dependencies
brew install libmagic poppler tesseract

# If parsing xml / html documents:
brew install libxml2 libxslt

您可以使用以下命令安裝本機所需的 pip 相依性

pip install "langchain-unstructured[local]"

初始化

UnstructuredLoader 允許從各種不同的檔案類型載入。要閱讀關於 unstructured 套件的所有資訊,請參閱他們的 文件。在本範例中,我們展示從文字檔和 PDF 檔案載入。

from langchain_unstructured import UnstructuredLoader

file_paths = [
"./example_data/layout-parser-paper.pdf",
"./example_data/state_of_the_union.txt",
]


loader = UnstructuredLoader(file_paths)
API 參考:UnstructuredLoader

載入

docs = loader.load()

docs[0]
INFO: pikepdf C++ to Python logger bridge initialized
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')
print(docs[0].metadata)
{'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}

延遲載入

pages = []
for doc in loader.lazy_load():
pages.append(doc)

pages[0]
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')

後處理

如果您需要在提取後對 unstructured 元素進行後處理,您可以將 str -> str 函數列表傳遞給實例化 UnstructuredLoader 時的 post_processors kwarg。這也適用於其他 Unstructured 載入器。以下是一個範例。

from langchain_unstructured import UnstructuredLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredLoader(
"./example_data/layout-parser-paper.pdf",
post_processors=[clean_extra_whitespace],
)

docs = loader.load()

docs[5:10]
API 參考:UnstructuredLoader
[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]

Unstructured API

如果您想使用較小的套件快速啟動並運行,並獲得最新的分割功能,您可以 pip install unstructured-clientpip install langchain-unstructured。有關 UnstructuredLoader 的更多資訊,請參閱 Unstructured 供應商頁面

當您傳入 api_key 並設定 partition_via_api=True 時,載入器將使用託管的 Unstructured 無伺服器 API 處理您的文件。您可以 在此處 產生免費的 Unstructured API 金鑰。

如果您想自行託管 Unstructured API 或在本機執行,請查看 此處 的說明。

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
file_path="example_data/fake.docx",
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_via_api=True,
)

docs = loader.load()
docs[0]
API 參考:UnstructuredLoader
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
Document(metadata={'source': 'example_data/fake.docx', 'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': '56d531394823d81787d77a04462ed096'}, page_content='Lorem ipsum dolor sit amet.')

您也可以使用 UnstructuredLoader 在單一 API 中透過 Unstructured API 批次處理多個檔案。

loader = UnstructuredLoader(
file_path=["example_data/fake.docx", "example_data/fake-email.eml"],
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_via_api=True,
)

docs = loader.load()

print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
print(docs[-1].metadata["filename"], ": ", docs[-1].page_content[:100])
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
``````output
fake.docx : Lorem ipsum dolor sit amet.
fake-email.eml : Violets are blue

Unstructured SDK 用戶端

使用 Unstructured API 進行分割依賴於 Unstructured SDK 用戶端

如果您想自訂用戶端,您必須將 UnstructuredClient 實例傳遞給 UnstructuredLoader。以下範例展示如何自訂用戶端的功能,例如使用您自己的 requests.Session()、傳遞替代的 server_url 以及自訂 RetryConfig 物件。有關自訂用戶端或 sdk 用戶端接受哪些其他參數的更多資訊,請參閱 Unstructured Python SDK 文件和 API 參數 文件的用戶端章節。請注意,所有 API 參數都應傳遞給 UnstructuredLoader

警告: 以下範例可能未使用最新版本的 UnstructuredClient,並且未來版本可能會出現重大變更。如需最新的範例,請參閱 Unstructured Python SDK 文件。
import requests
from langchain_unstructured import UnstructuredLoader
from unstructured_client import UnstructuredClient
from unstructured_client.utils import BackoffStrategy, RetryConfig

client = UnstructuredClient(
api_key_auth=os.getenv(
"UNSTRUCTURED_API_KEY"
), # Note: the client API param is "api_key_auth" instead of "api_key"
client=requests.Session(), # Define your own requests session
server_url="https://api.unstructuredapp.io/general/v0/general", # Define your own api url
retry_config=RetryConfig(
strategy="backoff",
retry_connection_errors=True,
backoff=BackoffStrategy(
initial_interval=500,
max_interval=60000,
exponent=1.5,
max_elapsed_time=900000,
),
), # Define your own retry config
)

loader = UnstructuredLoader(
"./example_data/layout-parser-paper.pdf",
partition_via_api=True,
client=client,
split_pdf_page=True,
split_pdf_page_range=[1, 10],
)

docs = loader.load()

print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
API 參考:UnstructuredLoader
INFO: Preparing to split document for partition.
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 10 (10 total)
INFO: Determined optimal split size of 2 pages.
INFO: Partitioning 5 files with 2 page(s) each.
INFO: Partitioning set #1 (pages 1-2).
INFO: Partitioning set #2 (pages 3-4).
INFO: Partitioning set #3 (pages 5-6).
INFO: Partitioning set #4 (pages 7-8).
INFO: Partitioning set #5 (pages 9-10).
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
INFO: Successfully partitioned the document.
``````output
layout-parser-paper.pdf : LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

分塊

UnstructuredLoader 不支援像舊版載入器 UnstructuredFileLoader 和其他載入器那樣使用 mode 作為參數來分組文字。相反,它支援「分塊」。Unstructured 中的分塊與您可能熟悉的基於純文字特徵(例如「\n\n」或「\n」之類的字元序列,可能表示段落邊界或列表項目邊界)形成區塊的其他分塊機制不同。相反,所有文件都使用關於每種文件格式的特定知識進行分割,以將文件分割成語義單元(文件元素),並且只有當單個元素超過所需的最大區塊大小時,我們才需要求助於文字分割。通常,分塊將連續的元素組合起來,形成盡可能大的區塊,而不會超過最大區塊大小。分塊產生 CompositeElement、Table 或 TableChunk 元素序列。每個「區塊」都是這三種類型之一的實例。

請參閱 此頁面 以取得有關分塊選項的更多詳細資訊,但要重現與 mode="single" 相同的行為,您可以設定 chunking_strategy="basic"max_characters=<some-really-big-number>include_orig_elements=False

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
"./example_data/layout-parser-paper.pdf",
chunking_strategy="basic",
max_characters=1000000,
include_orig_elements=False,
)

docs = loader.load()

print("Number of LangChain documents:", len(docs))
print("Length of text in the document:", len(docs[0].page_content))
API 參考:UnstructuredLoader
Number of LangChain documents: 1
Length of text in the document: 42772

載入網頁

UnstructuredLoader 在本機執行時接受 web_url kwarg,它會填充底層 Unstructured 分割url 參數。這允許剖析遠端託管的文件,例如 HTML 網頁。

使用範例

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(web_url="https://www.example.com")
docs = loader.load()

for doc in docs:
print(f"{doc}\n")
API 參考:UnstructuredLoader
page_content='Example Domain' metadata={'category_depth': 0, 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': 'fdaa78d856f9d143aeeed85bf23f58f8'}

page_content='This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.' metadata={'languages': ['eng'], 'parent_id': 'fdaa78d856f9d143aeeed85bf23f58f8', 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'NarrativeText', 'element_id': '3652b8458b0688639f973fe36253c992'}

page_content='More information...' metadata={'category_depth': 0, 'link_texts': ['More information...'], 'link_urls': ['https://www.iana.org/domains/example'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': '793ab98565d6f6d6f3a6d614e3ace2a9'}

API 參考

如需所有 UnstructuredLoader 功能和設定的詳細文件,請前往 API 參考:https://langchain-python.dev.org.tw/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html


此頁面是否對您有幫助?