Unstructured
這個筆記本涵蓋了如何使用 Unstructured
文件載入器來載入多種類型的文件。 Unstructured
目前支援載入文本文件、PowerPoint、HTML、PDF、圖像等等。
請參閱 此指南,以取得有關在本地設定 Unstructured 的更多說明,包括設定所需的系統相依性。
概述
整合細節
類別 | 套件 | 本地 | 可序列化 | JS 支援 |
---|---|---|---|---|
UnstructuredLoader | langchain_unstructured | ✅ | ❌ | ✅ |
載入器功能
來源 | 文件延遲載入 | 原生非同步支援 |
---|---|---|
UnstructuredLoader | ✅ | ❌ |
設定
憑證
預設情況下,langchain-unstructured
安裝一個較小的佔用空間,這需要將分割邏輯轉移到 Unstructured API,這需要 API 金鑰。 如果您使用本地安裝,則不需要 API 金鑰。 要取得您的 API 金鑰,請前往 此網站 並取得 API 金鑰,然後在下面的儲存格中設定它
import getpass
import os
if "UNSTRUCTURED_API_KEY" not in os.environ:
os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass(
"Enter your Unstructured API key: "
)
安裝
一般安裝
執行此筆記本的其餘部分需要以下套件。
# Install package, compatible with API partitioning
%pip install --upgrade --quiet langchain-unstructured unstructured-client unstructured "unstructured[pdf]" python-magic
本地安裝
如果您想在本地執行分割邏輯,您將需要安裝系統相依性的組合,如 此處的 Unstructured 文件中所述。
例如,在 Mac 上,您可以使用以下命令安裝所需的相依性
# base dependencies
brew install libmagic poppler tesseract
# If parsing xml / html documents:
brew install libxml2 libxslt
您可以使用以下命令安裝本地所需的 pip
相依性
pip install "langchain-unstructured[local]"
初始化
UnstructuredLoader
允許從各種不同的文件類型載入。 要閱讀所有關於 unstructured
套件的資訊,請參閱他們的 文件/。 在此範例中,我們展示了從文字文件和 PDF 文件載入。
from langchain_unstructured import UnstructuredLoader
file_paths = [
"./example_data/layout-parser-paper.pdf",
"./example_data/state_of_the_union.txt",
]
loader = UnstructuredLoader(file_paths)
載入
docs = loader.load()
docs[0]
INFO: pikepdf C++ to Python logger bridge initialized
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')
print(docs[0].metadata)
{'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}
延遲載入
pages = []
for doc in loader.lazy_load():
pages.append(doc)
pages[0]
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')
後處理
如果您需要在提取後對 unstructured
元素進行後處理,您可以將 str
-> str
函數的清單傳遞給您實例化 UnstructuredLoader
時的 post_processors
kwarg。 這也適用於其他 Unstructured 載入器。 以下是一個範例。
from langchain_unstructured import UnstructuredLoader
from unstructured.cleaners.core import clean_extra_whitespace
loader = UnstructuredLoader(
"./example_data/layout-parser-paper.pdf",
post_processors=[clean_extra_whitespace],
)
docs = loader.load()
docs[5:10]
[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]
Unstructured API
如果您想使用較小的套件啟動並執行,並獲得最新的分割,您可以 pip install unstructured-client
和 pip install langchain-unstructured
。 有關 UnstructuredLoader
的更多資訊,請參閱 Unstructured 提供者頁面。
當您傳入 api_key
並設定 partition_via_api=True
時,載入器將使用託管的 Unstructured serverless API 來處理您的文件。您可以在此處產生免費的 Unstructured API 金鑰。
如果您想自行託管 Unstructured API 或在本地端執行,請查看此處的說明。
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
file_path="example_data/fake.docx",
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_via_api=True,
)
docs = loader.load()
docs[0]
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
Document(metadata={'source': 'example_data/fake.docx', 'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': '56d531394823d81787d77a04462ed096'}, page_content='Lorem ipsum dolor sit amet.')
您也可以使用 UnstructuredLoader
,透過單一 API 在 Unstructured API 中批次處理多個檔案。
loader = UnstructuredLoader(
file_path=["example_data/fake.docx", "example_data/fake-email.eml"],
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_via_api=True,
)
docs = loader.load()
print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
print(docs[-1].metadata["filename"], ": ", docs[-1].page_content[:100])
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
``````output
fake.docx : Lorem ipsum dolor sit amet.
fake-email.eml : Violets are blue
Unstructured SDK Client
使用 Unstructured API 進行分割依賴於 Unstructured SDK Client。
如果您想要自訂客戶端,您必須將 UnstructuredClient
實例傳遞給 UnstructuredLoader
。以下範例展示如何自訂客戶端的功能,例如使用您自己的 requests.Session()
、傳遞替代的 server_url
以及自訂 RetryConfig
物件。 有關自訂客戶端或 SDK 客戶端接受哪些其他參數的更多資訊,請參閱 Unstructured Python SDK 文件和 API 參數 文件的客戶端章節。請注意,所有 API 參數都應傳遞給 UnstructuredLoader
。
import requests
from langchain_unstructured import UnstructuredLoader
from unstructured_client import UnstructuredClient
from unstructured_client.utils import BackoffStrategy, RetryConfig
client = UnstructuredClient(
api_key_auth=os.getenv(
"UNSTRUCTURED_API_KEY"
), # Note: the client API param is "api_key_auth" instead of "api_key"
client=requests.Session(), # Define your own requests session
server_url="https://api.unstructuredapp.io/general/v0/general", # Define your own api url
retry_config=RetryConfig(
strategy="backoff",
retry_connection_errors=True,
backoff=BackoffStrategy(
initial_interval=500,
max_interval=60000,
exponent=1.5,
max_elapsed_time=900000,
),
), # Define your own retry config
)
loader = UnstructuredLoader(
"./example_data/layout-parser-paper.pdf",
partition_via_api=True,
client=client,
split_pdf_page=True,
split_pdf_page_range=[1, 10],
)
docs = loader.load()
print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
INFO: Preparing to split document for partition.
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 10 (10 total)
INFO: Determined optimal split size of 2 pages.
INFO: Partitioning 5 files with 2 page(s) each.
INFO: Partitioning set #1 (pages 1-2).
INFO: Partitioning set #2 (pages 3-4).
INFO: Partitioning set #3 (pages 5-6).
INFO: Partitioning set #4 (pages 7-8).
INFO: Partitioning set #5 (pages 9-10).
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
INFO: Successfully partitioned the document.
``````output
layout-parser-paper.pdf : LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
分塊 (Chunking)
UnstructuredLoader
不支援像舊的載入器 UnstructuredFileLoader
和其他載入器那樣使用 mode
作為分組文字的參數。它改為支援「分塊」(chunking)。 Unstructured 中的分塊與您可能熟悉的、基於純文字特徵(例如 "\n\n" 或 "\n" 之類的字元序列,可能表示段落邊界或列表項目邊界)形成塊的其他分塊機制不同。 相反,所有文件都使用關於每種文件格式的特定知識進行分割,將文件分割成語義單元(文件元素),並且只有在單個元素超過所需的最大塊大小時,我們才需要求助於文本分割。 通常,分塊組合連續的元素以形成盡可能大的塊,而不超過最大塊大小。 分塊產生 CompositeElement、Table 或 TableChunk 元素的序列。 每個「塊」都是這三種類型之一的實例。
有關分塊選項的更多詳細信息,請參閱此頁面,但要重現與 mode="single"
相同的行為,您可以設定 chunking_strategy="basic"
、max_characters=<some-really-big-number>
和 include_orig_elements=False
。
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
"./example_data/layout-parser-paper.pdf",
chunking_strategy="basic",
max_characters=1000000,
include_orig_elements=False,
)
docs = loader.load()
print("Number of LangChain documents:", len(docs))
print("Length of text in the document:", len(docs[0].page_content))
Number of LangChain documents: 1
Length of text in the document: 42772
載入網頁
UnstructuredLoader
在本地端執行時接受 web_url
kwarg,該 kwarg 填充底層 Unstructured partition 的 url
參數。 這允許解析遠端託管的文件,例如 HTML 網頁。
使用範例
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(web_url="https://www.example.com")
docs = loader.load()
for doc in docs:
print(f"{doc}\n")
page_content='Example Domain' metadata={'category_depth': 0, 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': 'fdaa78d856f9d143aeeed85bf23f58f8'}
page_content='This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.' metadata={'languages': ['eng'], 'parent_id': 'fdaa78d856f9d143aeeed85bf23f58f8', 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'NarrativeText', 'element_id': '3652b8458b0688639f973fe36253c992'}
page_content='More information...' metadata={'category_depth': 0, 'link_texts': ['More information...'], 'link_urls': ['https://www.iana.org/domains/example'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': '793ab98565d6f6d6f3a6d614e3ace2a9'}
API 參考
有關所有 UnstructuredLoader
功能和配置的詳細文件,請前往 API 參考: https://langchain-python.dev.org.tw/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html