
PyMuPDF4LLMLoader

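This notebook provides a quick overview for getting started with the PyMuPDF4LLM document loader. For detailed documentation of all PyMuPDF4LLMLoader features and configurations, head to the GitHub repository.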

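Overview

Integration details

| Class | Package | Local | Serializable | JS support |
| --- | --- | --- | --- | --- |
| PyMuPDF4LLMLoader | langchain_pymupdf4llm | | | |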

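Loader features

| Source | Document lazy loading | Native async support | Extract images | Extract tables |
| --- | --- | --- | --- | --- |
| PyMuPDF4LLMLoader | | | | |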

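Setup

To access the PyMuPDF4LLM document loader, you need to install the langchain-pymupdf4llm integration package.

Credentials

No credentials are required to use PyMuPDF4LLMLoader.

If you want automated, best-in-class tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below: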

# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Installation

Install langchain_community and langchain-pymupdf4llm.

%pip install -qU langchain_community langchain-pymupdf4llm
Note: you may need to restart the kernel to use updated packages.

Initialization

Now we can instantiate our loader object and load documents:

from langchain_pymupdf4llm import PyMuPDF4LLMLoader

file_path = "./example_data/layout-parser-paper.pdf"
loader = PyMuPDF4LLMLoader(file_path)

Load

docs = loader.load()
docs[0]
Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'source': './example_data/layout-parser-paper.pdf', 'file_path': './example_data/layout-parser-paper.pdf', 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'trapped': '', 'modDate': 'D:20210622012710Z', 'creationDate': 'D:20210622012710Z', 'page': 0}, page_content='\`\`\`\nLayoutParser: A Unified Toolkit for Deep\n\n## Learning Based Document Image Analysis\n\n\`\`\`\n\nZejiang Shen[1] (�), Ruochen Zhang[2], Melissa Dell[3], Benjamin Charles Germain\nLee[4], Jacob Carlson[3], and Weining Li[5]\n\n1 Allen Institute for AI\n\`\`\`\n              shannons@allenai.org\n\n\`\`\`\n2 Brown University\n\`\`\`\n             ruochen zhang@brown.edu\n\n\`\`\`\n3 Harvard University\n_{melissadell,jacob carlson}@fas.harvard.edu_\n4 University of Washington\n\`\`\`\n              bcgl@cs.washington.edu\n\n\`\`\`\n5 University of Waterloo\n\`\`\`\n              w422li@uwaterloo.ca\n\n\`\`\`\n\n**Abstract. Recent advances in document image analysis (DIA) have been**\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been on-going\nefforts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities. 
This paper introduces LayoutParser, an open-source\nlibrary for streamlining the usage of DL in DIA research and applications. The core LayoutParser library comes with a set of simple and\nintuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks.\nTo promote extensibility, LayoutParser also incorporates a community\nplatform for sharing both pre-trained models and full document digitization pipelines. We demonstrate that LayoutParser is helpful for both\nlightweight and large-scale digitization pipelines in real-word use cases.\n[The library is publicly available at https://layout-parser.github.io.](https://layout-parser.github.io)\n\n**Keywords: Document Image Analysis · Deep Learning · Layout Analysis**\n\n    - Character Recognition · Open Source library · Toolkit.\n\n### 1 Introduction\n\n\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\ndocument image analysis (DIA) tasks including document image classification [11,\n\n')
import pprint

pprint.pp(docs[0].metadata)
{'producer': 'pdfTeX-1.40.21',
'creator': 'LaTeX with hyperref',
'creationdate': '2021-06-22T01:27:10+00:00',
'source': './example_data/layout-parser-paper.pdf',
'file_path': './example_data/layout-parser-paper.pdf',
'total_pages': 16,
'format': 'PDF 1.5',
'title': '',
'author': '',
'subject': '',
'keywords': '',
'moddate': '2021-06-22T01:27:10+00:00',
'trapped': '',
'modDate': 'D:20210622012710Z',
'creationDate': 'D:20210622012710Z',
'page': 0}

Lazy Load

pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pages = []
len(pages)
6
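
The loop above resets the list after every 10 documents, so the 6 documents left over at the end never reach the paged operation. A minimal sketch of a batching helper that also yields the final partial batch follows. Note that iter_batches is a hypothetical helper name, not part of langchain-pymupdf4llm, and range(16) merely stands in for loader.lazy_load() on the 16-page example PDF.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items, including the last partial one."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Assumption: range(16) stands in for loader.lazy_load() on a 16-page PDF.
batch_sizes = [len(batch) for batch in iter_batches(range(16), 10)]
print(batch_sizes)  # [10, 6]
```

With a real loader, each yielded batch could be passed to whatever paged operation is needed (for example an index upsert), and the trailing partial batch is handled by the same code path.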
from IPython.display import Markdown, display

part = pages[0].page_content[778:1189]
print(part)
# Markdown rendering
display(Markdown(part))
pprint.pp(pages[0].metadata)
{'producer': 'pdfTeX-1.40.21',
'creator': 'LaTeX with hyperref',
'creationdate': '2021-06-22T01:27:10+00:00',
'source': './example_data/layout-parser-paper.pdf',
'file_path': './example_data/layout-parser-paper.pdf',
'total_pages': 16,
'format': 'PDF 1.5',
'title': '',
'author': '',
'subject': '',
'keywords': '',
'moddate': '2021-06-22T01:27:10+00:00',
'trapped': '',
'modDate': 'D:20210622012710Z',
'creationDate': 'D:20210622012710Z',
'page': 10}

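The metadata attribute contains at least the following keys:

  • source
  • page (if in page mode)
  • total_pages
  • creationdate
  • creator
  • producer

Additional metadata are specific to each parser. These pieces of information can be helpful (for example, to categorize your PDFs).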
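
As a rough illustration of using these keys to categorize PDFs, the sketch below checks that the expected keys are present and groups sources by one metadata value. It works on a plain dictionary copied from the output shown earlier rather than on real Document objects; REQUIRED_KEYS, missing_keys, and group_by are hypothetical names chosen for this example, and the key spelling total_pages follows the printed metadata.

```python
from collections import defaultdict
from typing import Dict, List

# Keys taken from the metadata printed earlier; "page" only appears in page mode.
REQUIRED_KEYS = {"source", "total_pages", "creationdate", "creator", "producer"}


def missing_keys(metadata: dict) -> set:
    """Return the required keys that are absent from a metadata dict."""
    return REQUIRED_KEYS - set(metadata)


def group_by(metadatas: List[dict], key: str) -> Dict[str, List[str]]:
    """Group document sources by the value of one metadata key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for metadata in metadatas:
        group = str(metadata.get(key, "unknown"))
        groups[group].append(metadata.get("source", "unknown"))
    return dict(groups)


# Subset of the metadata shown in the output above.
sample = {
    "producer": "pdfTeX-1.40.21",
    "creator": "LaTeX with hyperref",
    "creationdate": "2021-06-22T01:27:10+00:00",
    "source": "./example_data/layout-parser-paper.pdf",
    "total_pages": 16,
    "page": 0,
}

print(missing_keys(sample))
print(group_by([sample], "producer"))
```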

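Splitting mode & custom pages delimiter

When loading a PDF file, you can split it in two different ways:

  • By page
  • As a single text flow

By default, PyMuPDF4LLMLoader splits the PDF by page.

Extract the PDF by page. Each page is extracted as a langchain Document object: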

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
)
docs = loader.load()

print(len(docs))
pprint.pp(docs[0].metadata)
16
{'producer': 'pdfTeX-1.40.21',
'creator': 'LaTeX with hyperref',
'creationdate': '2021-06-22T01:27:10+00:00',
'source': './example_data/layout-parser-paper.pdf',
'file_path': './example_data/layout-parser-paper.pdf',
'total_pages': 16,
'format': 'PDF 1.5',
'title': '',
'author': '',
'subject': '',
'keywords': '',
'moddate': '2021-06-22T01:27:10+00:00',
'trapped': '',
'modDate': 'D:20210622012710Z',
'creationDate': 'D:20210622012710Z',
'page': 0}

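In this mode the PDF is split by page, and the metadata of each resulting Document contains the page number under the page key. In some cases, however, you may want to process the PDF as a single text flow, so that paragraphs spanning a page break are not cut in half. In that case you can use the single mode.

Extract the whole PDF as a single langchain Document object:

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
)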
docs = loader.load()

print(len(docs))
pprint.pp(docs[0].metadata)
1
{'producer': 'pdfTeX-1.40.21',
'creator': 'LaTeX with hyperref',
'creationdate': '2021-06-22T01:27:10+00:00',
'source': './example_data/layout-parser-paper.pdf',
'file_path': './example_data/layout-parser-paper.pdf',
'total_pages': 16,
'format': 'PDF 1.5',
'title': '',
'author': '',
'subject': '',
'keywords': '',
'moddate': '2021-06-22T01:27:10+00:00',
'trapped': '',
'modDate': 'D:20210622012710Z',
'creationDate': 'D:20210622012710Z'}

Logically, in this mode the page (page_number) metadata disappears. Here is how to clearly identify where pages end in the text flow.

Add a custom pages_delimiter to identify where pages end in single mode:

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
    pages_delimiter="\n-------THIS IS A CUSTOM END OF PAGE-------\n\n",
)
docs = loader.load()

part = docs[0].page_content[10663:11317]
print(part)
display(Markdown(part))

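The default pages_delimiter is \n-----\n\n. It could instead be simply \n, or \f to clearly indicate a page change, or <!-- PAGE BREAK --> for seamless injection into a Markdown viewer without any visual effect.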
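
When the delimiter is known, page boundaries can be recovered from single-mode text with a plain string split. The sketch below uses a made-up three-page string rather than real loader output. The helper split_pages is hypothetical, and it assumes the delimiter never occurs inside the page text itself.

```python
from typing import List

PAGES_DELIMITER = "\n-------THIS IS A CUSTOM END OF PAGE-------\n\n"


def split_pages(text: str, delimiter: str = PAGES_DELIMITER) -> List[str]:
    """Split single-mode text into per-page strings, dropping one empty trailing piece."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    pieces = text.split(delimiter)
    # A trailing delimiter leaves one empty piece at the end; drop it.
    if pieces and pieces[-1] == "":
        pieces.pop()
    return pieces


# Made-up stand-in for docs[0].page_content in single mode.
text = PAGES_DELIMITER.join(["# Page one", "Page two text", "Page three text"])
page_texts = split_pages(text)
print(len(page_texts))  # 3
```

A distinctive delimiter such as the custom one above makes this split more reliable than a bare \n, which also appears inside ordinary page text.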

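Extract images from the PDF

You can extract images from your PDFs (in text form) with a choice of three different solutions:

  • rapidOCR (a lightweight optical character recognition tool)
  • Tesseract (a high-precision OCR tool)
  • a multimodal language model

The result is inserted at the end of the page's text.

Extract images from the PDF with rapidOCR: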

%pip install -qU rapidocr-onnxruntime pillow
Note: you may need to restart the kernel to use updated packages.
from langchain_community.document_loaders.parsers import RapidOCRBlobParser

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    extract_images=True,
    images_parser=RapidOCRBlobParser(),
)
docs = loader.load()

part = docs[5].page_content[1863:]
print(part)
display(Markdown(part))

Note that RapidOCR is designed to work with Chinese and English, not other languages.

Extract images from the PDF with Tesseract:

%pip install -qU pytesseract
Note: you may need to restart the kernel to use updated packages.
from langchain_community.document_loaders.parsers import TesseractBlobParser

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    extract_images=True,
    images_parser=TesseractBlobParser(),
)
docs = loader.load()

print(docs[5].page_content[1863:])

Extract images from the PDF with a multimodal model:

%pip install -qU langchain_openai
Note: you may need to restart the kernel to use updated packages.
import os

from dotenv import load_dotenv

load_dotenv()
True
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key =")
from langchain_community.document_loaders.parsers import LLMImageBlobParser
from langchain_openai import ChatOpenAI

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    extract_images=True,
    images_parser=LLMImageBlobParser(
        model=ChatOpenAI(model="gpt-4o-mini", max_tokens=1024)
    ),
)
docs = loader.load()

print(docs[5].page_content[1863:])

Extract tables from the PDF

With PyMuPDF4LLM you can extract tables from your PDFs in markdown format:

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    # "lines_strict" is the default strategy and
    # is the most accurate for tables with column and row lines,
    # but may not work well with all documents.
    # "lines" is a less strict strategy that may work better with
    # some documents.
    # "text" is the least strict strategy and may work better
    # with documents that do not have tables with lines.
    table_strategy="lines",
)
docs = loader.load()

part = docs[4].page_content[3210:]
print(part)
display(Markdown(part))
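
Because the tables arrive as Markdown text inside page_content, downstream code that needs structured rows has to parse them. The following is a minimal sketch, assuming simple pipe-delimited tables with a separator row and no escaped pipes inside cells. The sample table is invented for illustration and is not output from the example PDF, and parse_markdown_table is a hypothetical helper.

```python
import re
from typing import Dict, List

SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Parse the first pipe-delimited Markdown table into a list of row dicts."""
    rows: List[List[str]] = []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if not line.startswith("|"):
            if rows:
                break  # the first table has ended
            continue
        # Drop the leading pipe and, if present, the trailing pipe.
        inner = line[1:-1] if len(line) > 1 and line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        if all(SEPARATOR_CELL.match(cell) for cell in cells):
            continue  # skip the |---|---| separator row
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Invented sample text; real page_content will differ.
sample = """Some text before the table.

|File|Pages|
|---|---|
|a.pdf|16|
|b.pdf|3|

Text after the table."""

rows = parse_markdown_table(sample)
for row in rows:
    print(row)
```

All cell values come back as strings, so numeric columns such as Pages would still need converting by the caller.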

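Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how it is loaded. For example, you can use open to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to reuse a given parser regardless of how the data was loaded. You can use this strategy to analyze different files with the same parsing parameters.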

from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_pymupdf4llm import PyMuPDF4LLMParser

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(
        path="./example_data/",
        glob="*.pdf",
    ),
    blob_parser=PyMuPDF4LLMParser(),
)
docs = loader.load()

part = docs[0].page_content[:562]
print(part)
display(Markdown(part))
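
The separation of loading from parsing can also be illustrated without any LangChain imports. The sketch below is a toy analogue, not the GenericLoader implementation: load_blobs, parse_text, and load_documents are hypothetical names, and plain-text files stand in for PDFs so that the example runs with only the standard library.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List, Tuple


def load_blobs(directory: Path, pattern: str) -> Iterator[Tuple[str, bytes]]:
    """Loading step: yield (file name, raw bytes) for each matching file."""
    for path in sorted(directory.glob(pattern)):
        yield path.name, path.read_bytes()


def parse_text(data: bytes) -> str:
    """Parsing step: turn raw bytes into text (here, plain UTF-8 decoding)."""
    return data.decode("utf-8").strip()


def load_documents(
    directory: Path, pattern: str, parser: Callable[[bytes], str]
) -> List[dict]:
    """Combine a blob source with any parser, keeping the two steps independent."""
    return [
        {"source": name, "page_content": parser(data)}
        for name, data in load_blobs(directory, pattern)
    ]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.txt").write_text("first file\n", encoding="utf-8")
    (folder / "b.txt").write_text("second file\n", encoding="utf-8")
    (folder / "notes.md").write_text("ignored\n", encoding="utf-8")
    documents = load_documents(folder, "*.txt", parse_text)

print([doc["source"] for doc in documents])  # ['a.txt', 'b.txt']
```

Swapping parse_text for another function changes how bytes become text without touching the file discovery code, which is the same design idea behind pairing a blob loader with a blob parser.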

API reference

For detailed documentation of all PyMuPDF4LLMLoader features and configurations, head to the GitHub repository: https://github.com/lakinduboteju/langchain-pymupdf4llm

