Dedoc

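This example demonstrates how to use Dedoc together with LangChain as a DocumentLoader.

Overview

Dedoc is an open-source library/service that extracts text, tables, attached files, and document structure (for example, titles and list items) from files in a variety of formats.

Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images, and more. The full list of supported formats is available in the Dedoc documentation.

Integration details

Class	Package	Local	Serializable	JS support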
DedocFileLoader	langchain_community	❌	beta	❌
DedocPDFLoader	langchain_community	❌	beta	❌
DedocAPIFileLoader	langchain_community	❌	beta	❌

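Loader features

Methods for lazy loading and async loading are provided, but document loading is in fact executed synchronously.

Source	Document Lazy Loading	Async Support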
DedocFileLoader	❌	❌
DedocPDFLoader	❌	❌
DedocAPIFileLoader	❌	❌
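Because loading runs synchronously, calling a loader directly from async code would block the event loop while a file is parsed. The following is a minimal sketch of one possible workaround, assuming Python 3.9 or later: it hands the blocking `load()` call to a worker thread with `asyncio.to_thread`. `SlowLoader`, `load_in_thread`, and `load_many` are hypothetical names used only for illustration; they are not part of Dedoc or LangChain.

```python
import asyncio
import time


class SlowLoader:
    """Hypothetical stand-in for a loader whose load() blocks."""

    def __init__(self, path):
        self.path = path

    def load(self):
        time.sleep(0.05)  # simulate blocking parsing work
        return [f"document parsed from {self.path}"]


async def load_in_thread(loader):
    # Run the blocking load() in a worker thread so the event loop stays free.
    return await asyncio.to_thread(loader.load)


async def load_many(loaders):
    # Schedule all loads at once; results come back in input order.
    return await asyncio.gather(*(load_in_thread(loader) for loader in loaders))


if __name__ == "__main__":
    print(asyncio.run(load_many([SlowLoader("a.txt"), SlowLoader("b.txt")])))
```

Any object with a blocking `load()` method could be passed to `load_in_thread` in the same way; whether this helps in practice depends on how much of the parsing work releases the GIL.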

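Setup

To access the DedocFileLoader and DedocPDFLoader document loaders, you need to install the dedoc integration package.
To access DedocAPIFileLoader, you need to run the Dedoc service, for example as a Docker container (see the documentation for details):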

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc

Dedoc installation instructions are available in the Dedoc documentation.

# Install package
%pip install --quiet "dedoc[torch]"

Note: you may need to restart the kernel to use updated packages.

Instantiation

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader("./example_data/state_of_the_union.txt")

API Reference: DedocFileLoader

Load

docs = loader.load()
docs[0].page_content[:100]

'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t'

Lazy load

docs = loader.lazy_load()

for doc in docs:
    print(doc.page_content[:100])
    break


Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t
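When only a preview of the first few documents is needed, the `for ... break` pattern can be replaced with `itertools.islice`. This is a small sketch under the assumption that `lazy_load()` returns an ordinary iterator; `take` and `fake_lazy_load` are hypothetical helpers, and the generator merely stands in for a real loader.

```python
from itertools import islice


def take(documents, n):
    """Return at most n items from an iterator, leaving the rest unconsumed."""
    return list(islice(documents, n))


def fake_lazy_load():
    # Hypothetical generator standing in for loader.lazy_load().
    for i in range(1000):
        yield f"document {i}"


preview = take(fake_lazy_load(), 2)
```

With a real loader the call would look like `take(loader.lazy_load(), 2)`.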

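API reference

For detailed information on configuring and calling Dedoc loaders, see the API reference.

Load any file

DedocFileLoader is useful for automatically handling any file in a supported format. The file loader detects the file type automatically, provided the file has the correct extension.

The file parsing process can be configured through dedoc_kwargs when the DedocFileLoader class is initialized. Basic examples of some options are given here; see the DedocFileLoader documentation and the dedoc documentation for more details about the configuration parameters.

Basic example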

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader("./example_data/state_of_the_union.txt")

docs = loader.load()

docs[0].page_content[:400]

API Reference: DedocFileLoader

'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '

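Split modes

DedocFileLoader can split a document into parts of different types (each part is returned separately). The split parameter controls this and accepts the following options:

document (default): the document text is returned as a single langchain Document object (no splitting);
page: the document text is split into pages (works for PDF, DJVU, PPTX, PPT, ODP);
node: the document text is split into Dedoc tree nodes (title nodes, list item nodes, raw text nodes);
line: the document text is split into text lines.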

loader = DedocFileLoader(
    "./example_data/layout-parser-paper.pdf",
    split="page",
    pages=":2",
)

docs = loader.load()

len(docs)
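Since the page mode applies only to paginated formats, one way to pick a split mode per file is to look at the extension. The sketch below is an illustration only: `choose_split` and `PAGINATED_SUFFIXES` are hypothetical names, and the fallback to "document" is an assumption of this example rather than Dedoc behavior.

```python
from pathlib import Path

# Formats that the split-mode list above names as paginated.
PAGINATED_SUFFIXES = {".pdf", ".djvu", ".pptx", ".ppt", ".odp"}


def choose_split(path, fallback="document"):
    """Pick "page" for paginated formats, otherwise the given fallback mode."""
    suffix = Path(path).suffix.lower()
    return "page" if suffix in PAGINATED_SUFFIXES else fallback
```

The result could then be passed on as `split=choose_split(path)` when constructing the loader.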

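Handling tables

DedocFileLoader supports table handling when the with_tables parameter is set to True during loader initialization (with_tables=True by default).

Tables are not split: each table corresponds to one langchain Document object. For tables, the Document object has the additional metadata fields type="table" and text_as_html, which holds the HTML representation of the table.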

loader = DedocFileLoader("./example_data/mlb_teams_2012.csv")

docs = loader.load()

docs[1].metadata["type"], docs[1].metadata["text_as_html"][:200]

('table',
 '<table border="1" style="border-collapse: collapse; width: 100%;">\n<tbody>\n<tr>\n<td colspan="1" rowspan="1">Team</td>\n<td colspan="1" rowspan="1"> &quot;Payroll (millions)&quot;</td>\n<td colspan="1" r')
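The HTML in text_as_html can be turned back into rows of cell text with the standard library alone. This is a minimal sketch, assuming simple tables without nested tables; `TableRows` and `html_table_to_rows` are hypothetical helpers, not part of Dedoc or LangChain, and merged cells (colspan/rowspan greater than 1) are not expanded.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text outside cells (such as newlines between tags) is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(html):
    """Parse table HTML into a list of rows, each a list of stripped cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]
```

For a table document, the call would be `html_table_to_rows(doc.metadata["text_as_html"])`.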

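Handling attached files

DedocFileLoader supports handling of attached files when with_attachments is set to True during loader initialization (with_attachments=False by default).

Attachments are split according to the split parameter. For attachments, the langchain Document object has the additional metadata field type="attachment".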

loader = DedocFileLoader(
    "./example_data/fake-email-attachment.eml",
    with_attachments=True,
)

docs = loader.load()

docs[1].metadata["type"], docs[1].page_content

('attachment',
 '\nContent-Type\nmultipart/mixed; boundary="0000000000005d654405f082adb7"\nDate\nFri, 23 Dec 2022 12:08:48 -0600\nFrom\nMallori Harrell <mallori@unstructured.io>\nMIME-Version\n1.0\nMessage-ID\n<CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com>\nSubject\nFake email with attachment\nTo\nMallori Harrell <mallori@unstructured.io>')
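Because tables and attachments are marked through the type metadata field, the loaded documents can be sorted into buckets before further processing. The sketch below uses a hypothetical `Doc` dataclass as a stand-in that exposes only `page_content` and `metadata`; `group_by_type` is likewise a hypothetical helper, and the "text" label for documents without a type field is an assumption of this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in exposing the two attributes used here."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_type(docs, default="text"):
    """Bucket documents by metadata["type"]; untyped ones go under `default`."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("type", default)].append(doc)
    return dict(groups)
```

Applied to the output of `loader.load()`, `group_by_type(docs).get("attachment", [])` would give the attachment documents only.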

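Loading PDF files

If you only need to process PDF documents, you can use DedocPDFLoader, which supports PDF only. The loader accepts the same parameters for document splitting and for table and attachment extraction.

Dedoc can extract text from PDFs with or without a text layer, and can automatically detect whether a text layer is present and correct. Several PDF handlers are available; you can use the pdf_with_text_layer parameter to choose one of them. See the parameter description for more details.

For PDFs without a text layer, Tesseract OCR and its language packages should be installed. In that case, the installation instructions may be useful.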

from langchain_community.document_loaders import DedocPDFLoader

loader = DedocPDFLoader(
    "./example_data/layout-parser-paper.pdf", pdf_with_text_layer="true", pages="2:2"
)

docs = loader.load()

docs[0].page_content[:400]

API Reference: DedocPDFLoader

'\n2\n\nZ. Shen et al.\n\n37], layout detection [38, 22], table detection [26], and scene text detection [4].\n\nA generalized learning-based framework dramatically reduces the need for the\n\nmanual speciﬁcation of complicated rules, which is the status quo with traditional\n\nmethods. DL has the potential to transform DIA pipelines and beneﬁt a broad\n\nspectrum of large-scale document digitization projects.\n'
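The examples on this page pass pages=":2" and pages="2:2". Reading these as a 1-based "start:end" range with optional bounds is an assumption drawn from those two examples; the sketch below illustrates that reading and is not Dedoc's own parser. `parse_pages` is a hypothetical helper that could be used to validate a value before handing it to a loader.

```python
def parse_pages(spec):
    """Split a "start:end" page spec into (start, end); None means open-ended."""
    start_text, sep, end_text = spec.partition(":")
    if not sep:
        raise ValueError(f"expected 'start:end', got {spec!r}")
    start = int(start_text) if start_text else None
    end = int(end_text) if end_text else None
    if start is not None and end is not None and start > end:
        raise ValueError(f"start page {start} is after end page {end}")
    return start, end
```

Consult the Dedoc parameter description for the authoritative semantics of pages.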

Dedoc API

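If you want to get up and running with less setup, you can use Dedoc as a service. DedocAPIFileLoader can be used without installing the dedoc library. The loader supports the same parameters as DedocFileLoader and also detects input file types automatically.

To use DedocAPIFileLoader, you should run the Dedoc service, for example as a Docker container (see the documentation for details):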

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc

Please do not use our demo URL https://dedoc-readme.hf.space in your own code.

from langchain_community.document_loaders import DedocAPIFileLoader

loader = DedocAPIFileLoader(
    "./example_data/state_of_the_union.txt",
    url="https://dedoc-readme.hf.space",
)

docs = loader.load()

docs[0].page_content[:400]

API Reference: DedocAPIFileLoader

'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '
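When running your own service, it can help to confirm that something is listening before constructing the loader. The following is a minimal sketch, assuming the container is published on localhost port 1231 as in the docker command in this section; `service_port_open` is a hypothetical helper. It only checks that a TCP connection can be opened, not that the listener is actually Dedoc.

```python
import socket
from urllib.parse import urlsplit


def service_port_open(url="http://localhost:1231", timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

If the check passes, the same URL could be supplied as the url argument of DedocAPIFileLoader.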

Related

Document loader conceptual guide
Document loader how-to guides
