
LLM Sherpa

This notebook covers how to use LLM Sherpa to load files of many types. LLM Sherpa supports different file formats, including DOCX, PPTX, HTML, TXT, and XML.

LLMSherpaFileLoader uses LayoutPDFReader, which is part of the LLMSherpa library. This tool is designed to parse PDFs while preserving their layout information, which is often lost when using most PDF-to-text parsers.

Here are some key features of LayoutPDFReader:

  • It can identify and extract sections and subsections along with their levels.
  • It combines lines to form paragraphs.
  • It can identify links between sections and paragraphs.
  • It can extract tables along with the sections the tables are found in.
  • It can identify and extract lists and nested lists.
  • It can join content spread across pages.
  • It can remove repeating headers and footers.
  • It can remove watermarks.

Check out the llmsherpa documentation.

Info: This library can fail on some PDF files, so use it with caution.

# Install package
# !pip install --upgrade --quiet llmsherpa

LLMSherpaFileLoader

Under the hood, LLMSherpaFileLoader defines some strategies to load file content: ["sections", "chunks", "html", "text"]. Set up nlm-ingestor to get llmsherpa_api_url, or use the default.

sections strategy: return the file parsed into sections

from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="sections",
    llmsherpa_api_url="https://127.0.0.1:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
API Reference: LLMSherpaFileLoader
docs[1]
Document(page_content='Abstract\nWe study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.\nThis underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing.\nWe propose STORM, a writing system for the Synthesis of Topic Outlines through\nReferences\nFull-length Article\nTopic\nOutline\n2022 Winter Olympics\nOpening Ceremony\nResearch via Question Asking\nRetrieval and Multi-perspective Question Asking.\nSTORM models the pre-writing stage by\nLLM\n(1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline.\nFor evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage.\nWe further gather feedback from experienced Wikipedia editors.\nCompared to articles generated by an outlinedriven retrieval-augmented baseline, more of STORM’s articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%).\nThe expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.\n1. Can you provide any information about the transportation arrangements for the opening ceremony?\nLLM\n2. Can you provide any information about the budget for the 2022 Winter Olympics opening ceremony?…\nLLM- Role1\nLLM- Role2\nLLM- Role1', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'section_number': 1, 'section_title': 'Abstract'})
len(docs)
79
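Because each section arrives as its own document carrying `section_number` and `section_title` metadata (as in the output above), specific sections are easy to pull out afterwards. A minimal sketch, using plain dictionaries to stand in for the loaded `Document` objects:

```python
# Stand-in documents mimicking the sections-strategy output shown above:
# one dict per section, with 'section_number' and 'section_title' metadata.
docs = [
    {"page_content": "We study how to apply large language models...",
     "metadata": {"section_number": 1, "section_title": "Abstract"}},
    {"page_content": "Writing a long-form article requires research...",
     "metadata": {"section_number": 2, "section_title": "Introduction"}},
]

def sections_titled(docs, title):
    """Return only the sections whose title matches exactly."""
    return [d for d in docs if d["metadata"]["section_title"] == title]

abstract = sections_titled(docs, "Abstract")
print(len(abstract))  # 1
```

The same filter works on real loader output by reading `doc.metadata["section_title"]` instead of the dictionary keys.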

chunks strategy: return the file parsed into chunks

from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="chunks",
    llmsherpa_api_url="https://127.0.0.1:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
API Reference: LLMSherpaFileLoader
docs[1]
Document(page_content='Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'chunk_number': 1, 'chunk_type': 'para'})
len(docs)
306
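With the chunks strategy each document records a `chunk_type` in its metadata (`'para'` in the output above). One quick way to inspect a large result set is to tally those types. A sketch with stand-in values; only `'para'` appears in the output above, so the other type names here are hypothetical:

```python
from collections import Counter

# Stand-in chunk_type values; on real output, collect them with
# [doc.metadata["chunk_type"] for doc in docs].
chunk_types = ["para", "para", "table", "list_item", "para"]

counts = Counter(chunk_types)
print(counts.most_common(1))  # [('para', 3)]
```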

html strategy: return the file as a single HTML document

from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="html",
    llmsherpa_api_url="https://127.0.0.1:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
API Reference: LLMSherpaFileLoader
docs[0].page_content[:400]
'<html><h1>Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models</h1><table><th><td colSpan=1>Yijia Shao</td><td colSpan=1>Yucheng Jiang</td><td colSpan=1>Theodore A. Kanell</td><td colSpan=1>Peter Xu</td></th><tr><td colSpan=1></td><td colSpan=1>Omar Khattab</td><td colSpan=1>Monica S. Lam</td><td colSpan=1></td></tr></table><p>Stanford University {shaoyj, yuchengj, '
len(docs)
1
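Since the html strategy returns the whole file as one HTML string, any HTML parser can post-process `docs[0].page_content`. A sketch using the standard library's `html.parser` to collect heading text from a snippet like the one shown above:

```python
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")

class HeadingCollector(HTMLParser):
    """Collect the text content of <h1>..<h6> elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data

page_html = "<html><h1>Assisting in Writing Wikipedia-like Articles</h1><p>...</p></html>"
collector = HeadingCollector()
collector.feed(page_html)
print(collector.headings)  # ['Assisting in Writing Wikipedia-like Articles']
```

On real output, feed `docs[0].page_content` instead of the sample string to recover the document's section outline.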

text strategy: return the file as a single text document

from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="text",
    llmsherpa_api_url="https://127.0.0.1:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
API Reference: LLMSherpaFileLoader
docs[0].page_content[:400]
'Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\n | Yijia Shao | Yucheng Jiang | Theodore A. Kanell | Peter Xu\n | --- | --- | --- | ---\n |  | Omar Khattab | Monica S. Lam | \n\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu\nAbstract\nWe study how to apply large language models to write grounded and organized long'
len(docs)
1
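The text strategy hands back one long document, so it usually needs to be split before indexing. In practice you would likely reach for a LangChain text splitter; this stdlib-only sketch just illustrates the idea of greedily packing the newline-separated lines into size-bounded chunks:

```python
text = (
    "Assisting in Writing Wikipedia-like Articles From Scratch\n"
    "Abstract\n"
    "We study how to apply large language models to write long-form articles."
)

def split_lines(text, max_chars=80):
    """Greedily pack newline-separated lines into chunks of at most
    max_chars (a single line longer than the limit stays whole)."""
    chunks, current = [], ""
    for line in text.split("\n"):
        if current and len(current) + 1 + len(line) > max_chars:
            chunks.append(current)
            current = line
        else:
            current = f"{current}\n{line}" if current else line
    if current:
        chunks.append(current)
    return chunks

chunks = split_lines(text, max_chars=60)
print(len(chunks))  # 3
```

On real output, pass `docs[0].page_content` as `text`; each resulting chunk can then be wrapped back into its own document.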
