Docling
Docling 解析 PDF、DOCX、PPTX、HTML 和其他格式為豐富的統一表示形式,包括文件佈局、表格等,使其準備好用於生成式 AI 工作流程,例如 RAG。
此整合透過 DoclingLoader
文件載入器提供 Docling 的功能。
概觀
所呈現的 DoclingLoader
組件使您能夠
- 輕鬆快速地在您的 LLM 應用程式中使用各種文件類型,以及
- 利用 Docling 的豐富格式進行進階的、文件原生的 grounding。
DoclingLoader
支援兩種不同的匯出模式
ExportType.DOC_CHUNKS
(預設):如果您希望將每個輸入文件分塊,然後將每個單獨的塊捕獲為下游的單獨 LangChain Document,或ExportType.MARKDOWN
:如果您希望將每個輸入文件捕獲為單獨的 LangChain Document
此範例允許透過參數 EXPORT_TYPE
探索兩種模式;根據設定的值,範例管線隨後會相應地設定。
設定
%pip install -qU langchain-docling
Note: you may need to restart the kernel to use updated packages.
為了獲得最佳轉換速度,請盡可能使用 GPU 加速;例如,如果在 Colab 上運行,請使用啟用 GPU 的運行時環境。
初始化
基本初始化如下所示
from langchain_docling import DoclingLoader
FILE_PATH = "https://arxiv.org/pdf/2408.09869"
loader = DoclingLoader(file_path=FILE_PATH)
對於進階使用,DoclingLoader
具有以下參數
file_path
:來源為單一字串 (URL 或本機檔案) 或其可迭代物件converter
(選用):要使用的任何特定 Docling 轉換器實例convert_kwargs
(選用):用於轉換執行的任何特定 kwargsexport_type
(選用):要使用的匯出模式:ExportType.DOC_CHUNKS
(預設) 或ExportType.MARKDOWN
md_export_kwargs
(選用):任何特定的 Markdown 匯出 kwargs (用於 Markdown 模式)chunker
(選用):要使用的任何特定 Docling 分塊器實例 (用於 doc-chunk 模式)meta_extractor
(選用):要使用的任何特定元數據提取器
載入
docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors
注意:訊息顯示
"Token indices sequence length is longer than the specified maximum sequence length..."
在此情況下可以忽略 — 更多細節請參閱 這裡。
檢查一些範例文件
for d in docs[:3]:
print(f"- {d.page_content=}")
- d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'
- d.page_content='Docling Technical Report\nVersion 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
延遲載入
文件也可以延遲方式載入
doc_iter = loader.lazy_load()
for doc in doc_iter:
pass # you can operate on `doc` here
端對端範例
import os
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
- 以下範例管線使用 HuggingFace 的 Inference API;為了增加 LLM 配額,token 可以透過環境變數
HF_TOKEN
提供。- 此管線的依賴項可以如下所示安裝 (
--no-warn-conflicts
用於 Colab 的預先填充 Python 環境;可隨意移除以獲得更嚴格的使用)
%pip install -q --progress-bar off --no-warn-conflicts langchain-core langchain-huggingface langchain_milvus langchain python-dotenv
Note: you may need to restart the kernel to use updated packages.
定義管線參數
from pathlib import Path
from tempfile import mkdtemp
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_docling.loader import ExportType
def _get_env_from_colab_or_os(key):
try:
from google.colab import userdata
try:
return userdata.get(key)
except userdata.SecretNotFoundError:
pass
except ImportError:
pass
return os.getenv(key)
load_dotenv()
HF_TOKEN = _get_env_from_colab_or_os("HF_TOKEN")
FILE_PATH = ["https://arxiv.org/pdf/2408.09869"] # Docling Technical Report
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GEN_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
"Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
API 參考:PromptTemplate
現在我們可以實例化我們的載入器並載入文件
from docling.chunking import HybridChunker
from langchain_docling import DoclingLoader
loader = DoclingLoader(
file_path=FILE_PATH,
export_type=EXPORT_TYPE,
chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)
docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors
決定分割
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
splits = docs
elif EXPORT_TYPE == ExportType.MARKDOWN:
from langchain_text_splitters import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[
("#", "Header_1"),
("##", "Header_2"),
("###", "Header_3"),
],
)
splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
else:
raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
API 參考:MarkdownHeaderTextSplitter
檢查一些範例分割
for d in splits[:3]:
print(f"- {d.page_content=}")
print("...")
- d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'
- d.page_content='Docling Technical Report\nVersion 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
...
擷取
import json
from pathlib import Path
from tempfile import mkdtemp
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus
embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)
milvus_uri = str(Path(mkdtemp()) / "docling.db") # or set as needed
vectorstore = Milvus.from_documents(
documents=splits,
embedding=embedding,
collection_name="docling_demo",
connection_args={"uri": milvus_uri},
index_params={"index_type": "FLAT"},
drop_old=True,
)
API 參考:HuggingFaceEmbeddings | Milvus
RAG
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_huggingface import HuggingFaceEndpoint
retriever = vectorstore.as_retriever(search_kwargs={"k": TOP_K})
llm = HuggingFaceEndpoint(
repo_id=GEN_MODEL_ID,
huggingfacehub_api_token=HF_TOKEN,
task="text-generation",
)
def clip_text(text, threshold=100):
return f"{text[:threshold]}..." if len(text) > threshold else text
question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
resp_dict = rag_chain.invoke({"input": QUESTION})
clipped_answer = clip_text(resp_dict["answer"], threshold=350)
print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{clipped_answer}")
for i, doc in enumerate(resp_dict["context"]):
print()
print(f"Source {i+1}:")
print(f" text: {json.dumps(clip_text(doc.page_content, threshold=350))}")
for key in doc.metadata:
if key != "pk":
val = doc.metadata.get(key)
clipped_val = clip_text(val) if isinstance(val, str) else val
print(f" {key}: {clipped_val}")
Question:
Which are the main AI models in Docling?
Answer:
The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.
Source 1:
text: "3.2 AI models\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re..."
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 2:
text: "3 Processing pipeline\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ..."
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 3:
text: "6 Future work and contributions\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ..."
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
source: https://arxiv.org/pdf/2408.09869
請注意,來源包含豐富的 grounding 資訊,包括段落標題(即章節)、頁面和精確的邊界框。