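How to create a custom document loader

Overview

Applications based on LLMs often need to extract data from databases or files, such as PDFs, and convert it into a format that LLMs can use. In LangChain, this usually involves creating Document objects, which encapsulate the extracted text (page_content) along with metadata: a dictionary containing details about the document, such as the author's name or the date of publication.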

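Document objects are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the Document to generate a desired response (e.g., summarizing the document). Documents can be used immediately or indexed into a vector store for future retrieval and use.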
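As a rough illustration of how documents end up in a prompt, the sketch below joins the text of a few documents into one summarization prompt. `SimpleDocument` and `build_summary_prompt` are hypothetical stand-ins written for this example; they only mirror the `page_content` and `metadata` fields described above and are not part of LangChain.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def build_summary_prompt(docs: list[SimpleDocument]) -> str:
    """Join document texts, tagged with their source, into one prompt string."""
    sections = [
        f"[source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    ]
    return "Summarize the following documents:\n\n" + "\n\n".join(sections)


docs = [
    SimpleDocument("Cats sleep most of the day.", {"source": "cats.txt"}),
    SimpleDocument("Dogs need daily walks.", {"source": "dogs.txt"}),
]
prompt = build_summary_prompt(docs)
print(prompt)
```

Keeping the source in the prompt is one possible design choice: it lets the model, or a reader of its answer, trace a statement back to the document it came from.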

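The main abstractions for document loading are:

Component	Description
Document	Contains text (`page_content`) and `metadata`
BaseLoader	Used to convert raw data into `Documents`
Blob	A representation of binary data that is located either in a file or in memory
BaseBlobParser	Logic to parse a `Blob` to yield `Document` objects

This guide demonstrates how to write custom document loading and file parsing logic; specifically, we'll see how to:

Create a standard document loader by subclassing BaseLoader.
Create a parser using BaseBlobParser and use it in conjunction with Blob and BlobLoaders. This is useful primarily when working with files.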

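Standard Document Loader

A document loader can be implemented by subclassing BaseLoader, which provides a standard interface for loading documents.

Interface

Method Name	Explanation
lazy_load	Used to load documents one by one lazily. Use for production code.
alazy_load	Async variant of `lazy_load`
load	Used to load all documents into memory eagerly. Use for prototyping or interactive work.
aload	Async variant of `load`: loads all documents into memory eagerly. Use for prototyping or interactive work. Added to LangChain in 2024-04.

The load method is a convenience method meant solely for prototyping work; it just invokes list(self.lazy_load()).
alazy_load has a default implementation that delegates to lazy_load. If you're using async, we recommend overriding the default implementation and providing a native async implementation.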
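The relationship between these methods can be sketched in plain Python. `SketchLoader` below is a hypothetical stand-in, not LangChain's BaseLoader, and its async default is deliberately simplified: it only shows the idea of delegating to the synchronous generator.

```python
import asyncio
from typing import AsyncIterator, Iterator


class SketchLoader:
    """Hypothetical stand-in for a base loader; not LangChain's BaseLoader."""

    def lazy_load(self) -> Iterator[str]:
        raise NotImplementedError

    def load(self) -> list[str]:
        # Eager convenience wrapper: drain the lazy generator into a list.
        return list(self.lazy_load())

    async def alazy_load(self) -> AsyncIterator[str]:
        # Simplified default: re-yield the items of the synchronous generator.
        # Each step still blocks the event loop, which is why a native async
        # implementation is preferable for real I/O.
        for item in self.lazy_load():
            yield item


class ThreeLines(SketchLoader):
    def lazy_load(self) -> Iterator[str]:
        yield from ("first", "second", "third")


async def collect(loader: SketchLoader) -> list[str]:
    return [item async for item in loader.alazy_load()]


loader_sketch = ThreeLines()
eager = loader_sketch.load()
via_async = asyncio.run(collect(loader_sketch))
print(eager == via_async)
```

A subclass only has to implement `lazy_load`; the eager and async entry points then come for free, at the cost of the async path not being truly non-blocking.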

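Important

When implementing a document loader, do NOT provide parameters via the lazy_load or alazy_load methods.

All configuration is expected to be passed through the initializer (__init__). This is a design choice made by LangChain to ensure that once a document loader has been instantiated, it has all the information needed to load documents.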
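A minimal sketch of this convention, assuming a hypothetical `ConfiguredLineLoader` with illustrative `encoding` and `skip_blank` options (neither the class nor the options come from LangChain): every setting is fixed in the initializer, so `lazy_load` needs no arguments.

```python
import os
import tempfile
from typing import Iterator


class ConfiguredLineLoader:
    """Hypothetical loader: every option is fixed at construction time."""

    def __init__(
        self, file_path: str, encoding: str = "utf-8", skip_blank: bool = False
    ) -> None:
        self.file_path = file_path
        self.encoding = encoding
        self.skip_blank = skip_blank

    def lazy_load(self) -> Iterator[str]:  # takes no arguments
        with open(self.file_path, encoding=self.encoding) as f:
            for line in f:
                if self.skip_blank and not line.strip():
                    continue
                yield line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("alpha\n\nbeta\n")
    lines = list(ConfiguredLineLoader(path, skip_blank=True).lazy_load())

print(lines)
```

Because the instance carries its full configuration, any code that receives the loader can call `lazy_load()` without knowing how it was set up.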

Implementation

Let's create an example of a standard document loader that loads a file and creates a document from each line in the file.

from typing import AsyncIterator, Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

    # alazy_load is OPTIONAL.
    # If you leave out the implementation, a default implementation which delegates to lazy_load will be used!
    async def alazy_load(
        self,
    ) -> AsyncIterator[Document]:  # <-- Does not take any arguments
        """An async lazy loader that reads a file line by line."""
        # Requires aiofiles (install with pip)
        # https://github.com/Tinche/aiofiles
        import aiofiles

        async with aiofiles.open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            async for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

API Reference: BaseLoader | Document

Test 🧪

To test out the document loader, we need a file with some quality content.

with open("./meow.txt", "w", encoding="utf-8") as f:
    quality_content = "meow meow🐱 \n meow meow🐱 \n meow😻😻"
    f.write(quality_content)

loader = CustomDocumentLoader("./meow.txt")

## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)

<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱 \n' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱 \n' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}

## Test out the async implementation
async for doc in loader.alazy_load():
    print()
    print(type(doc))
    print(doc)

<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱 \n' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱 \n' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}

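Tip

load() can be helpful in an interactive environment such as a Jupyter Notebook.

Avoid using it for production code, since eager loading assumes that all the content can fit into memory, which is not always the case, especially for enterprise data.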

loader.load()

[Document(page_content='meow meow🐱 \n', metadata={'line_number': 0, 'source': './meow.txt'}),
 Document(page_content=' meow meow🐱 \n', metadata={'line_number': 1, 'source': './meow.txt'}),
 Document(page_content=' meow😻😻', metadata={'line_number': 2, 'source': './meow.txt'})]
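To make the memory argument concrete, the sketch below consumes a lazy stream in fixed-size batches, so only one batch is held in memory at a time. `fake_lazy_load` is a hypothetical stand-in for `loader.lazy_load()`, and `batched` is a helper written for this example.

```python
from itertools import islice
from typing import Iterable, Iterator


def fake_lazy_load(total: int) -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(): yields one item at a time."""
    for i in range(total):
        yield f"document {i}"


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Group a stream into lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


batch_sizes = [len(batch) for batch in batched(fake_lazy_load(10), size=4)]
print(batch_sizes)
```

The same pattern applies when indexing documents into a vector store: each batch can be embedded and written before the next one is read.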

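Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use open to read the binary content of either a PDF or a Markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to reuse a given parser regardless of how the data was loaded.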
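One way to picture this separation, using only the standard library: the parser is chosen from the file format, while the bytes themselves may come from anywhere. The `PARSERS` registry and both parser functions are hypothetical names for this sketch, not LangChain APIs.

```python
from pathlib import PurePath
from typing import Callable, Iterator

Parser = Callable[[bytes], Iterator[str]]


def parse_text_lines(data: bytes) -> Iterator[str]:
    """Yield every non-empty line of a plain-text payload."""
    for line in data.decode("utf-8").splitlines():
        if line.strip():
            yield line


def parse_markdown_headings(data: bytes) -> Iterator[str]:
    """Yield only the heading lines of a Markdown payload."""
    for line in data.decode("utf-8").splitlines():
        if line.startswith("#"):
            yield line.lstrip("#").strip()


# Hypothetical registry: the parser depends on the format, not on the loader.
PARSERS: dict[str, Parser] = {
    ".txt": parse_text_lines,
    ".md": parse_markdown_headings,
}


def parse(source: str, data: bytes) -> list[str]:
    """Pick a parser from the file suffix; `data` may come from any storage."""
    return list(PARSERS[PurePath(source).suffix](data))


from_memory = parse("notes.md", b"# Title\nbody text\n## Section\n")
from_text = parse("notes.txt", b"one\n\ntwo\n")
print(from_memory, from_text)
```

Neither parser opens a file, so the same functions work for bytes read from disk, fetched over the network, or built in memory.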

BaseBlobParser

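A BaseBlobParser is an interface that accepts a blob and produces Document objects, yielding them lazily through lazy_parse. A blob is a representation of data that lives either in memory or in a file. LangChain Python has a Blob primitive that is inspired by the Blob Web API specification.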

from typing import Iterator

from langchain_core.document_loaders import BaseBlobParser, Blob
from langchain_core.documents import Document


class MyParser(BaseBlobParser):
    """A simple parser that creates a document from each line."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a blob into a document line by line."""
        line_number = 0
        with blob.as_bytes_io() as f:
            for line in f:
                line_number += 1
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": blob.source},
                )

API Reference: BaseBlobParser | Blob

blob = Blob.from_path("./meow.txt")
parser = MyParser()

list(parser.lazy_parse(blob))

[Document(page_content='meow meow🐱 \n', metadata={'line_number': 1, 'source': './meow.txt'}),
 Document(page_content=' meow meow🐱 \n', metadata={'line_number': 2, 'source': './meow.txt'}),
 Document(page_content=' meow😻😻', metadata={'line_number': 3, 'source': './meow.txt'})]

Using the blob API also allows one to load content directly from memory without having to read it from a file!

blob = Blob(data=b"some data from memory\nmeow")
list(parser.lazy_parse(blob))

[Document(page_content='some data from memory\n', metadata={'line_number': 1, 'source': None}),
 Document(page_content='meow', metadata={'line_number': 2, 'source': None})]

Blob

Let's take a quick look at some of the Blob API.

blob = Blob.from_path("./meow.txt", metadata={"foo": "bar"})

blob.encoding

'utf-8'

blob.as_bytes()

b'meow meow\xf0\x9f\x90\xb1 \n meow meow\xf0\x9f\x90\xb1 \n meow\xf0\x9f\x98\xbb\xf0\x9f\x98\xbb'

blob.as_string()

'meow meow🐱 \n meow meow🐱 \n meow😻😻'

blob.as_bytes_io()

<contextlib._GeneratorContextManager at 0x743f34324450>

blob.metadata

{'foo': 'bar'}

blob.source

'./meow.txt'

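Blob Loaders

While a parser encapsulates the logic needed to parse binary data into documents, blob loaders encapsulate the logic that's necessary to load blobs from a given storage location.

At the moment, LangChain only supports FileSystemBlobLoader.

You can use the FileSystemBlobLoader to load blobs and then use the parser to parse them.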

from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader

blob_loader = FileSystemBlobLoader(path=".", glob="*.mdx", show_progress=True)

API Reference: FileSystemBlobLoader

parser = MyParser()
for blob in blob_loader.yield_blobs():
    for doc in parser.lazy_parse(blob):
        print(doc)
        break

  0%|          | 0/8 [00:00<?, ?it/s]

page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
page_content='# Markdown\n' metadata={'line_number': 1, 'source': 'markdown.mdx'}
page_content='# JSON\n' metadata={'line_number': 1, 'source': 'json.mdx'}
page_content='---\n' metadata={'line_number': 1, 'source': 'pdf.mdx'}
page_content='---\n' metadata={'line_number': 1, 'source': 'index.mdx'}
page_content='# File Directory\n' metadata={'line_number': 1, 'source': 'file_directory.mdx'}
page_content='# CSV\n' metadata={'line_number': 1, 'source': 'csv.mdx'}
page_content='# HTML\n' metadata={'line_number': 1, 'source': 'html.mdx'}

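Generic Loader

LangChain has a GenericLoader abstraction which composes a BlobLoader with a BaseBlobParser.

GenericLoader is meant to provide standardized classmethods that make it easy to use existing BlobLoader implementations. At the moment, only the FileSystemBlobLoader is supported.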
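The composition idea can be sketched without LangChain. `ComposedLoader`, `yield_blobs`, and `first_line_parser` below are hypothetical stand-ins that only mimic the roles of GenericLoader, a blob loader, and a blob parser: one piece produces raw blobs, the other turns each blob into records.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterator, Tuple

RawBlob = Tuple[str, bytes]  # (source, data): hypothetical stand-in for Blob


def yield_blobs(root: str, glob: str) -> Iterator[RawBlob]:
    """Loading step: lazily read every file under `root` matching `glob`."""
    for path in sorted(Path(root).glob(glob)):
        yield path.name, path.read_bytes()


def first_line_parser(blob: RawBlob) -> Iterator[dict]:
    """Parsing step: emit one record holding the first line of the blob."""
    source, data = blob
    first_line = data.decode("utf-8").splitlines()[0]
    yield {"page_content": first_line, "source": source}


class ComposedLoader:
    """Hypothetical composition of a blob source and a parser."""

    def __init__(
        self,
        blobs: Callable[[], Iterator[RawBlob]],
        parser: Callable[[RawBlob], Iterator[dict]],
    ) -> None:
        self.blobs = blobs
        self.parser = parser

    def lazy_load(self) -> Iterator[dict]:
        for blob in self.blobs():
            yield from self.parser(blob)


with tempfile.TemporaryDirectory() as tmp:
    files = [("a.mdx", "# CSV\nbody"), ("b.mdx", "# HTML\nbody"), ("c.txt", "skip")]
    for name, text in files:
        Path(tmp, name).write_text(text, encoding="utf-8")
    loader_demo = ComposedLoader(lambda: yield_blobs(tmp, "*.mdx"), first_line_parser)
    records = list(loader_demo.lazy_load())

print(records)
```

Swapping the blob source or the parser changes one constructor argument and leaves the other side untouched, which is the benefit the composition is meant to deliver.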

from langchain_community.document_loaders.generic import GenericLoader

loader = GenericLoader.from_filesystem(
    path=".", glob="*.mdx", show_progress=True, parser=MyParser()
)

for idx, doc in enumerate(loader.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")

API Reference: GenericLoader

  0%|          | 0/8 [00:00<?, ?it/s]

page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 2, 'source': 'office_file.mdx'}
page_content='>[The Microsoft Office](https://www.office.com/) suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.\n' metadata={'line_number': 3, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 4, 'source': 'office_file.mdx'}
page_content='This covers how to load commonly used file formats including `DOCX`, `XLSX` and `PPTX` documents into a document format that we can use downstream.\n' metadata={'line_number': 5, 'source': 'office_file.mdx'}
... output truncated for demo purposes

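Custom Generic Loader

If you really like creating classes, you can subclass and create a class to encapsulate the logic together.

You can subclass GenericLoader to load content using an existing loader.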

from typing import Any


class MyCustomLoader(GenericLoader):
    @staticmethod
    def get_parser(**kwargs: Any) -> BaseBlobParser:
        """Override this method to associate a default parser with the class."""
        return MyParser()

loader = MyCustomLoader.from_filesystem(path=".", glob="*.mdx", show_progress=True)

for idx, doc in enumerate(loader.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")

  0%|          | 0/8 [00:00<?, ?it/s]

page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 2, 'source': 'office_file.mdx'}
page_content='>[The Microsoft Office](https://www.office.com/) suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.\n' metadata={'line_number': 3, 'source': 'office_file.mdx'}
page_content='\n' metadata={'line_number': 4, 'source': 'office_file.mdx'}
page_content='This covers how to load commonly used file formats including `DOCX`, `XLSX` and `PPTX` documents into a document format that we can use downstream.\n' metadata={'line_number': 5, 'source': 'office_file.mdx'}
... output truncated for demo purposes
