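Pages are scraped concurrently. Concurrent requests are capped at a reasonable limit, 2 per second by default. If you are not concerned about being a good citizen, or you control the server being scraped, or you don't care about load, you can raise this limit. Note that while this speeds up scraping, it may cause the server to block you. Be careful!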

Overview

Integration details

Class	Package	Local	Serializable	JS support
SitemapLoader	langchain_community	✅	❌	✅

Loader features

Source	Document lazy loading	Native async support
SitemapLoader	✅	❌

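Setup

To access the Sitemap document loader, you need to install the langchain-community integration package.

Credentials

No credentials are needed to run this loader.

If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the code below: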

# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Installation

Install langchain_community.

%pip install -qU langchain-community

Fix notebook asyncio bug

import nest_asyncio

nest_asyncio.apply()

Initialization

Now we can instantiate our loader object and load documents:

from langchain_community.document_loaders.sitemap import SitemapLoader

API reference: SitemapLoader

sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")

Load

docs = sitemap_loader.load()
docs[0]

Fetching pages: 100%|##########| 28/28 [00:04<00:00,  6.83it/s]

Document(metadata={'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}, page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n')

print(docs[0].metadata)

{'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}
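The metadata shown above can be used to select documents after loading. The following is a minimal sketch, assuming the `lastmod` values are ISO 8601 strings as in the output above. The two records are copied from this page's sample outputs, and the cutoff date is an arbitrary choice made for illustration.

```python
from datetime import datetime, timezone

# Metadata copied from the sample outputs on this page; in practice you would
# read these from `doc.metadata` for each loaded document.
metadata_records = [
    {
        "source": "https://api.python.langchain.com/en/stable/",
        "lastmod": "2024-05-15T00:29:42.163001+00:00",
    },
    {
        "source": "https://api.python.langchain.com/en/latest/",
        "lastmod": "2024-02-12T05:26:10.971077+00:00",
    },
]

# Hypothetical cutoff: keep only pages modified on or after 1 March 2024 (UTC).
cutoff = datetime(2024, 3, 1, tzinfo=timezone.utc)

recent_sources = [
    record["source"]
    for record in metadata_records
    if datetime.fromisoformat(record["lastmod"]) >= cutoff
]

print(recent_sources)
```

Not every sitemap provides `lastmod`, so real code would likely need to handle a missing key.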

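You can change the requests_per_second parameter to raise the maximum number of concurrent requests, and use requests_kwargs to pass keyword arguments along when requests are sent.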

sitemap_loader.requests_per_second = 2
# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` issue
sitemap_loader.requests_kwargs = {"verify": False}
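To illustrate what a cap on concurrent requests means in practice, here is a minimal sketch that uses an `asyncio.Semaphore` to bound the number of requests in flight. It is not presented as the loader's actual implementation: `fetch_all`, `max_concurrent`, and the example.com URLs are hypothetical names, and the HTTP request is simulated with a short sleep.

```python
import asyncio


async def fetch_all(urls, max_concurrent=2):
    """Simulate fetching `urls` with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fetch(url):
        nonlocal in_flight, peak
        async with semaphore:  # waits here while the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a real HTTP request
            in_flight -= 1
            return f"content of {url}"

    pages = await asyncio.gather(*(fetch(url) for url in urls))
    return pages, peak


urls = [f"https://example.com/page-{i}" for i in range(6)]
pages, peak = asyncio.run(fetch_all(urls))
print(len(pages), peak)
```

Raising the cap lets more requests overlap, which is why a higher limit speeds up scraping and also increases the load on the target server.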

Lazy load

You can also load the pages lazily to minimize the memory load.

page = []
for doc in sitemap_loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        page = []

Fetching pages: 100%|##########| 28/28 [00:01<00:00, 19.06it/s]
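One detail of the paging pattern above is that a final, partially filled page is left over when the loop ends (28 pages split into pages of 10 leaves 8). A minimal sketch of a helper that also yields that remainder is shown below; `in_pages` is a hypothetical name, and `range(28)` stands in for the documents yielded by `lazy_load()`.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_pages(items: Iterable[T], page_size: int = 10) -> Iterator[List[T]]:
    """Yield lists of up to `page_size` items, including a final partial page."""
    page: List[T] = []
    for item in items:
        page.append(item)
        if len(page) >= page_size:
            yield page
            page = []
    if page:  # flush whatever is left after the loop
        yield page


# 28 stand-in items, matching the 28 pages fetched in the output above.
page_sizes = [len(page) for page in in_pages(range(28), page_size=10)]
print(page_sizes)  # [10, 10, 8]
```

With the loader, the same helper could be applied as `for page in in_pages(sitemap_loader.lazy_load()): ...`, running the paged operation once per yielded list.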

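Filtering sitemap URLs

Sitemaps can be massive files containing thousands of URLs, and often you don't need every one of them. You can filter the URLs by passing a list of strings or regex patterns to the filter_urls parameter. Only URLs that match one of the patterns will be loaded.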

loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest"],
)
documents = loader.load()

documents[0]

Document(page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n', metadata={'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-02-12T05:26:10.971077+00:00', 'changefreq': 'daily', 'priority': '0.9'})
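To make the matching behaviour concrete, here is a minimal sketch of pattern-based URL filtering. It assumes each pattern is applied with `re.match`, which anchors at the start of the URL; that is an assumption about the matching rule rather than a statement of the library's internals. `filter_locs` and the example.com URLs are hypothetical.

```python
import re
from typing import List


def filter_locs(urls: List[str], patterns: List[str]) -> List[str]:
    """Keep URLs that match at least one pattern from the start of the string."""
    return [url for url in urls if any(re.match(p, url) for p in patterns)]


urls = [
    "https://example.com/en/stable/",
    "https://example.com/en/latest/",
    "https://example.com/en/latest/guide.html",
    "https://example.com/blog/en/latest-news",
]

kept = filter_locs(urls, ["https://example.com/en/latest"])
print(kept)
```

Because the patterns are regular expressions, an unescaped `.` matches any character. If a URL must be matched literally, passing it through `re.escape` first is one option.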

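Add custom scraping rules

The SitemapLoader uses beautifulsoup4 for scraping, and by default it scrapes every element on the page. The SitemapLoader constructor accepts a custom scraping function, which lets you tailor the scraping process to your needs; for example, you might want to avoid scraping headers or navigation elements.

The following example shows how to develop and use a custom function that skips navigation and header elements.

Import the beautifulsoup4 library and define the custom function.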

%pip install beautifulsoup4

from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())

Add your custom function to the SitemapLoader object.

loader = SitemapLoader(
    "https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest/"],
    parsing_function=remove_nav_and_header_elements,
)

Local sitemap

The sitemap loader can also be used to load local files.

sitemap_loader = SitemapLoader(web_path="example_data/sitemap.xml", is_local=True)

docs = sitemap_loader.load()
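For reference, the sketch below writes a small sitemap file in the standard sitemaps.org format and reads its `loc` entries back with the standard library, to show what a local sitemap file might contain. The example.com URLs and the temporary file are assumptions for illustration; the contents of `example_data/sitemap.xml` are not shown in this guide.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# A hypothetical two-entry sitemap; only <loc> is required for each <url>.
SITEMAP_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="{SITEMAP_NS}">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-05-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://example.com/docs/</loc>
  </url>
</urlset>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    sitemap_path = Path(tmp_dir) / "sitemap.xml"
    sitemap_path.write_text(SITEMAP_XML, encoding="utf-8")

    # Parse the file and collect every <loc> value in the sitemap namespace.
    root = ET.parse(str(sitemap_path)).getroot()
    locs = [element.text for element in root.iter(f"{{{SITEMAP_NS}}}loc")]

print(locs)
```

A file like this could then be passed as `web_path` together with `is_local=True`. The pages listed in the sitemap are still remote URLs, so loading them would presumably still require network access.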

API reference

For detailed documentation of all SitemapLoader features and configurations, head to the API reference: https://langchain-python.dev.org.tw/api_reference/community/document_loaders/langchain_community.document_loaders.sitemap.SitemapLoader.html#langchain_community.document_loaders.sitemap.SitemapLoader

Related

Document loader conceptual guide
Document loader how-to guides
