
Recursive URL

The RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.

Overview

Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| RecursiveUrlLoader | langchain_community | ✅ | ❌ | ✅ |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| :--- | :---: | :---: |
| RecursiveUrlLoader | ✅ | ✅ |

Setup

Credentials

No credentials are required to use the RecursiveUrlLoader.

Installation

The RecursiveUrlLoader lives in the langchain-community package. There are no other required packages, though you will get richer default Document metadata if beautifulsoup4 is also installed.

%pip install -qU langchain-community beautifulsoup4 lxml

Instantiation

Now we can instantiate our document loader object and load Documents:

from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)
API Reference: RecursiveUrlLoader

Load

Use .load() to synchronously load all Documents into memory, one Document per visited URL. Starting from the initial URL, we recurse through all linked URLs up to the specified max_depth.

Let's run through a basic example of how to use the RecursiveUrlLoader on the Python 3.9 docs:

docs = loader.load()
docs[0].metadata
/Users/bagatur/.pyenv/versions/3.9.1/lib/python3.9/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
k = self.parse_starttag(i)
{'source': 'https://docs.python.org/3.9/',
 'content_type': 'text/html',
 'title': '3.9.19 Documentation',
 'language': None}

Great! The first Document looks like the root page we started from. Let's look at the metadata of the next Document:

docs[1].metadata
{'source': 'https://docs.python.org/3.9/using/index.html',
 'content_type': 'text/html',
 'title': 'Python Setup and Usage — Python 3.9.19 documentation',
 'language': None}

That URL looks like a child of our root page, great! Let's move on from metadata and examine the content of one of our Documents:

print(docs[0].page_content[:300])

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">

<link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
<link rel=

That certainly looks like HTML from the URL https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations on our basic example that can be helpful in different situations.

Lazy loading

If we're loading a large number of Documents and our downstream operations can run on a subset of all loaded Documents, we can lazily load Documents one at a time to minimize our memory footprint:

pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pages = []
/var/folders/4j/2rz3865x6qg07tx43146py8h0000gn/T/ipykernel_73962/2110507528.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
soup = BeautifulSoup(html, "lxml")

In this example we never have more than 10 Documents loaded into memory at a time.

Adding an extractor

By default, the loader sets the raw HTML from each link as the Document page content. To parse this HTML into a more human- and LLM-friendly format, you can pass in a custom extractor method:

import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=bs4_extractor)
docs = loader.load()
print(docs[0].page_content[:200])
/var/folders/td/vzm913rx77x21csd90g63_7c0000gn/T/ipykernel_10935/1083427287.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
soup = BeautifulSoup(html, "lxml")
/Users/isaachershenson/.pyenv/versions/3.11.9/lib/python3.11/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
k = self.parse_starttag(i)
3.9.19 Documentation

Download
Download these documents
Docs by version

Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (securit

That looks much better!

You can similarly pass in a metadata_extractor to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
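As a concrete illustration, here is a stdlib-only sketch of a custom metadata extractor; the function name and regex below are ours for the example, not part of the library. The extractor receives the raw HTML and the source URL (recent versions of the loader also pass the requests.Response as a third argument, and inspect your function's signature, so a two-argument extractor still works):

```python
import re


def simple_metadata_extractor(raw_html: str, url: str) -> dict:
    # Pull the page title out of the raw HTML with a plain regex.
    match = re.search(r"<title[^>]*>(.*?)</title>", raw_html, re.IGNORECASE | re.DOTALL)
    return {"source": url, "title": match.group(1).strip() if match else ""}


# Hook it in (construction only -- no requests are made until .load()):
# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/",
#     metadata_extractor=simple_metadata_extractor,
# )
```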

API reference

These examples show just a few of the ways you can modify the default RecursiveUrlLoader, but there are many more modifications you can make to best fit your use case. The link_regex and exclude_dirs parameters can help you filter out unwanted URLs, aload() and alazy_load() can be used for async loading, and more.
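To see what a link_regex filter would keep, here is a small stdlib-only sketch. The pattern and sample HTML are made up for the example; the loader applies the pattern to each page's raw HTML with re.findall and treats the first capture group as the link:

```python
import re

# Keep only links under the /3.9/library/ section (illustrative pattern).
LINK_RE = re.compile(r'href=["\']([^"\']*3\.9/library/[^"\']*)["\']')

sample_html = """
<a href="https://docs.python.org/3.9/library/os.html">os</a>
<a href="https://docs.python.org/3.9/whatsnew/3.9.html">what's new</a>
<a href="https://docs.python.org/3.9/library/sys.html">sys</a>
"""

# findall returns the capture group, i.e. the href values that matched.
kept = LINK_RE.findall(sample_html)

# To use it with the loader, pass the pattern string, e.g.:
# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/",
#     link_regex=LINK_RE.pattern,
#     exclude_dirs=("https://docs.python.org/3.9/faq/",),
# )
```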

For detailed information on configuring and calling the RecursiveUrlLoader, please see the API reference: https://langchain-python.dev.org.tw/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html

