
Recursive URL

The RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.

Overview

Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| RecursiveUrlLoader | langchain_community | ✅ | ❌ | ✅ |

Loader features

| Source | Document Lazy Loading | Native Async Support |
| :--- | :---: | :---: |
| RecursiveUrlLoader | ✅ | ✅ |

Setup

Credentials

No credentials are required to use the RecursiveUrlLoader.

Installation

The RecursiveUrlLoader lives in the langchain-community package. There are no other required packages, though you will get richer default Document metadata if beautifulsoup4 is also installed.

%pip install -qU langchain-community beautifulsoup4 lxml

Instantiation

Now we can instantiate our document loader object and load Documents:

from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)
API Reference: RecursiveUrlLoader

Load

Use .load() to synchronously load all Documents into memory, one Document per visited URL. Starting from the initial URL, we recurse through all linked URLs up to the specified max_depth.

Let's run through a basic example of how to use the RecursiveUrlLoader on the Python 3.9 docs:

docs = loader.load()
docs[0].metadata
/Users/bagatur/.pyenv/versions/3.9.1/lib/python3.9/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
k = self.parse_starttag(i)
{'source': 'https://docs.python.org/3.9/',
 'content_type': 'text/html',
 'title': '3.9.19 Documentation',
 'language': None}

Great! The first Document looks like the root page we started from. Let's look at the metadata of the next Document:

docs[1].metadata
{'source': 'https://docs.python.org/3.9/using/index.html',
 'content_type': 'text/html',
 'title': 'Python Setup and Usage — Python 3.9.19 documentation',
 'language': None}

That URL looks like a child of our root page, great! Let's move on from metadata and examine the content of one of our Documents:

print(docs[0].page_content[:300])

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">

<link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
<link rel=

That certainly looks like HTML from the URL https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations on our basic example that can be helpful in different situations.

Lazy loading

If we're loading a large number of Documents and our downstream operations can run on a subset of all loaded Documents, we can lazily load Documents one at a time to minimize our memory footprint:

pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pages = []
/var/folders/4j/2rz3865x6qg07tx43146py8h0000gn/T/ipykernel_73962/2110507528.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
soup = BeautifulSoup(html, "lxml")

In this example we never have more than 10 Documents loaded into memory at a time.

Adding an extractor

By default, the loader sets the raw HTML from each link as the Document page content. To parse this HTML into a more human- and LLM-friendly format, you can pass in a custom extractor method:

import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=bs4_extractor)
docs = loader.load()
print(docs[0].page_content[:200])
/var/folders/td/vzm913rx77x21csd90g63_7c0000gn/T/ipykernel_10935/1083427287.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
soup = BeautifulSoup(html, "lxml")
/Users/isaachershenson/.pyenv/versions/3.11.9/lib/python3.11/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
k = self.parse_starttag(i)
3.9.19 Documentation

Download
Download these documents
Docs by version

Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (securit

That looks much better!

You can similarly pass in a metadata_extractor to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
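As a concrete illustration, here is a stdlib-only sketch of a custom metadata extractor; the function name and regex below are ours for the example, not part of the library. The extractor receives the raw HTML and the source URL (recent versions of the loader also pass the requests.Response as a third argument, and inspect your function's signature, so a two-argument extractor still works):

```python
import re


def simple_metadata_extractor(raw_html: str, url: str) -> dict:
    # Pull the page title out of the raw HTML with a plain regex.
    match = re.search(r"<title[^>]*>(.*?)</title>", raw_html, re.IGNORECASE | re.DOTALL)
    return {"source": url, "title": match.group(1).strip() if match else ""}


# Hook it in (construction only -- no requests are made until .load()):
# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/",
#     metadata_extractor=simple_metadata_extractor,
# )
```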

API reference

These examples show just a few of the ways you can modify the default RecursiveUrlLoader, but there are many more modifications you can make to best fit your use case. The link_regex and exclude_dirs parameters can help you filter out unwanted URLs, aload() and alazy_load() can be used for async loading, and more.
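To see what a link_regex filter would keep, here is a small stdlib-only sketch. The pattern and sample HTML are made up for the example; the loader applies the pattern to each page's raw HTML with re.findall and treats the first capture group as the link:

```python
import re

# Keep only links under the /3.9/library/ section (illustrative pattern).
LINK_RE = re.compile(r'href=["\']([^"\']*3\.9/library/[^"\']*)["\']')

sample_html = """
<a href="https://docs.python.org/3.9/library/os.html">os</a>
<a href="https://docs.python.org/3.9/whatsnew/3.9.html">what's new</a>
<a href="https://docs.python.org/3.9/library/sys.html">sys</a>
"""

# findall returns the capture group, i.e. the href values that matched.
kept = LINK_RE.findall(sample_html)

# To use it with the loader, pass the pattern string, e.g.:
# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/",
#     link_regex=LINK_RE.pattern,
#     exclude_dirs=("https://docs.python.org/3.9/faq/",),
# )
```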

For detailed information on configuring and calling the RecursiveUrlLoader, please see the API reference: https://langchain-python.dev.org.tw/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html

