WebBaseLoader
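This section covers how to use WebBaseLoader to load all text from HTML webpages into a document format that can be used downstream. For more custom logic when loading webpages, see subclass examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.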
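To make "load all text from an HTML page into a document" concrete, here is a minimal sketch that uses only the Python standard library. It is not how WebBaseLoader is implemented (the loader relies on BeautifulSoup); `SimpleDocument`, `TextExtractor`, and `html_to_document` are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_document(html: str, source: str) -> SimpleDocument:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return SimpleDocument("\n".join(parser.chunks), {"source": source})


sample = (
    "<html><head><title>Example Domain</title><style>p{}</style></head>"
    "<body><h1>Example Domain</h1><p>For use in examples.</p></body></html>"
)
doc = html_to_document(sample, "https://www.example.com/")
print(doc.page_content)
```

The title text appears in the extracted content alongside the body text, which is also what the loader output later in this page shows for `example.com`.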
If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using FireCrawlLoader or the faster option SpiderLoader.
Overview
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
WebBaseLoader | langchain_community | ✅ | ❌ | ❌ |
Loader features
Source | Document lazy loading | Native async support |
---|---|---|
WebBaseLoader | ✅ | ✅ |
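Setup
Credentials
WebBaseLoader does not require any credentials.
Installation
To use WebBaseLoader, you first need to install the langchain-community Python package. The command below also installs beautifulsoup4, which is used to parse the fetched pages.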
%pip install -qU langchain_community beautifulsoup4
Initialization
Now we can instantiate the loader object and load documents:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.example.com/")
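To bypass SSL verification errors during fetching, you can set the "verify" option. Disabling verification removes protection against tampered connections, so use it only for hosts you trust: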
loader.requests_kwargs = {'verify':False}
Initialization with multiple pages
You can also pass in a list of pages to load from.
loader_multiple_pages = WebBaseLoader(
    ["https://www.example.com/", "https://google.com"]
)
Load
docs = loader.load()
docs[0]
Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')
print(docs[0].metadata)
{'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}
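The metadata above contains the source URL, the page title, and a language field that falls back to a placeholder string when the page declares none. The sketch below shows one way such fields could be derived from raw HTML using only the standard library. It is an illustration under assumptions, not the loader's actual code: `MetadataParser` and `build_metadata` are hypothetical names, and the lookup rules (`<title>`, `<meta name="description">`, `<html lang>`) are inferred from the outputs shown on this page.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Pulls <title>, <meta name="description"> and <html lang> from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.meta["language"] = attrs["lang"]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.meta["description"] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data


def build_metadata(html: str, source: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    metadata = {"source": source, **parser.meta}
    # Assumption: mirror the placeholder seen in the output above
    # when the page does not declare a language.
    metadata.setdefault("language", "No language found.")
    return metadata


html = "<html><head><title>Example Domain</title></head><body></body></html>"
metadata = build_metadata(html, "https://www.example.com/")
print(metadata)
```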
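Load multiple URLs concurrently
You can speed up the scraping process by fetching and parsing multiple URLs concurrently.
Concurrent requests are limited by default to 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about its load, you can change the requests_per_second parameter to raise the maximum number of concurrent requests. Note that while this speeds up scraping, it may cause the server to block you. Be careful!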
%pip install -qU nest_asyncio
# fixes a bug with asyncio and jupyter
import nest_asyncio
nest_asyncio.apply()
Note: you may need to restart the kernel to use updated packages.
loader = WebBaseLoader(["https://www.example.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs
Fetching pages: 100%|###########################################################################| 2/2 [00:00<00:00, 8.28it/s]
[Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
Document(metadata={'source': 'https://google.com', 'title': 'Google', 'description': "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.", 'language': 'en'}, page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced search5 ways Gemini can help during the HolidaysAdvertisingBusiness SolutionsAbout Google© 2024 - Privacy - Terms ')]
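The idea of limiting concurrent requests can be sketched with a semaphore. The example below is an assumption-laden illustration, not the loader's implementation: `fetch_all` and its inner `fetch` are hypothetical, the HTTP request is replaced by a short `asyncio.sleep`, and `max_concurrent` merely plays a role analogous to `requests_per_second`.

```python
import asyncio


async def fetch_all(urls, max_concurrent=2):
    """Simulate fetching URLs while capping the number of in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fetch(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for a real HTTP request
            in_flight -= 1
            return f"<html>{url}</html>"

    # gather() returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    return pages, peak


urls = [f"https://www.example.com/page{i}" for i in range(6)]
pages, peak = asyncio.run(fetch_all(urls, max_concurrent=2))
print(len(pages), "pages fetched; peak concurrency:", peak)
```

Inside a Jupyter notebook an event loop is already running, so `asyncio.run` would raise an error there unless `nest_asyncio.apply()` has been called, as in the cell above; in a plain script the sketch runs as written.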
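Loading an XML file, or using a different BeautifulSoup parser
You can set default_parser to choose which BeautifulSoup parser is used, as in the XML example below. BeautifulSoup's "xml" parser requires the lxml package to be installed. For another example of this feature, see SitemapLoader, which loads sitemap files.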
loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
loader.default_parser = "xml"
docs = loader.load()
docs
[Document(metadata={'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}, page_content='\n\n10\nEnergy\n3\n2018-01-01\n2018-01-01\nfalse\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n§ 431.86\nSection § 431.86\n\nEnergy\nDEPARTMENT OF ENERGY\nENERGY CONSERVATION\nENERGY EFFICIENCY PROGRAM FOR CERTAIN COMMERCIAL AND INDUSTRIAL EQUIPMENT\nCommercial Packaged Boilers\nTest Procedures\n\n\n\n\n§\u2009431.86\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n(a) Scope. This section provides test procedures, pursuant to the Energy Policy and Conservation Act (EPCA), as amended, which must be followed for measuring the combustion efficiency and/or thermal efficiency of a gas- or oil-fired commercial packaged boiler.\n(b) Testing and Calculations. Determine the thermal efficiency or combustion efficiency of commercial packaged boilers by conducting the appropriate test procedure(s) indicated in Table 1 of this section.\n\nTable 1—Test Requirements for Commercial Packaged Boiler Equipment Classes\n\nEquipment category\nSubcategory\nCertified rated inputBtu/h\n\nStandards efficiency metric(§\u2009431.87)\n\nTest procedure(corresponding to\nstandards efficiency\nmetric required\nby §\u2009431.87)\n\n\n\nHot Water\nGas-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nGas-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nHot Water\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nHot Water\nOil-fired\n>2,500,000\nCombustion Efficiency\nAppendix A, Section 3.\n\n\nSteam\nGas-fired (all*)\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nGas-fired (all*)\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix 
A, Section 3 with Section 2.4.3.2.\n\n\n\nSteam\nOil-fired\n≥300,000 and ≤2,500,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\nSteam\nOil-fired\n>2,500,000 and ≤5,000,000\nThermal Efficiency\nAppendix A, Section 2.\n\n\n\u2003\n\n>5,000,000\nThermal Efficiency\nAppendix A, Section 2.OR\nAppendix A, Section 3. with Section 2.4.3.2.\n\n\n\n*\u2009Equipment classes for commercial packaged boilers as of July 22, 2009 (74 FR 36355) distinguish between gas-fired natural draft and all other gas-fired (except natural draft).\n\n(c) Field Tests. The field test provisions of appendix A may be used only to test a unit of commercial packaged boiler with rated input greater than 5,000,000 Btu/h.\n[81 FR 89305, Dec. 9, 2016]\n\n\nEnergy Efficiency Standards\n\n')]
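The output above is the text of every XML element with the tags stripped, which is why it reads as a run of short lines. As a rough, standard-library illustration of that effect (not the BeautifulSoup/lxml path the loader uses), the sketch below flattens a small XML string; the element names in `xml_text` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML snippet; the tag names are made up for illustration.
xml_text = """<granule>
  <title_number>10</title_number>
  <title_text>Energy</title_text>
  <section>
    <number>431.86</number>
    <subject>Uniform test method.</subject>
  </section>
</granule>"""

root = ET.fromstring(xml_text)
# itertext() walks every text node in document order; drop whitespace-only ones.
lines = [text.strip() for text in root.itertext() if text.strip()]
print("\n".join(lines))
```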
Lazy Load
You can use lazy loading to load only one page at a time and so minimize memory requirements.
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
print(pages[0].page_content[:100])
print(pages[0].metadata)
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
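The memory benefit of lazy loading comes from producing pages on demand instead of building the whole list first. A minimal generator sketch of that behavior follows; `fake_fetch` and `lazy_pages` are hypothetical stand-ins and no network access takes place.

```python
from typing import Iterator

fetched = []  # records which URLs were actually "fetched"


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request; no network access."""
    fetched.append(url)
    return f"content of {url}"


def lazy_pages(urls) -> Iterator[str]:
    """Yield one page at a time instead of building the full list up front."""
    for url in urls:
        yield fake_fetch(url)


urls = [f"https://www.example.com/{i}" for i in range(1000)]
pages_iter = lazy_pages(urls)
first = next(pages_iter)  # only the first URL has been fetched at this point
print(first, len(fetched))
```

Appending every yielded document to a list, as in the loop above, keeps them all in memory; the saving only materializes when each document is processed and then discarded.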
Async
pages = []
async for doc in loader.alazy_load():
    pages.append(doc)
print(pages[0].page_content[:100])
print(pages[0].metadata)
Fetching pages: 100%|###########################################################################| 1/1 [00:00<00:00, 10.51it/s]
10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
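A top-level `async for` like the one above works in a notebook, which supports top-level await, but in a plain Python script it has to live inside a coroutine that is driven by `asyncio.run`. The sketch below shows that pattern with a hypothetical async generator, `alazy_pages`, standing in for `loader.alazy_load()` so that it runs without network access.

```python
import asyncio


async def alazy_pages(urls):
    """Hypothetical async generator standing in for loader.alazy_load()."""
    for url in urls:
        await asyncio.sleep(0)  # placeholder for awaiting a real request
        yield {"source": url}


async def main():
    collected = []
    async for doc in alazy_pages(["https://www.example.com/"]):
        collected.append(doc)
    return collected


collected = asyncio.run(main())
print(collected[0])
```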
Using proxies
Sometimes you might need to use proxies to get around IP blocks. You can pass a dictionary of proxies to the loader (and to requests underneath) to use them.
loader = WebBaseLoader(
    "https://www.walmart.com/search?q=parrots",
    # Replace {username} and {password} with your proxy credentials.
    proxies={
        "http": "http://{username}:{password}@proxy.service.com:6666/",
        "https": "https://{username}:{password}@proxy.service.com:6666/",
    },
)
docs = loader.load()
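Proxy credentials that contain characters such as `@` or `:` need to be percent-encoded before they are placed in the proxy URL, otherwise the URL is parsed incorrectly. The helper below is a hypothetical sketch (`build_proxies` is not part of any library) that builds a requests-style proxies mapping with escaped credentials; it assumes a single HTTP proxy endpoint serves both schemes.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    """Return a requests-style proxies mapping with escaped credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy = f"http://{user}:{pwd}@{host}:{port}/"
    # Assumption: the same HTTP proxy endpoint handles both schemes.
    return {"http": proxy, "https": proxy}


proxies = build_proxies("alice", "p@ss:word", "proxy.service.com", 6666)
print(proxies["http"])
```

The resulting dictionary could then be passed as the `proxies` argument shown above.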
API reference
For detailed documentation of all WebBaseLoader features and configurations, head to the API reference: https://langchain-python.dev.org.tw/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html