Async Chromium
Chromium is one of the browsers supported by Playwright, a library for browser automation.
Calling p.chromium.launch(headless=True) launches a headless instance of Chromium.
Headless mode means the browser runs without a graphical user interface.
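For orientation, here is a minimal sketch of what such a headless launch can look like when Playwright's async API is used directly. It assumes p is the object yielded by async_playwright(); the function name fetch_html is hypothetical, and this is not presented as the loader's actual implementation.

```python
import asyncio


async def fetch_html(url: str) -> str:
    """Return the rendered HTML of `url` using headless Chromium (hypothetical helper)."""
    # Imported inside the function so this module can still be imported
    # on machines where Playwright has not been installed yet.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        # headless=True: no browser window is opened.
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)
            # page.content() returns the full HTML of the page, doctype included.
            return await page.content()
        finally:
            # Always release the browser process, even if navigation fails.
            await browser.close()
```

From a plain script (not a notebook) this could be driven with asyncio.run(fetch_html("https://docs.smith.langchain.com/")), provided the browser binaries have been installed with playwright install.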
In the example below, we use AsyncChromiumLoader to load the page, and then Html2TextTransformer to strip out the HTML tags and other semantic information.
%pip install --upgrade --quiet playwright beautifulsoup4 html2text
!playwright install
Note: If you are using Jupyter notebooks, you might also need to install and apply nest_asyncio before loading the documents, like this:
!pip install nest-asyncio
import nest_asyncio
nest_asyncio.apply()
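Why this step can be necessary: a notebook kernel already runs an asyncio event loop, and asyncio.run() refuses to start a second loop from inside a running one. The standard-library sketch below reproduces that error without any browser. The names fetch, load_sync, and notebook_cell are hypothetical stand-ins, and the assumption that a synchronous loader call wraps a coroutine this way is an illustration, not a statement about every version of the loader.

```python
import asyncio


async def fetch() -> str:
    # Stand-in for an asynchronous page load.
    await asyncio.sleep(0)
    return "<html></html>"


def load_sync() -> str:
    # A synchronous wrapper that drives the coroutine with asyncio.run().
    coro = fetch()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        # Close the unused coroutine to avoid a "never awaited" warning.
        coro.close()
        raise


async def notebook_cell() -> str:
    # Simulates calling the wrapper while an event loop is already running,
    # which is the situation inside a Jupyter kernel.
    try:
        return load_sync()
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


outside = load_sync()                   # works: no loop is running yet
inside = asyncio.run(notebook_cell())   # fails inside the running loop
print(outside)
print(inside)
```

nest_asyncio.apply() patches asyncio so that such nested calls are allowed, which is why it is applied before the loader is used in a notebook.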
from langchain_community.document_loaders import AsyncChromiumLoader
urls = ["https://docs.smith.langchain.com/"]
loader = AsyncChromiumLoader(urls, user_agent="MyAppUserAgent")
docs = loader.load()
docs[0].page_content[0:100]
API Reference: AsyncChromiumLoader
'<!DOCTYPE html><html lang="en" dir="ltr" class="docs-wrapper docs-doc-page docs-version-2.0 plugin-d'
Now let's use the transformer to convert the documents into a more readable format:
from langchain_community.document_transformers import Html2TextTransformer
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)
docs_transformed[0].page_content[0:500]
API Reference: Html2TextTransformer
'Skip to main content\n\nGo to API Docs\n\nSearch`⌘``K`\n\nGo to App\n\n * Quick start\n * Tutorials\n\n * How-to guides\n\n * Concepts\n\n * Reference\n\n * Pricing\n * Self-hosting\n\n * LangGraph Cloud\n\n * * Quick start\n\nOn this page\n\n# Get started with LangSmith\n\n**LangSmith** is a platform for building production-grade LLM applications. It\nallows you to closely monitor and evaluate your application, so you can ship\nquickly and with confidence. Use of LangChain is not necessary - LangSmith\nworks on it'
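The output above still carries Markdown-style markers such as # and **, so the transformer converts the HTML rather than merely deleting tags. To make the simpler idea of tag stripping concrete, here is a deliberately reduced sketch built on the standard library's html.parser. It is a swapped-in technique for illustration only, not how Html2TextTransformer works, and the names TextExtractor and strip_tags are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and drops tags, scripts, and styles."""

    SKIPPED = {"script", "style"}  # content that is never shown to a reader
    BLOCK = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            # End of a block-level element: start a new output line.
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Normalize whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Get started</h1><p>LangSmith is a <b>platform</b>.</p>"
    "</body></html>"
)
print(strip_tags(sample))
```

Unlike this sketch, the transformer's output keeps some structure (headings, emphasis, list bullets), which is usually more useful for downstream processing than bare text.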