
How to load web pages

This guide covers how to load web pages into the LangChain Document format that we use downstream. Web pages contain text, images, and other multimedia elements, and are typically represented with HTML. They may include links to other pages or resources.

LangChain integrates with a host of parsers that are appropriate for web pages. The right parser will depend on your needs. Below we demonstrate two possibilities:

  • Simple and fast parsing, in which we recover one Document per web page, with its content represented as a "flattened" string;
  • Advanced parsing, in which we recover multiple Document objects per page, allowing one to identify and traverse sections, links, tables, and other structures.

Setup

For the "simple and fast" parsing, we will need the langchain-community and beautifulsoup4 libraries:

%pip install -qU langchain-community beautifulsoup4

For advanced parsing, we will use langchain-unstructured:

%pip install -qU langchain-unstructured

Simple and fast text extraction

If you are looking for a simple string representation of the text embedded in a web page, the method below is appropriate. It will return a list of Document objects - one per page - containing a single string of the page's text. Under the hood it uses the beautifulsoup4 Python library.

LangChain document loaders implement lazy_load and its async variant, alazy_load, which return iterators of Document objects. We will use these below.

import bs4
from langchain_community.document_loaders import WebBaseLoader

page_url = "https://langchain-python.dev.org.tw/docs/how_to/chatbots_memory/"

loader = WebBaseLoader(web_paths=[page_url])
docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

assert len(docs) == 1
doc = docs[0]
API Reference: WebBaseLoader
print(f"{doc.metadata}\n")
print(doc.page_content[:500].strip())
{'source': 'https://langchain-python.dev.org.tw/docs/how_to/chatbots_memory/', 'title': 'How to add memory to chatbots | 🦜️🔗 LangChain', 'description': 'A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:', 'language': 'en'}

How to add memory to chatbots | 🦜️🔗 LangChain







Skip to main contentShare your thoughts on AI agents. Take the 3-min survey.IntegrationsAPI ReferenceMoreContributingPeopleLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchIntroductionTutorialsBuild a Question Answering application over a Graph DatabaseTutorialsBuild a Simple LLM Application with LCELBuild a Query Analysis SystemBuild a ChatbotConversational RAGBuild an Extraction ChainBuild an AgentTaggingd

This is essentially a dump of the text from the page's HTML. It may contain extraneous information like headings and navigation bars. If you are familiar with the expected HTML, you can specify desired <div> classes and other parameters via BeautifulSoup. Below we parse only the body text of the article:

loader = WebBaseLoader(
    web_paths=[page_url],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(class_="theme-doc-markdown markdown"),
    },
    bs_get_text_kwargs={"separator": " | ", "strip": True},
)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

assert len(docs) == 1
doc = docs[0]
print(f"{doc.metadata}\n")
print(doc.page_content[:500])
{'source': 'https://langchain-python.dev.org.tw/docs/how_to/chatbots_memory/'}

How to add memory to chatbots | A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including: | Simply stuffing previous messages into a chat model prompt. | The above, but trimming old messages to reduce the amount of distracting information the model has to deal with. | More complex modifications like synthesizing summaries for long running conversations. | We'll go into more detail on a few techniq
print(doc.page_content[-500:])
a greeting. Nemo then asks the AI how it is doing, and the AI responds that it is fine.'), | HumanMessage(content='What did I say my name was?'), | AIMessage(content='You introduced yourself as Nemo. How can I assist you today, Nemo?')] | Note that invoking the chain again will generate another summary generated from the initial summary plus new messages and so on. You could also design a hybrid approach where a certain number of messages are retained in chat history while others are summarized.

Note that this requires upfront technical knowledge of how the body text is represented in the underlying HTML.

We can parameterize WebBaseLoader with a variety of settings, allowing for specification of request headers, rate limits, and parsers and other kwargs for BeautifulSoup. See its API reference for detail.

Advanced parsing

This method is appropriate if we want more granular control or processing of the page content. Below, instead of generating one Document per page and controlling its content via BeautifulSoup, we generate multiple Document objects representing distinct structures on a page. These structures can include section titles and their corresponding body texts, lists or enumerations, tables, and more.

Under the hood it uses the langchain-unstructured library. See the integration docs for more information about using Unstructured with LangChain.

from langchain_unstructured import UnstructuredLoader

page_url = "https://langchain-python.dev.org.tw/docs/how_to/chatbots_memory/"
loader = UnstructuredLoader(web_url=page_url)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)
API Reference: UnstructuredLoader

Note that with no advance knowledge of the page HTML structure, we recover a natural organization of the body text:

for doc in docs[:5]:
    print(doc.page_content)
How to add memory to chatbots
A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:
Simply stuffing previous messages into a chat model prompt.
The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
More complex modifications like synthesizing summaries for long running conversations.

Extracting content from specific sections

Each Document object represents an element of the page. Its metadata contains useful information, such as its category:

for doc in docs[:5]:
    print(f'{doc.metadata["category"]}: {doc.page_content}')
Title: How to add memory to chatbots
NarrativeText: A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:
ListItem: Simply stuffing previous messages into a chat model prompt.
ListItem: The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
ListItem: More complex modifications like synthesizing summaries for long running conversations.

Elements may also have parent-child relationships - for example, a paragraph might belong to a section with a title. If a section is of particular interest (e.g., for indexing), we can isolate the corresponding Document objects.

As an example, below we load the content of the "Setup" sections for two web pages:

from typing import List

from langchain_core.documents import Document


async def _get_setup_docs_from_url(url: str) -> List[Document]:
    loader = UnstructuredLoader(web_url=url)

    setup_docs = []
    parent_id = -1
    async for doc in loader.alazy_load():
        if doc.metadata["category"] == "Title" and doc.page_content.startswith("Setup"):
            parent_id = doc.metadata["element_id"]
        if doc.metadata.get("parent_id") == parent_id:
            setup_docs.append(doc)

    return setup_docs


page_urls = [
    "https://langchain-python.dev.org.tw/docs/how_to/chatbots_memory/",
    "https://langchain-python.dev.org.tw/docs/how_to/chatbots_tools/",
]
setup_docs = []
for url in page_urls:
    page_setup_docs = await _get_setup_docs_from_url(url)
    setup_docs.extend(page_setup_docs)
API Reference: Document
from collections import defaultdict

setup_text = defaultdict(str)

for doc in setup_docs:
    url = doc.metadata["url"]
    setup_text[url] += f"{doc.page_content}\n"

dict(setup_text)
{'https://langchain-python.dev.org.tw/docs/how_to/chatbots_memory/': "You'll need to install a few packages, and have your OpenAI API key set as an environment variable named OPENAI_API_KEY:\n%pip install --upgrade --quiet langchain langchain-openai\n\n# Set env var OPENAI_API_KEY or load from a .env file:\nimport dotenv\n\ndotenv.load_dotenv()\n[33mWARNING: You are using pip version 22.0.4; however, version 23.3.2 is available.\nYou should consider upgrading via the '/Users/jacoblee/.pyenv/versions/3.10.5/bin/python -m pip install --upgrade pip' command.[0m[33m\n[0mNote: you may need to restart the kernel to use updated packages.\n",
'https://langchain-python.dev.org.tw/docs/how_to/chatbots_tools/': "For this guide, we'll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you're using Tavily.\nYou'll need to sign up for an account on the Tavily website, and install the following packages:\n%pip install --upgrade --quiet langchain-community langchain-openai tavily-python\n\n# Set env var OPENAI_API_KEY or load from a .env file:\nimport dotenv\n\ndotenv.load_dotenv()\nYou will also need your OpenAI key set as OPENAI_API_KEY and your Tavily API key set as TAVILY_API_KEY.\n"}

Vector search over page contents

Once we have loaded page contents into LangChain Document objects, we can index them (e.g., for a RAG application) in the usual way. Below we use OpenAI embeddings, although any LangChain embeddings model will suffice.

%pip install -qU langchain-openai
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vector_store = InMemoryVectorStore.from_documents(setup_docs, OpenAIEmbeddings())
retrieved_docs = vector_store.similarity_search("Install Tavily", k=2)
for doc in retrieved_docs:
    print(f'Page {doc.metadata["url"]}: {doc.page_content[:300]}\n')
Page https://langchain-python.dev.org.tw/docs/how_to/chatbots_tools/: You'll need to sign up for an account on the Tavily website, and install the following packages:

Page https://langchain-python.dev.org.tw/docs/how_to/chatbots_tools/: For this guide, we'll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you're using Tavily.

Other web page loaders

For a list of available LangChain web page loaders, please see this table.

