
Diffbot

Diffbot is a suite of ML-based products that make it easy to structure web data.

Diffbot's Extract API is a service that structures and normalizes data from web pages.

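Unlike traditional web scraping tools, Diffbot Extract does not require any rules to read the content on a page. It uses a computer vision model to classify a page into one of 20 possible types, and then transforms the raw HTML markup into JSON. The resulting structured JSON follows a consistent type-based ontology, which makes it easy to extract data from many different web sources with the same schema.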
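To make the rule-free approach concrete, the following is a minimal sketch of how a single extraction request could be assembled with only the standard library. The endpoint path `v3/analyze` and the `token` and `url` query parameters are assumptions based on Diffbot's public API documentation and are not stated in this guide; `build_extract_request` is a hypothetical helper name. The sketch only builds the request URL and does not send it.

```python
from urllib.parse import urlencode

# Assumed endpoint (not given in this guide); verify against Diffbot's API docs.
DIFFBOT_ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_extract_request(page_url: str, api_token: str) -> str:
    """Return a GET URL asking Diffbot to classify and extract one page.

    Note that no selectors or parsing rules are passed: only the page URL
    and the API token.
    """
    if not api_token:
        raise ValueError("A Diffbot API token is required")
    # urlencode percent-encodes the page URL so it is safe as a query value.
    query = urlencode({"token": api_token, "url": page_url})
    return f"{DIFFBOT_ANALYZE_ENDPOINT}?{query}"
```

The same request shape would apply to any page, which is the point of a type-based ontology: the caller does not need per-site rules.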


Overview

This guide covers how to use the Diffbot Extract API to extract data from a list of URLs into structured JSON that we can use downstream.

Setup

Start by installing the required packages.

%pip install --upgrade --quiet langchain-community

Diffbot's Extract API requires an API token. Follow these instructions to get a free API token, then set it as an environment variable.

%env DIFFBOT_API_TOKEN REPLACE_WITH_YOUR_TOKEN
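Because a missing or placeholder token only surfaces later as a failed API call, it can help to check the environment variable up front. The following is a minimal sketch; `get_diffbot_token` is a hypothetical helper and not part of LangChain or Diffbot.

```python
import os


def get_diffbot_token(env_var: str = "DIFFBOT_API_TOKEN") -> str:
    """Return the token from the environment, or raise a clear error."""
    token = os.environ.get(env_var, "").strip()
    # Treat the placeholder from the setup step as "not configured".
    if not token or token == "REPLACE_WITH_YOUR_TOKEN":
        raise RuntimeError(
            f"Set {env_var} to a valid Diffbot API token before loading URLs."
        )
    return token
```

The returned value could then be passed wherever this guide uses os.environ.get("DIFFBOT_API_TOKEN").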

Using the Document Loader

Import the DiffbotLoader module and instantiate it with a list of URLs and your Diffbot token.

import os

from langchain_community.document_loaders import DiffbotLoader

urls = [
    "https://langchain-python.dev.org.tw/",
]

loader = DiffbotLoader(urls=urls, api_token=os.environ.get("DIFFBOT_API_TOKEN"))
API Reference: DiffbotLoader

With the .load() method, you can see the documents that were loaded:

loader.load()
[Document(page_content="LangChain is a framework for developing applications powered by large language models (LLMs).\nLangChain simplifies every stage of the LLM application lifecycle:\nDevelopment: Build your applications using LangChain's open-source building blocks and components. Hit the ground running using third-party integrations and Templates.\nProductionization: Use LangSmith to inspect, monitor and evaluate your chains, so that you can continuously optimize and deploy with confidence.\nDeployment: Turn any chain into an API with LangServe.\nlangchain-core: Base abstractions and LangChain Expression Language.\nlangchain-community: Third party integrations.\nPartner packages (e.g. langchain-openai, langchain-anthropic, etc.): Some integrations have been further split into their own lightweight packages that only depend on langchain-core.\nlangchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.\nlanggraph: Build robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph.\nlangserve: Deploy LangChain chains as REST APIs.\nThe broader ecosystem includes:\nLangSmith: A developer platform that lets you debug, test, evaluate, and monitor LLM applications and seamlessly integrates with LangChain.\nGet started\nWe recommend following our Quickstart guide to familiarize yourself with the framework by building your first LangChain application.\nSee here for instructions on how to install LangChain, set up your environment, and start building.\nnote\nThese docs focus on the Python LangChain library. Head here for docs on the JavaScript LangChain library.\nUse cases\nIf you're looking to build something specific or are more of a hands-on learner, check out our use-cases. 
They're walkthroughs and techniques for common end-to-end tasks, such as:\nQuestion answering with RAG\nExtracting structured output\nChatbots\nand more!\nExpression Language\nLangChain Expression Language (LCEL) is the foundation of many of LangChain's components, and is a declarative way to compose chains. LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest “prompt + LLM” chain to the most complex chains.\nGet started: LCEL and its benefits\nRunnable interface: The standard interface for LCEL objects\nPrimitives: More on the primitives LCEL includes\nand more!\nEcosystem\n🦜🛠️ LangSmith\nTrace and evaluate your language model applications and intelligent agents to help you move from prototype to production.\n🦜🕸️ LangGraph\nBuild stateful, multi-actor applications with LLMs, built on top of (and intended to be used with) LangChain primitives.\n🦜🏓 LangServe\nDeploy LangChain runnables and chains as REST APIs.\nSecurity\nRead up on our Security best practices to make sure you're developing safely with LangChain.\nAdditional resources\nComponents\nLangChain provides standard, extendable interfaces and integrations for many different components, including:\nIntegrations\nLangChain is part of a rich ecosystem of tools that integrate with our framework and build on top of it. Check out our growing list of integrations.\nGuides\nBest practices for developing with LangChain.\nAPI reference\nHead to the reference section for full documentation of all classes and methods in the LangChain and LangChain Experimental Python packages.\nContributing\nCheck out the developer's guide for guidelines on contributing and help getting your dev environment set up.\nHelp us out by providing feedback on this documentation page:", metadata={'source': 'https://langchain-python.dev.org.tw/'})]
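The output above is a list of Document objects, each with a page_content string and a metadata dict holding the source URL. To use the result downstream, one option is to serialize it to JSON Lines. The following is a minimal sketch; `documents_to_jsonl` is a hypothetical helper, and it assumes only that each item exposes the `.page_content` and `.metadata` attributes shown in the output.

```python
import json
from typing import Any, Iterable


def documents_to_jsonl(docs: Iterable[Any]) -> str:
    """Serialize objects exposing .page_content and .metadata to JSON Lines."""
    lines = []
    for doc in docs:
        record = {
            "source": doc.metadata.get("source"),
            "num_chars": len(doc.page_content),
            "text": doc.page_content,
        }
        # ensure_ascii=False keeps emoji and non-Latin text readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

For example, documents_to_jsonl(loader.load()) would produce one JSON record per URL that was loaded.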

Transform Extracted Text into a Graph Document

Structured page content can be further processed with the DiffbotGraphTransformer class to extract entities and relationships into a graph.

%pip install --upgrade --quiet langchain-experimental
from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer

diffbot_nlp = DiffbotGraphTransformer(
    diffbot_api_key=os.environ.get("DIFFBOT_API_TOKEN")
)
graph_documents = diffbot_nlp.convert_to_graph_documents(loader.load())
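Before loading the result into a graph store, it can be useful to check what was extracted. The following is a minimal sketch that counts node and relationship types; `summarize_graph_documents` is a hypothetical helper, and it assumes each graph document exposes `.nodes` and `.relationships` whose items carry a `.type` attribute, as LangChain's graph document classes do at the time of writing.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize_graph_documents(graph_documents: Iterable[Any]) -> Dict[str, Dict[str, int]]:
    """Count node types and relationship types across graph documents."""
    node_types: Counter = Counter()
    relationship_types: Counter = Counter()
    for graph_doc in graph_documents:
        node_types.update(node.type for node in graph_doc.nodes)
        relationship_types.update(rel.type for rel in graph_doc.relationships)
    return {
        "node_types": dict(node_types),
        "relationship_types": dict(relationship_types),
    }
```

Calling summarize_graph_documents(graph_documents) gives a quick view of which entity and relationship types were found, and an empty summary suggests that nothing was extracted from the page.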

To continue loading the data into a knowledge graph, follow the DiffbotGraphTransformer guide.

