跳至主要內容

ScrapeGraph

此筆記本提供快速入門 ScrapeGraph 工具的概述。 如需所有 ScrapeGraph 功能和配置的詳細文檔,請前往API 參考

有關 ScrapeGraph AI 的更多資訊

概述

整合細節

類別套件可序列化JS 支援套件最新版本
SmartScraperToollangchain-scrapegraphPyPI - Version
MarkdownifyToollangchain-scrapegraphPyPI - Version
LocalScraperToollangchain-scrapegraphPyPI - Version
GetCreditsToollangchain-scrapegraphPyPI - Version

工具功能

工具目的輸入輸出
SmartScraperTool從網站提取結構化資料網址 + 提示JSON
MarkdownifyTool將網頁轉換為 Markdown網址Markdown 文字
LocalScraperTool從 HTML 內容提取資料HTML + 提示JSON
GetCreditsTool檢查 API 點數點數資訊

設定

此整合需要以下套件

%pip install --quiet -U langchain-scrapegraph
Note: you may need to restart the kernel to use updated packages.

憑證

您需要 ScrapeGraph AI API 金鑰才能使用這些工具。 請前往 scrapegraphai.com 取得。

import getpass
import os

if not os.environ.get("SGAI_API_KEY"):
os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key:\n")

設定 LangSmith 也有助於獲得最佳的可觀察性 (但不是必需的)

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

實例化

在此,我們展示如何實例化 ScrapeGraph 工具的實例

from langchain_scrapegraph.tools import (
GetCreditsTool,
LocalScraperTool,
MarkdownifyTool,
SmartScraperTool,
)

smartscraper = SmartScraperTool()
markdownify = MarkdownifyTool()
localscraper = LocalScraperTool()
credits = GetCreditsTool()

調用

使用參數直接調用

讓我們單獨嘗試每個工具

# SmartScraper
result = smartscraper.invoke(
{
"user_prompt": "Extract the company name and description",
"website_url": "https://scrapegraphai.com",
}
)
print("SmartScraper Result:", result)

# Markdownify
markdown = markdownify.invoke({"website_url": "https://scrapegraphai.com"})
print("\nMarkdownify Result (first 200 chars):", markdown[:200])

local_html = """
<html>
<body>
<h1>Company Name</h1>
<p>We are a technology company focused on AI solutions.</p>
<div class="contact">
<p>Email: contact@example.com</p>
<p>Phone: (555) 123-4567</p>
</div>
</body>
</html>
"""

# LocalScraper
result_local = localscraper.invoke(
{
"user_prompt": "Make a summary of the webpage and extract the email and phone number",
"website_html": local_html,
}
)
print("LocalScraper Result:", result_local)

# Check credits
credits_info = credits.invoke({})
print("\nCredits Info:", credits_info)
SmartScraper Result: {'company_name': 'ScrapeGraphAI', 'description': "ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis."}

Markdownify Result (first 200 chars): [![ScrapeGraphAI Logo](https://scrapegraphai.com/images/scrapegraphai_logo.svg)ScrapeGraphAI](https://scrapegraphai.com/)

PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up

Op
LocalScraper Result: {'company_name': 'Company Name', 'description': 'We are a technology company focused on AI solutions.', 'contact': {'email': 'contact@example.com', 'phone': '(555) 123-4567'}}

Credits Info: {'remaining_credits': 49679, 'total_credits_used': 914}

使用 ToolCall 調用

我們也可以使用模型產生的 ToolCall 調用該工具

model_generated_tool_call = {
"args": {
"user_prompt": "Extract the main heading and description",
"website_url": "https://scrapegraphai.com",
},
"id": "1",
"name": smartscraper.name,
"type": "tool_call",
}
smartscraper.invoke(model_generated_tool_call)
ToolMessage(content='{"main_heading": "Get the data you need from any website", "description": "Easily extract and gather information with just a few lines of code with a simple api. Turn websites into clean and usable structured data."}', name='SmartScraper', tool_call_id='1')

鏈結

讓我們將我們的工具與 LLM 一起使用來分析網站

pip install -qU langchain-openai
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig, chain

prompt = ChatPromptTemplate(
[
(
"system",
"You are a helpful assistant that can use tools to extract structured information from websites.",
),
("human", "{user_input}"),
("placeholder", "{messages}"),
]
)

llm_with_tools = llm.bind_tools([smartscraper], tool_choice=smartscraper.name)
llm_chain = prompt | llm_with_tools


@chain
def tool_chain(user_input: str, config: RunnableConfig):
input_ = {"user_input": user_input}
ai_msg = llm_chain.invoke(input_, config=config)
tool_msgs = smartscraper.batch(ai_msg.tool_calls, config=config)
return llm_chain.invoke({**input_, "messages": [ai_msg, *tool_msgs]}, config=config)


tool_chain.invoke(
"What does ScrapeGraph AI do? Extract this information from their website https://scrapegraphai.com"
)
AIMessage(content='ScrapeGraph AI is an AI-powered web scraping tool that efficiently extracts and converts website data into structured formats via a simple API. It caters to developers, data scientists, and AI researchers, offering features like easy integration, support for dynamic content, and scalability for large projects. It supports various website types, including business, e-commerce, and educational sites. Contact: contact@scrapegraphai.com.', additional_kwargs={'tool_calls': [{'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'function': {'arguments': '{"user_prompt":"Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.","website_url":"https://scrapegraphai.com"}', 'name': 'SmartScraper'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 480, 'total_tokens': 527, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_c7ca0ebaca', 'finish_reason': 'stop', 'logprobs': None}, id='run-45a12c86-d499-4273-8c59-0db926799bc7-0', tool_calls=[{'name': 'SmartScraper', 'args': {'user_prompt': 'Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.', 'website_url': 'https://scrapegraphai.com'}, 'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'type': 'tool_call'}], usage_metadata={'input_tokens': 480, 'output_tokens': 47, 'total_tokens': 527, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

API 參考

如需所有 ScrapeGraph 功能和配置的詳細文檔,請前往 Langchain API 參考:https://langchain-python.dev.org.tw/docs/integrations/tools/scrapegraph

或前往官方 SDK 儲存庫:https://github.com/ScrapeGraphAI/langchain-scrapegraph


此頁面是否有幫助?