ZeroxPDFLoader
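Overview
ZeroxPDFLoader is a document loader that leverages the Zerox library. Zerox converts PDF documents into images, processes them with a vision-capable language model, and generates a structured Markdown representation. The loader runs the conversion asynchronously and extracts documents at page level.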
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
ZeroxPDFLoader | langchain_community | ❌ | ❌ | ❌ |
Loader features
Source | Document lazy loading | Native async support |
---|---|---|
ZeroxPDFLoader | ✅ | ❌ |
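Setup
Credentials
The appropriate credentials must be set as environment variables. The loader supports a number of different models and model providers. See the examples below, or the Zerox documentation for the full list of supported models.
Installation
To use ZeroxPDFLoader, you need to install the zerox package. Also make sure langchain-community is installed.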
pip install zerox langchain-community
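Initialization
ZeroxPDFLoader enables PDF text extraction with vision-capable language models by converting each page into an image and processing it asynchronously. To use this loader, you need to specify a model and configure any environment variables Zerox requires, such as API keys.
If you are working in an environment such as a Jupyter Notebook, you may need nest_asyncio to handle asynchronous code. You can set it up as follows: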
import nest_asyncio
nest_asyncio.apply()
import os
# nest_asyncio is only necessary inside a Jupyter notebook
import nest_asyncio
from langchain_community.document_loaders.pdf import ZeroxPDFLoader
nest_asyncio.apply()
# Specify the URL or file path of the PDF you want to process.
# In this case, let's use a PDF from the web.
file_path = "https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf"
# Set up the environment variables required by the vision model.
# Replace the placeholder with your own key; never commit a real key.
os.environ["OPENAI_API_KEY"] = "your-api-key"
# Initialize ZeroxPDFLoader with the desired model.
# "gpt-4o-mini" is an OpenAI model and matches the OPENAI_API_KEY set above;
# an "azure/..." model would need Azure credentials instead.
loader = ZeroxPDFLoader(file_path=file_path, model="gpt-4o-mini")
API reference: ZeroxPDFLoader
Load
# Load the document and look at the first page:
documents = loader.load()
documents[0]
Document(metadata={'source': 'https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf', 'page': 1, 'num_pages': 5}, page_content='# OpenAI\n\nOpenAI is an AI research laboratory.\n\n#ai-models #ai\n\n## Revenue\n- **$1,000,000,000** \n 2023\n\n## Valuation\n- **$28,000,000,000** \n 2023\n\n## Growth Rate (Y/Y)\n- **400%** \n 2023\n\n## Funding\n- **$11,300,000,000** \n 2023\n\n---\n\n## Details\n- **Headquarters:** San Francisco, CA\n- **CEO:** Sam Altman\n\n[Visit Website](#)\n\n---\n\n## Revenue\n### ARR ($M) | Growth\n--- | ---\n$1000M | 456%\n$750M | \n$500M | \n$250M | $36M\n$0 | $200M\n\nis on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.\n\nOpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."\n\nThe reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. 
That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.\n\n---\n\n## Valuation\nIn April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.\n\nAssuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.\n\n---\n\n## Product\n\n### ChatGPT\n| Examples | Capabilities | Limitations |\n|---------------------------------|-------------------------------------|------------------------------------|\n| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |\n| "What can you give me for my dad\'s birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |\n| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" | |')
# Let's look at parsed first page
print(documents[0].page_content)
# OpenAI
OpenAI is an AI research laboratory.
#ai-models #ai
## Revenue
- **$1,000,000,000**
2023
## Valuation
- **$28,000,000,000**
2023
## Growth Rate (Y/Y)
- **400%**
2023
## Funding
- **$11,300,000,000**
2023
---
## Details
- **Headquarters:** San Francisco, CA
- **CEO:** Sam Altman
[Visit Website](#)
---
## Revenue
### ARR ($M) | Growth
--- | ---
$1000M | 456%
$750M |
$500M |
$250M | $36M
$0 | $200M
is on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.
OpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."
The reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.
---
## Valuation
In April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.
Assuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.
---
## Product
### ChatGPT
| Examples | Capabilities | Limitations |
|---------------------------------|-------------------------------------|------------------------------------|
| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |
| "What can you give me for my dad's birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |
| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" | |
Lazy load
The loader always fetches results lazily. The .load() method is equivalent to .lazy_load() with the results collected into a list.
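The relationship between the two methods can be sketched with a stand-in loader. ToyLoader and Page below are hypothetical names, not part of LangChain or Zerox; they only mimic the pattern in which load() drains the lazy_load() iterator into a list.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyLoader:
    """Hypothetical loader that mimics the lazy_load()/load() pattern."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = texts
        self.produced = 0  # counts how many pages were actually built

    def lazy_load(self) -> Iterator[Page]:
        # A generator: each page is built only when the caller asks for it.
        for number, text in enumerate(self.texts, start=1):
            self.produced += 1
            yield Page(text, {"page": number, "num_pages": len(self.texts)})

    def load(self) -> List[Page]:
        # Eager variant: drain the lazy iterator into a list.
        return list(self.lazy_load())


toy_loader = ToyLoader(["# Page one", "# Page two", "# Page three"])
first = next(toy_loader.lazy_load())      # only one page is built here
produced_after_first = toy_loader.produced
all_pages = toy_loader.load()             # builds all three pages
```

Iterating over lazy_load() directly is useful when you want to stop early or process pages one at a time instead of holding the whole list in memory.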
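API reference
ZeroxPDFLoader
This loader class is initialized with a file path and a model type, and supports custom configuration through zerox_kwargs for Zerox-specific parameters.
Arguments: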
- file_path (Union[str, Path]): Path to the PDF file.
- model (str): The vision-capable model to use for processing, in the format <provider>/<model>. Some examples of valid values:
model = "gpt-4o-mini" ## OpenAI model
model = "azure/gpt-4o-mini"
model = "gemini/gemini-1.5-flash"
model = "claude-3-opus-20240229"
model = "vertex_ai/gemini-1.5-flash-001"
  - See the Zerox documentation for more details.
  - Defaults to "gpt-4o-mini".
- **zerox_kwargs (dict): Additional Zerox-specific parameters such as API key, endpoint, etc. See the Zerox documentation.
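Methods:
- lazy_load: Returns an iterator of Document instances, each representing one page of the PDF, with metadata that includes the page number and the source.
See the full API documentation here.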
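Because each Document carries its page number in the metadata, the per-page results can be merged back into a single Markdown string. The following is a minimal sketch: join_pages is a hypothetical helper, the documents are simple stand-ins built with SimpleNamespace rather than real loader output, and the metadata keys (source, page, num_pages) are taken from the sample output shown earlier.

```python
from types import SimpleNamespace
from typing import Iterable


def join_pages(documents: Iterable, separator: str = "\n\n---\n\n") -> str:
    """Merge per-page documents into one Markdown string, ordered by page."""
    ordered = sorted(documents, key=lambda doc: doc.metadata["page"])
    return separator.join(doc.page_content for doc in ordered)


# Stand-in documents, deliberately out of order, mimicking the loader's metadata.
fake_documents = [
    SimpleNamespace(
        page_content="# Second",
        metadata={"source": "example.pdf", "page": 2, "num_pages": 2},
    ),
    SimpleNamespace(
        page_content="# First",
        metadata={"source": "example.pdf", "page": 1, "num_pages": 2},
    ),
]

combined_markdown = join_pages(fake_documents)
```

With real loader output, the same helper could be called as join_pages(loader.lazy_load()), since it only relies on the page_content and metadata attributes.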
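Notes
- Model compatibility: Zerox supports a range of vision-capable models. For the list of supported models and configuration details, see Zerox's GitHub documentation.
- Environment variables: Make sure to set the environment variables required by the Zerox documentation, such as API_KEY or endpoint details.
- Asynchronous processing: If you run into event-loop errors in Jupyter Notebooks, you may need to apply nest_asyncio, as shown in the Initialization section.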
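One way to catch missing credentials early is to check the environment before constructing the loader. The sketch below is illustrative only: provider_of and missing_env are hypothetical helpers, and the variable names in REQUIRED_ENV are assumptions based on common provider conventions, so confirm them against the Zerox documentation for your provider.

```python
import os
from typing import List, Mapping, Optional

# Assumed variable names per provider; verify against the Zerox documentation.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "azure": ["AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"],
    "gemini": ["GEMINI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}


def provider_of(model: str) -> str:
    """Extract the provider from a '<provider>/<model>' string."""
    if "/" in model:
        return model.split("/", 1)[0]
    # Bare model names: assume Anthropic for 'claude...' and OpenAI otherwise.
    return "anthropic" if model.startswith("claude") else "openai"


def missing_env(model: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the assumed-required variables that are unset or empty."""
    env = os.environ if env is None else env
    # Providers not listed in REQUIRED_ENV (e.g. vertex_ai) are not checked.
    required = REQUIRED_ENV.get(provider_of(model), [])
    return [name for name in required if not env.get(name)]


openai_only = {"OPENAI_API_KEY": "your-api-key"}
missing_for_openai = missing_env("gpt-4o-mini", openai_only)
missing_for_azure = missing_env("azure/gpt-4o-mini", openai_only)
```

In this sketch, an OpenAI key alone satisfies "gpt-4o-mini" but leaves all three Azure variables missing for "azure/gpt-4o-mini", which is the kind of mismatch worth checking before a long PDF conversion starts.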
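Troubleshooting
- RuntimeError: This event loop is already running: Use nest_asyncio.apply() to prevent async loop conflicts in environments such as Jupyter.
- Configuration errors: Verify that zerox_kwargs matches the arguments expected by your chosen model and that all required environment variables are set.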
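The event-loop conflict can be reproduced with the standard library alone. This sketch assumes the loader, like many synchronous wrappers around async libraries, starts its own event loop internally; loop_is_running and nested_call are hypothetical names used only for illustration. The exact error message depends on which asyncio entry point is used, so it may differ from the one quoted above.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def nested_call() -> str:
    # Simulates synchronous library code that calls asyncio.run()
    # while a loop (such as the one Jupyter runs) is already active.
    coro = asyncio.sleep(0)
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


running_outside = loop_is_running()   # False in a plain script
message = asyncio.run(nested_call())  # captures the RuntimeError text
```

In a plain script no loop is running, so the loader can start one freely. In a notebook a loop is already active, which is why nest_asyncio.apply() is needed there to allow the nested call.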
Additional resources
- Zerox documentation: Zerox GitHub repository
- LangChain document loaders: LangChain documentation