
Alibaba Cloud MaxCompute

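Alibaba Cloud MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data import solutions and distributed computing models, enabling users to query massive datasets efficiently, reduce production costs, and ensure data security.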

MaxComputeLoader lets you execute a MaxCompute SQL query and load the results as one document per row.

%pip install --upgrade --quiet  pyodps
Collecting pyodps
Downloading pyodps-0.11.4.post0-cp39-cp39-macosx_10_9_universal2.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 1.7 MB/s eta 0:00:00
Requirement already satisfied: charset-normalizer>=2 in /Users/newboy/anaconda3/envs/langchain/lib/python3.9/site-packages (from pyodps) (3.1.0)
Requirement already satisfied: urllib3<2.0,>=1.26.0 in /Users/newboy/anaconda3/envs/langchain/lib/python3.9/site-packages (from pyodps) (1.26.15)
Requirement already satisfied: idna>=2.5 in /Users/newboy/anaconda3/envs/langchain/lib/python3.9/site-packages (from pyodps) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /Users/newboy/anaconda3/envs/langchain/lib/python3.9/site-packages (from pyodps) (2023.5.7)
Installing collected packages: pyodps
Successfully installed pyodps-0.11.4.post0

Basic Usage

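To instantiate the loader you need a SQL query to execute, your MaxCompute endpoint and project name, and your access ID and secret access key. The access ID and secret access key can either be passed in directly via the access_id and secret_access_key parameters, or set as the environment variables MAX_COMPUTE_ACCESS_ID and MAX_COMPUTE_SECRET_ACCESS_KEY.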
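The two ways of supplying credentials can be pictured with a small pure-Python sketch. The helper name resolve_credentials and its env parameter are hypothetical and are not part of langchain_community or pyodps. The rule that explicit arguments take precedence over the environment is also an assumption made for illustration. Only the two environment variable names come from the text above.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_credentials(
    access_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Hypothetical helper: use explicit arguments, else fall back to env vars."""
    # Allow a custom mapping to be injected so the lookup is easy to test.
    env = os.environ if env is None else env

    # Assumption: an explicitly passed value wins over the environment.
    access_id = access_id or env.get("MAX_COMPUTE_ACCESS_ID")
    secret_access_key = secret_access_key or env.get("MAX_COMPUTE_SECRET_ACCESS_KEY")

    if not access_id or not secret_access_key:
        raise ValueError(
            "Pass access_id and secret_access_key, or set "
            "MAX_COMPUTE_ACCESS_ID and MAX_COMPUTE_SECRET_ACCESS_KEY."
        )
    return access_id, secret_access_key
```

Keeping the keys in environment variables avoids hard-coding secrets in a notebook or script that may be shared.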

from langchain_community.document_loaders import MaxComputeLoader
base_query = """
SELECT *
FROM (
    SELECT 1 AS id, 'content1' AS content, 'meta_info1' AS meta_info
    UNION ALL
    SELECT 2 AS id, 'content2' AS content, 'meta_info2' AS meta_info
    UNION ALL
    SELECT 3 AS id, 'content3' AS content, 'meta_info3' AS meta_info
) mydata;
"""
endpoint = "<ENDPOINT>"
project = "<PROJECT>"
ACCESS_ID = "<ACCESS ID>"
SECRET_ACCESS_KEY = "<SECRET ACCESS KEY>"
loader = MaxComputeLoader.from_params(
    base_query,
    endpoint,
    project,
    access_id=ACCESS_ID,
    secret_access_key=SECRET_ACCESS_KEY,
)
data = loader.load()
print(data)
[Document(page_content='id: 1\ncontent: content1\nmeta_info: meta_info1', metadata={}), Document(page_content='id: 2\ncontent: content2\nmeta_info: meta_info2', metadata={}), Document(page_content='id: 3\ncontent: content3\nmeta_info: meta_info3', metadata={})]
print(data[0].page_content)
id: 1
content: content1
meta_info: meta_info1
print(data[0].metadata)
{}

Specifying Which Columns Are Content vs. Metadata

You can use the page_content_columns and metadata_columns parameters to configure which subset of columns is loaded as the contents of the Document and which is loaded as its metadata.

loader = MaxComputeLoader.from_params(
    base_query,
    endpoint,
    project,
    page_content_columns=["content"],  # Specify Document page content
    metadata_columns=["id", "meta_info"],  # Specify Document metadata
    access_id=ACCESS_ID,
    secret_access_key=SECRET_ACCESS_KEY,
)
data = loader.load()
print(data[0].page_content)
content: content1
print(data[0].metadata)
{'id': 1, 'meta_info': 'meta_info1'}
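
To make the column split concrete, the following pure-Python sketch reproduces the outputs shown on this page for a single row. It is an illustration inferred from those outputs, not the library's actual code. The function name row_to_document_parts is hypothetical, and the defaults (all columns as content, no metadata) are assumed from the Basic Usage output.

```python
from typing import Any, Dict, List, Optional, Tuple


def row_to_document_parts(
    row: Dict[str, Any],
    page_content_columns: Optional[List[str]] = None,
    metadata_columns: Optional[List[str]] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical sketch: split one result row into (page_content, metadata)."""
    # Assumed defaults: every column becomes content, and metadata stays empty.
    content_cols = list(row) if page_content_columns is None else page_content_columns
    meta_cols = [] if metadata_columns is None else metadata_columns

    # Content is rendered as one "column: value" line per selected column.
    page_content = "\n".join(
        f"{name}: {value}" for name, value in row.items() if name in content_cols
    )
    # Metadata keeps the original Python values rather than strings.
    metadata = {name: value for name, value in row.items() if name in meta_cols}
    return page_content, metadata


row = {"id": 1, "content": "content1", "meta_info": "meta_info1"}
print(row_to_document_parts(row))
print(row_to_document_parts(row, ["content"], ["id", "meta_info"]))
```

Moving identifiers such as id into metadata keeps them out of the text that is later embedded or searched, while still leaving them available for filtering.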
