How to split JSON data
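This JSON splitter splits JSON data while allowing control over chunk sizes. It traverses the JSON data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole, but will split them if needed to keep chunks between min_chunk_size and max_chunk_size.
If a value is not nested JSON but a very large string, that string will not be split. If you need a hard cap on the chunk size, consider composing this splitter with a recursive text splitter applied to the resulting chunks. There is also an optional preprocessing step that makes lists splittable: each list is first converted to JSON (a dict) and then split like any other dict.
- How the text is split: by JSON value.
- How the chunk size is measured: by number of characters.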
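The depth-first traversal described above can be illustrated with a minimal sketch. This is not the library's implementation: the helper names `json_size`, `set_nested`, and `split_json_sketch` are hypothetical, and measuring size as the length of the `json.dumps` output is an assumption made for illustration.

```python
import json


def json_size(obj):
    # Assumption: chunk size is the character count of the serialized JSON.
    return len(json.dumps(obj))


def set_nested(target, path, value):
    # Write `value` at the nested key path, creating intermediate dicts as needed.
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data, max_chunk_size, min_chunk_size, path=(), chunks=None):
    # Hypothetical helper: walk `data` depth first and fill a list of chunk dicts.
    # `data` is expected to be a dict at the top level.
    if chunks is None:
        chunks = [{}]
    if isinstance(data, dict):
        for key, value in data.items():
            new_path = (*path, key)
            current_size = json_size(chunks[-1])
            remaining = max_chunk_size - current_size
            if json_size({key: value}) < remaining:
                # The whole value fits, so the nested object stays intact.
                set_nested(chunks[-1], new_path, value)
            else:
                # Start a new chunk only if the current one is already big enough,
                # then descend into the value to split it further.
                if current_size >= min_chunk_size:
                    chunks.append({})
                split_json_sketch(value, max_chunk_size, min_chunk_size, new_path, chunks)
    else:
        # A non-dict value (string, number, list) is never split further.
        set_nested(chunks[-1], path, data)
    return chunks


data = {"a": {"x": "1" * 10, "y": "2" * 10}, "b": {"z": "3" * 10}}
for chunk in split_json_sketch(data, max_chunk_size=40, min_chunk_size=10):
    print(chunk)
```

Each chunk in this sketch repeats the full key path down to the value it carries, which is why a chunk can be read on its own. Because non-dict values are written as-is, a single long string or list can still push a chunk past the maximum.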
%pip install -qU langchain-text-splitters
First, we load some JSON data:
import json
import requests
# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
Basic usage
Specify max_chunk_size to constrain chunk sizes:
from langchain_text_splitters import RecursiveJsonSplitter
splitter = RecursiveJsonSplitter(max_chunk_size=300)
API Reference: RecursiveJsonSplitter
To obtain JSON chunks, use the .split_json method:
# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)
for chunk in json_chunks[:3]:
    print(chunk)
{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'servers': [{'url': 'https://api.smith.langchain.com', 'description': 'LangSmith API endpoint.'}]}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.', 'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}
To obtain LangChain Document objects, use the .create_documents method:
# The splitter can also output documents
docs = splitter.create_documents(texts=[json_data])
for doc in docs[:3]:
    print(doc)
page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'
Or use .split_text to obtain the string content directly:
texts = splitter.split_text(json_data=json_data)
print(texts[0])
print(texts[1])
{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}
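When a hard cap is required, the strings returned by the splitter can be post-processed. The sketch below is a plain character slice standing in for a recursive text splitter, not the library's method; `enforce_hard_limit` is a hypothetical helper, and the sample chunks are made up for illustration.

```python
def enforce_hard_limit(texts, limit):
    # Hypothetical helper: slice any text longer than `limit` into pieces
    # of at most `limit` characters. Shorter texts pass through unchanged.
    if limit <= 0:
        raise ValueError("limit must be positive")
    pieces = []
    for text in texts:
        if len(text) <= limit:
            pieces.append(text)
        else:
            pieces.extend(text[i:i + limit] for i in range(0, len(text), limit))
    return pieces


# Made-up chunks: the second holds one long string value that a JSON-aware
# splitter would leave whole.
sample_chunks = ['{"id": 1}', '{"blob": "' + "x" * 50 + '"}']
capped = enforce_hard_limit(sample_chunks, limit=20)
print([len(piece) for piece in capped])
```

Note that the sliced pieces are no longer valid JSON on their own. A text splitter that prefers natural separators would likely give more readable pieces, at the cost of an extra dependency.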
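How to manage chunk sizes from list content
Note that one of the chunks in this example is larger than the specified max_chunk_size of 300. Reviewing one of these bigger chunks, we see that it contains a list object: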
print([len(text) for text in texts][:10])
print()
print(texts[3])
[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}
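A quick way to locate such chunks is to compare each length against the limit. The sketch below reuses the ten lengths printed above rather than calling the splitter; the variable names are illustrative.

```python
# Lengths copied from the output above (first ten chunks).
sizes = [171, 231, 126, 469, 210, 213, 237, 271, 191, 232]
max_chunk_size = 300

# Collect (index, length) pairs for chunks that exceed the limit.
oversized = [(index, size) for index, size in enumerate(sizes) if size > max_chunk_size]
print(oversized)
```

Among these ten chunks only index 3 exceeds the limit, which is the chunk holding the parameters list.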
By default, the JSON splitter does not split lists.
Specify convert_lists=True to preprocess the JSON, converting list content to dicts with index:item as key:val pairs:
texts = splitter.split_text(json_data=json_data, convert_lists=True)
Let's look at the sizes of the chunks. They are now all under the maximum:
print([len(text) for text in texts][:10])
[176, 236, 141, 203, 212, 221, 210, 213, 242, 291]
The list has been converted to a dict, but it retains all the needed contextual information even when split across many chunks:
print(texts[1])
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": {"0": "tracer-sessions"}, "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}
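The list-to-dict preprocessing can be sketched in a few lines. `lists_to_dicts` is a hypothetical helper written for illustration, not the library's own function; it assumes that indices become string keys, as the output above suggests.

```python
import json


def lists_to_dicts(data):
    # Hypothetical helper: recursively replace every list with a dict
    # keyed by the item's index as a string.
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(index): lists_to_dicts(item) for index, item in enumerate(data)}
    return data


# The first chunk from the earlier output, as a Python dict.
chunk = {
    "openapi": "3.1.0",
    "info": {"title": "LangSmith", "version": "0.1.0"},
    "servers": [
        {"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}
    ],
}
converted = lists_to_dicts(chunk)
print(converted["servers"])
print(len(json.dumps(chunk)), len(json.dumps(converted)))
```

The converted form is slightly longer when serialized (171 versus 176 characters here, matching the first lengths printed in each run above), but every list item now sits under its own key and can be placed in a separate chunk.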
# We can also look at the documents
docs[1]
Document(page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}')