
How to split JSON data

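This JSON splitter splits JSON data while letting you control the chunk size. It traverses the JSON data depth-first and builds smaller JSON chunks. It tries to keep nested JSON objects whole, but will split them when that is needed to keep chunks between min_chunk_size and max_chunk_size.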
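
The depth-first traversal described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's actual implementation: the helper names (json_size, set_nested, split_json_sketch) and the default sizes are hypothetical, and the sketch assumes the top-level input is a dict.

```python
import json


def json_size(obj):
    """Size of an object, measured as the character count of its JSON string."""
    return len(json.dumps(obj))


def set_nested(target, path, value):
    """Store value in target under the nested keys listed in path."""
    node = target
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value


def split_json_sketch(data, max_chunk_size=60, min_chunk_size=40, path=None, chunks=None):
    """Hypothetical depth-first splitter: returns a list of dict chunks."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict):
        for key, value in data.items():
            new_path = path + [key]
            current = json_size(chunks[-1])
            if json_size({key: value}) < max_chunk_size - current:
                # The whole item fits: keep the nested object intact.
                set_nested(chunks[-1], new_path, value)
            else:
                # Start a new chunk only if the current one is big enough,
                # then descend into the value to split it further.
                if current >= min_chunk_size:
                    chunks.append({})
                split_json_sketch(value, max_chunk_size, min_chunk_size, new_path, chunks)
    else:
        # A leaf value (string, number, list, ...) is never split.
        set_nested(chunks[-1], path, data)
    return chunks


sample = {"a": {"x": "1" * 30, "y": "2" * 30}, "b": "short"}
for chunk in split_json_sketch(sample, max_chunk_size=60, min_chunk_size=10):
    print(json_size(chunk), chunk)
```

Each chunk repeats the full key path down to the values it holds, which is why every chunk remains interpretable on its own.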

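If a value is not nested JSON but a very large string, that string will not be split. If you need a hard cap on the chunk size, consider composing this splitter with a recursive text splitter applied to the resulting chunks. There is also an optional preprocessing step that makes lists splittable by first converting them to JSON objects (dicts) and then splitting them as such.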
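
To illustrate what a hard cap means, here is a minimal sketch that post-processes a list of chunk strings so that none exceeds a character limit. It is a deliberately naive stand-in for a recursive text splitter (it cuts at fixed offsets rather than at natural separators), and the function name enforce_hard_limit is hypothetical.

```python
def enforce_hard_limit(texts, max_chars):
    """Return a new list in which no string is longer than max_chars.

    Strings within the limit pass through unchanged; longer ones are cut
    into consecutive fixed-width slices.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    limited = []
    for text in texts:
        # max(len(text), 1) keeps empty strings in the output as-is.
        for start in range(0, max(len(text), 1), max_chars):
            limited.append(text[start:start + max_chars])
    return limited


pieces = enforce_hard_limit(["abc", "x" * 10], max_chars=4)
print(pieces)
```

Note that the pieces cut from an oversized chunk are no longer valid JSON on their own, which is the trade-off of enforcing a hard limit at the text level.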

  1. How the text is split: by JSON value.
  2. How the chunk size is measured: by number of characters.
%pip install -qU langchain-text-splitters

First, we load some JSON data.

import json

import requests

# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

Basic usage

Specify max_chunk_size to constrain chunk sizes.

from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)

To obtain JSON chunks, use the .split_json method.

# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)

for chunk in json_chunks[:3]:
    print(chunk)
{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'servers': [{'url': 'https://api.smith.langchain.com', 'description': 'LangSmith API endpoint.'}]}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.', 'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}

To obtain LangChain Document objects, use the .create_documents method.

# The splitter can also output documents
docs = splitter.create_documents(texts=[json_data])

for doc in docs[:3]:
    print(doc)
page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'

Or use .split_text to obtain the string content directly.

texts = splitter.split_text(json_data=json_data)

print(texts[0])
print(texts[1])
{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}

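How to manage chunk sizes from list content

Note that in this example one of the chunks is larger than the specified max_chunk_size of 300. Looking at one of these larger chunks, we see that it contains a list object.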

print([len(text) for text in texts][:10])
print()
print(texts[3])
[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]

{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}
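
Rather than scanning the sizes by eye, the oversized chunks can be located programmatically. The following is a small sketch; the helper name oversized_chunks is hypothetical, and sample_texts is a stand-in built to have the first four lengths printed above rather than the real chunk strings.

```python
def oversized_chunks(texts, max_chunk_size):
    """Return (index, length) pairs for every text longer than max_chunk_size."""
    return [
        (index, len(text))
        for index, text in enumerate(texts)
        if len(text) > max_chunk_size
    ]


# Stand-in strings with the same lengths as the first four chunks above.
sample_texts = ["x" * n for n in (171, 231, 126, 469)]
print(oversized_chunks(sample_texts, max_chunk_size=300))  # [(3, 469)]
```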

By default, the JSON splitter does not split lists.

Specify convert_lists=True to preprocess the JSON, converting list content into dicts with index:item as the key:val pairs.

texts = splitter.split_text(json_data=json_data, convert_lists=True)
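
The list preprocessing can be pictured with the following plain-Python sketch. It is an illustration of the index:item conversion described above, not the library's code; the function name lists_to_dicts is hypothetical, and the use of string keys is an assumption chosen so the result serializes with keys such as "0".

```python
def lists_to_dicts(data):
    """Recursively replace every list with a dict keyed by the item's index."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        # Index keys are strings so the result serializes as {"0": ..., "1": ...}.
        return {str(index): lists_to_dicts(item) for index, item in enumerate(data)}
    return data


print(lists_to_dicts({"tags": ["tracer-sessions"]}))
# {'tags': {'0': 'tracer-sessions'}}
```

Once lists are dicts, the depth-first traversal can descend into them and split their items across chunks like any other nested object.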

Let's look at the chunk sizes. They are now all below the maximum.

print([len(text) for text in texts][:10])
[176, 236, 141, 203, 212, 221, 210, 213, 242, 291]

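The list has been converted to a dict, but all the needed contextual information is retained even when it is split across multiple chunks.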

print(texts[1])
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": {"0": "tracer-sessions"}, "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}
# We can also look at the documents (created earlier without convert_lists, so lists are unchanged)
docs[1]
Document(page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}')
