OpenAI metadata tagger
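It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for a more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.
The OpenAIMetadataTagger document transformer automates this process by extracting metadata from each provided document according to a provided schema. It uses a configurable OpenAI Functions-powered chain under the hood, so if you pass a custom LLM instance, it must be an OpenAI model with functions support.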
**Note:** This document transformer works best with complete documents, so run it on whole documents first, before any other splitting or processing!
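As a rough illustration of the ordering recommended in the note, the sketch below tags first and splits second, so every chunk inherits metadata that was derived from the whole text. It uses only the standard library: `Doc` and `split_keeping_metadata` are hypothetical stand-ins, not LangChain classes, and the sample text and metadata values are written by hand rather than produced by a model.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_keeping_metadata(doc: Doc, chunk_size: int) -> list[Doc]:
    """Cut the text into fixed-size pieces; each piece gets a copy of the parent metadata."""
    return [
        Doc(doc.page_content[i : i + chunk_size], dict(doc.metadata))
        for i in range(0, len(doc.page_content), chunk_size)
    ]


# Metadata as it might look after tagging the complete text (hand-written here).
tagged = Doc(
    "Review of Example Film\nBy A. Critic\n\n"
    "A slow start, but the final act is worth the wait. 3 out of 5 stars.",
    {"movie_title": "Example Film", "critic": "A. Critic"},
)

chunks = split_keeping_metadata(tagged, chunk_size=40)

# The last chunk no longer mentions the critic in its text,
# but it still carries the critic in its metadata.
print("A. Critic" in chunks[-1].page_content)
print(chunks[-1].metadata["critic"])
```

Had the text been split first, a tagger looking only at the later chunks would have had no byline or title to extract.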
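For example, let's say you wanted to index a set of movie reviews. You could initialize the document transformer with a valid JSON Schema object as follows: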
from langchain_community.document_transformers.openai_functions import (
    create_metadata_tagger,
)
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI

schema = {
    "properties": {
        "movie_title": {"type": "string"},
        "critic": {"type": "string"},
        "tone": {"type": "string", "enum": ["positive", "negative"]},
        "rating": {
            "type": "integer",
            "description": "The number of stars the critic rated the movie",
        },
    },
    "required": ["movie_title", "critic", "tone"],
}

# Must be an OpenAI model that supports functions
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")

document_transformer = create_metadata_tagger(metadata_schema=schema, llm=llm)
You can then simply pass the document transformer a list of documents, and it will extract metadata from the contents:
original_documents = [
    Document(
        page_content="Review of The Bee Movie\nBy Roger Ebert\n\nThis is the greatest movie ever made. 4 out of 5 stars."
    ),
    Document(
        page_content="Review of The Godfather\nBy Anonymous\n\nThis movie was super boring. 1 out of 5 stars.",
        metadata={"reliable": False},
    ),
]

enhanced_documents = document_transformer.transform_documents(original_documents)

import json

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)
Review of The Bee Movie
By Roger Ebert

This is the greatest movie ever made. 4 out of 5 stars.

{"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}

---------------

Review of The Godfather
By Anonymous

This movie was super boring. 1 out of 5 stars.

{"movie_title": "The Godfather", "critic": "Anonymous", "tone": "negative", "rating": 1, "reliable": false}
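The new documents can then be further processed by a text splitter before being loaded into a vector store. Extracted fields will not overwrite existing metadata.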
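The no-overwrite rule can be pictured with plain dictionaries. The following is a minimal sketch of that precedence, not the library's code: `merge_metadata` is a hypothetical helper, and the pre-existing `"critic": "Staff"` entry is invented only to show a key collision.

```python
def merge_metadata(extracted: dict, existing: dict) -> dict:
    """Combine extracted fields with existing metadata; existing keys win."""
    # In a dict display, later unpackings override earlier ones,
    # so unpacking `existing` last preserves the caller's values.
    return {**extracted, **existing}


# Fields as a tagger might extract them (hand-written for this sketch).
extracted = {
    "movie_title": "The Godfather",
    "critic": "Anonymous",
    "tone": "negative",
    "rating": 1,
}
# Metadata already attached to the document; "critic" collides on purpose.
existing = {"reliable": False, "critic": "Staff"}

merged = merge_metadata(extracted, existing)
print(merged["critic"])    # the pre-existing value is kept
print(merged["reliable"])  # keys that only exist beforehand survive too
```

Under this rule, new keys are added alongside the old ones, and a key present on both sides keeps its original value.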
You can also initialize the document transformer with a Pydantic schema:
from typing import Literal

from pydantic import BaseModel, Field


class Properties(BaseModel):
    movie_title: str
    critic: str
    tone: Literal["positive", "negative"]
    rating: int = Field(description="Rating out of 5 stars")


document_transformer = create_metadata_tagger(Properties, llm)
enhanced_documents = document_transformer.transform_documents(original_documents)

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)
Review of The Bee Movie
By Roger Ebert

This is the greatest movie ever made. 4 out of 5 stars.

{"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}

---------------

Review of The Godfather
By Anonymous

This movie was super boring. 1 out of 5 stars.

{"movie_title": "The Godfather", "critic": "Anonymous", "tone": "negative", "rating": 1, "reliable": false}
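Customization
You can pass the standard LLMChain arguments for the underlying tagging chain to the document transformer constructor. For example, if you wanted to ask the LLM to focus on specific details in the input documents, or to extract metadata in a certain style, you could pass in a custom prompt: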
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    """Extract relevant information from the following text.
Anonymous critics are actually Roger Ebert.
{input}
"""
)

document_transformer = create_metadata_tagger(schema, llm, prompt=prompt)
enhanced_documents = document_transformer.transform_documents(original_documents)

print(
    *[d.page_content + "\n\n" + json.dumps(d.metadata) for d in enhanced_documents],
    sep="\n\n---------------\n\n",
)
Review of The Bee Movie
By Roger Ebert

This is the greatest movie ever made. 4 out of 5 stars.

{"movie_title": "The Bee Movie", "critic": "Roger Ebert", "tone": "positive", "rating": 4}

---------------

Review of The Godfather
By Anonymous

This movie was super boring. 1 out of 5 stars.

{"movie_title": "The Godfather", "critic": "Roger Ebert", "tone": "negative", "rating": 1, "reliable": false}
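To see roughly what text the model receives under the custom prompt, the template can be rendered by hand. This sketch assumes the template's default f-string style formatting, which for a single `{input}` placeholder behaves like Python's `str.format`; it does not call LangChain or OpenAI, and the names `TEMPLATE` and `rendered` are illustrative only.

```python
# Same template text as the custom prompt in this section.
TEMPLATE = """Extract relevant information from the following text.
Anonymous critics are actually Roger Ebert.
{input}
"""

review = (
    "Review of The Godfather\nBy Anonymous\n\n"
    "This movie was super boring. 1 out of 5 stars."
)

# Fill the single {input} placeholder with the document text.
rendered = TEMPLATE.format(input=review)
print(rendered)
```

The extra instruction travels with every document, which is why the second review is attributed to Roger Ebert in the output even though its text says "Anonymous".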