Timescale Vector (Postgres)
Timescale Vector is PostgreSQL++ for AI applications. It enables you to efficiently store and query billions of vector embeddings in PostgreSQL.
PostgreSQL, also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance.
This notebook shows how to use the Postgres vector database (TimescaleVector) to perform self-querying. We will demonstrate the SelfQueryRetriever wrapped around a TimescaleVector vector store.
What is Timescale Vector?
Timescale Vector is PostgreSQL++ for AI applications.
Timescale Vector enables you to efficiently store and query millions of vector embeddings in PostgreSQL.
- Enhances pgvector with faster and more accurate similarity search on 1B+ vectors via a DiskANN-inspired indexing algorithm.
- Enables fast time-based vector search via automatic time-based partitioning and indexing.
- Provides a familiar SQL interface for querying vector embeddings and relational data (a brief sketch follows below).
Timescale Vector is cloud PostgreSQL for AI that scales with you from POC to production:
- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.
- Builds on rock-solid PostgreSQL foundations with enterprise-grade features like streaming backups and replication, high availability, and row-level security.
- Enables a worry-free experience with enterprise-grade security and compliance.
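As a concrete illustration of that SQL interface, the sketch below queries a pgvector-style embedding column with plain SQL via psycopg2. This is only a hypothetical example: the table name movie_embeddings, the columns contents, metadata, and embedding, and the placeholder connection URI are illustrative assumptions, not the schema Timescale Vector or LangChain actually creates.
# Hypothetical sketch: querying a pgvector embedding column with plain SQL.
# Table/column names and the connection URI are illustrative assumptions only.
import psycopg2

SERVICE_URL = "postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require"
query_vec = [0.1, 0.2, 0.3]  # stand-in for a real embedding vector

conn = psycopg2.connect(SERVICE_URL)
cur = conn.cursor()
cur.execute(
    """
    SELECT contents, metadata
    FROM movie_embeddings
    ORDER BY embedding <=> %s::vector  -- pgvector cosine distance
    LIMIT 3
    """,
    (str(query_vec),),
)
for contents, metadata in cur.fetchall():
    print(contents, metadata)
cur.close()
conn.close()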
How to access Timescale Vector
Timescale Vector is available on Timescale, the cloud PostgreSQL platform. (There is no self-hosted version at this time.)
LangChain users get a 90-day free trial for Timescale Vector.
- To get started, sign up for Timescale, create a new database, and follow this notebook!
- See the Timescale Vector explainer blog for more details and performance benchmarks.
- See the installation instructions for more details on using Timescale Vector in Python.
Creating a TimescaleVector vectorstore
First we'll want to create a Timescale Vector vectorstore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
NOTE: The self-query retriever requires you to have lark installed (pip install lark). We also need the timescale-vector package.
%pip install --upgrade --quiet lark
%pip install --upgrade --quiet timescale-vector
We'll use OpenAIEmbeddings in this example, so let's load your OpenAI API key.
# Get openAI api key by reading local .env file
# The .env file should contain a line starting with `OPENAI_API_KEY=sk-`
import os
from dotenv import find_dotenv, load_dotenv
_ = load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
# Alternatively, use getpass to enter the key in a prompt
# import os
# import getpass
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
To connect to your PostgreSQL database, you'll need your service URI, which can be found in the cheatsheet or .env file you downloaded after creating a new database.
If you haven't already, sign up for Timescale and create a new database.
The URI will look something like this: postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require
# Get the service url by reading local .env file
# The .env file should contain a line starting with `TIMESCALE_SERVICE_URL=postgresql://`
_ = load_dotenv(find_dotenv())
TIMESCALE_SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]
# Alternatively, use getpass to enter the key in a prompt
# import os
# import getpass
# TIMESCALE_SERVICE_URL = getpass.getpass("Timescale Service URL:")
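Optionally, you can sanity-check the connection before moving on. This is a minimal sketch, assuming you also have psycopg2-binary installed (it is not required for the rest of this notebook):
# Optional connectivity check (assumes `pip install psycopg2-binary`)
import psycopg2

conn = psycopg2.connect(TIMESCALE_SERVICE_URL)
cur = conn.cursor()
cur.execute("SELECT version();")
print(cur.fetchone()[0])
cur.close()
conn.close()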
from langchain_community.vectorstores.timescalevector import TimescaleVector
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
Here are the sample documents we'll use for this demo. The data is about movies; each document has page content and a metadata field holding information about the particular movie.
docs = [
Document(
page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
),
Document(
page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
),
Document(
page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
),
Document(
page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
),
Document(
page_content="Toys come alive and have a blast doing so",
metadata={"year": 1995, "genre": "animated"},
),
Document(
page_content="Three men walk into the Zone, three men walk out of the Zone",
metadata={
"year": 1979,
"director": "Andrei Tarkovsky",
"genre": "science fiction",
"rating": 9.9,
},
),
]
Finally, we'll create our Timescale Vector vectorstore. Note that the collection name will be the name of the PostgreSQL table in which the documents are stored.
COLLECTION_NAME = "langchain_self_query_demo"
vectorstore = TimescaleVector.from_documents(
embedding=embeddings,
documents=docs,
collection_name=COLLECTION_NAME,
service_url=TIMESCALE_SERVICE_URL,
)
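Optionally, before wiring up the self-query retriever, you can run a plain similarity search to confirm the documents were ingested. This uses the standard LangChain vectorstore API; the query string below is just an example.
# Optional sanity check: plain similarity search against the new vectorstore
found_docs = vectorstore.similarity_search("dinosaurs", k=2)
for doc in found_docs:
    print(doc.page_content, doc.metadata)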
Creating our self-querying retriever
Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields our documents support, along with a short description of the document contents.
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import OpenAI
# Give LLM info about the metadata fields
metadata_field_info = [
AttributeInfo(
name="genre",
description="The genre of the movie",
type="string or list[string]",
),
AttributeInfo(
name="year",
description="The year the movie was released",
type="integer",
),
AttributeInfo(
name="director",
description="The name of the movie director",
type="string",
),
AttributeInfo(
name="rating", description="A 1-10 rating for the movie", type="float"
),
]
document_content_description = "Brief summary of a movie"
# Instantiate the self-query retriever from an LLM
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
Self-querying retrieval with Timescale Vector
And now we can actually try using our retriever!
Run the queries below and note how you can specify a query, a filter, or a composite filter (filters with AND, OR) in natural language; the self-query retriever translates that query into SQL and performs the search on the Timescale Vector (Postgres) vectorstore.
This illustrates the power of the self-query retriever: you can use it to perform complex searches over your vectorstore without you or your users having to write any SQL directly!
# This example only specifies a relevant query
retriever.invoke("What are some movies about dinosaurs")
/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/libs/langchain/langchain/chains/llm.py:275: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.
warnings.warn(
query='dinosaur' filter=None limit=None
[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),
Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),
Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),
Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]
# This example only specifies a filter
retriever.invoke("I want to watch a movie rated higher than 8.5")
query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5) limit=None
[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),
Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),
Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'rating': 8.6, 'director': 'Satoshi Kon'}),
Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'rating': 8.6, 'director': 'Satoshi Kon'})]
# This example specifies a query and a filter
retriever.invoke("Has Greta Gerwig directed any movies about women")
query='women' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Greta Gerwig') limit=None
[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'rating': 8.3, 'director': 'Greta Gerwig'}),
Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'rating': 8.3, 'director': 'Greta Gerwig'})]
# This example specifies a composite filter
retriever.invoke("What's a highly rated (above 8.5) science fiction film?")
query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='rating', value=8.5), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction')]) limit=None
[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),
Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'})]
# This example specifies a query and composite filter
retriever.invoke(
"What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)
query='toys' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GT: 'gt'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2005), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='animated')]) limit=None
[Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]
Filter k
We can also use the self-query retriever to specify k, the number of documents to fetch.
We do this by passing enable_limit=True to the constructor.
retriever = SelfQueryRetriever.from_llm(
llm,
vectorstore,
document_content_description,
metadata_field_info,
enable_limit=True,
verbose=True,
)
# This example specifies a query with a LIMIT value
retriever.invoke("what are two movies about dinosaurs")
query='dinosaur' filter=None limit=2
[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),
Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7})]