
Postgres Embedding

Postgres Embedding is an open-source vector similarity search for Postgres that uses Hierarchical Navigable Small Worlds (HNSW) for approximate nearest neighbor search.

It supports:

  • exact and approximate nearest neighbor search using HNSW
  • L2 distance (see the sketch below)
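
For intuition, the L2 (Euclidean) distance between two vectors is the square root of the summed squared differences of their components; a smaller distance means more similar embeddings. A minimal, dependency-free sketch:

import math

def l2_distance(a, b):
    # Euclidean (L2) distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(l2_distance([1.0, 0.0], [0.0, 1.0]))  # sqrt(2) ≈ 1.4142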

This notebook shows how to use the Postgres vector database (PGEmbedding).

The PGEmbedding integration creates the pg_embedding extension for you, but you can also run the following Postgres query to add it yourself:

CREATE EXTENSION embedding;
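
If you prefer to create the extension from Python rather than a SQL client, a minimal sketch using psycopg2 (assuming DATABASE_URL points at your Postgres instance) could look like this:

import os
import psycopg2

# Connect using the same DATABASE_URL used later in this notebook.
conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    # IF NOT EXISTS makes this safe to re-run.
    cur.execute("CREATE EXTENSION IF NOT EXISTS embedding;")
conn.close()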
# Pip install necessary package
%pip install --upgrade --quiet langchain-openai langchain-community
%pip install --upgrade --quiet psycopg2-binary
%pip install --upgrade --quiet tiktoken

Add the OpenAI API key to the environment variables to use OpenAIEmbeddings.

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
OpenAI API Key:········
## Loading Environment Variables
from typing import List, Tuple
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import PGEmbedding
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
if "DATABASE_URL" not in os.environ:
os.environ["DATABASE_URL"] = getpass.getpass("Database Url:")
Database Url:········
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
connection_string = os.environ.get("DATABASE_URL")
collection_name = "state_of_the_union"
db = PGEmbedding.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=collection_name,
    connection_string=connection_string,
)

query = "What did the president say about Ketanji Brown Jackson"
docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)
for doc, score in docs_with_score:
print("-" * 80)
print("Score: ", score)
print(doc.page_content)
print("-" * 80)

Working with the vectorstore in Postgres

Uploading a vectorstore in PG

db = PGEmbedding.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=collection_name,
    connection_string=connection_string,
    pre_delete_collection=False,
)

Create HNSW Index

By default, the extension performs a sequential scan search, with 100% recall. You might consider creating an HNSW index for approximate nearest neighbor (ANN) search to speed up similarity_search_with_score execution time. To create the HNSW index on your vector column, use the create_hnsw_index function:

PGEmbedding.create_hnsw_index(
    max_elements=10000, dims=1536, m=8, ef_construction=16, ef_search=16
)

The function above is equivalent to running the following SQL query:

CREATE INDEX ON vectors USING hnsw(vec) WITH (maxelements=10000, dims=1536, m=8, efconstruction=16, efsearch=16);

The HNSW index options used in the statement above include:

  • maxelements: Defines the maximum number of elements indexed. This is a required parameter. The example shown above uses a value of 10000; a real-world deployment would often use a much larger value, such as 1000000. An "element" refers to a data point (a vector) in the dataset, which is represented as a node in the HNSW graph. Typically, you would set this option to a value able to accommodate the number of rows in your dataset.

  • dims: Defines the number of dimensions in your vector data. This is a required parameter. If you are storing data generated using OpenAI's text-embedding-ada-002 model, which produces 1536-dimensional embeddings, you would define a value of 1536, as in the example above.

  • m: Defines the maximum number of bi-directional links (also referred to as "edges") created for each node during graph construction.

The following additional index options are supported:

  • efConstruction: Defines the number of nearest neighbors considered during index construction. The default value is 32.

  • efSearch: Defines the number of nearest neighbors considered during index search. The default value is 32. For information about how you can configure these options to influence the HNSW algorithm, refer to Tuning the HNSW algorithm.
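
As a rough guide (general HNSW behavior, not specific to pg_embedding): larger m and ef_construction values build a denser graph, which improves recall at the cost of build time and memory, while a larger ef_search improves recall at the cost of query latency. An illustrative, recall-oriented variant of the call above (the values are assumptions to tune against your own data):

PGEmbedding.create_hnsw_index(
    max_elements=1000000,  # size the index for the expected number of rows
    dims=1536,             # must match your embedding model's output dimension
    m=16,                  # more links per node -> denser graph, better recall
    ef_construction=64,    # consider more candidates while building the index
    ef_search=64,          # consider more candidates at query time
)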

Retrieving a vectorstore in PG

store = PGEmbedding(
    connection_string=connection_string,
    embedding_function=embeddings,
    collection_name=collection_name,
)

retriever = store.as_retriever()
retriever
VectorStoreRetriever(vectorstore=<langchain_community.vectorstores.pghnsw.HNSWVectoreStore object at 0x121d3c8b0>, search_type='similarity', search_kwargs={})
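
The retriever can then be queried like any other LangChain retriever; a minimal usage sketch (invoke is the standard entry point on recent LangChain retrievers):

relevant_docs = retriever.invoke(
    "What did the president say about Ketanji Brown Jackson"
)
for doc in relevant_docs:
    print(doc.page_content[:100])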
db1 = PGEmbedding.from_existing_index(
    embedding=embeddings,
    collection_name=collection_name,
    pre_delete_collection=False,
    connection_string=connection_string,
)

query = "What did the president say about Ketanji Brown Jackson"
docs_with_score: List[Tuple[Document, float]] = db1.similarity_search_with_score(query)
for doc, score in docs_with_score:
print("-" * 80)
print("Score: ", score)
print(doc.page_content)
print("-" * 80)
