跳至主要內容

Azure Cosmos DB Mongo vCore

本筆記本將向您展示如何利用此整合的向量資料庫,將文件儲存在集合中、建立索引,並使用近似最近鄰演算法(例如 COS(餘弦距離)、L2(歐幾里得距離)和 IP(內積))執行向量搜尋查詢,以定位接近查詢向量的文件。

Azure Cosmos DB 是為 OpenAI 的 ChatGPT 服務提供支援的資料庫。 它提供單位數毫秒的回應時間、自動且即時的可擴展性,以及任何規模下的保證速度。

適用於 MongoDB vCore 的 Azure Cosmos DB (https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/) 為開發人員提供完全受管理的 MongoDB 相容資料庫服務,用於以熟悉的架構建置現代應用程式。您可以應用您的 MongoDB 經驗,並繼續使用您最喜歡的 MongoDB 驅動程式、SDK 和工具,方法是將您的應用程式指向 API for MongoDB vCore 帳戶的連接字串。

註冊 以獲得終身免費存取權,立即開始。

%pip install --upgrade --quiet  pymongo langchain-openai langchain-community
Note: you may need to restart the kernel to use updated packages.
import os

CONNECTION_STRING = "YOUR_CONNECTION_STRING"
INDEX_NAME = "izzy-test-index"
NAMESPACE = "izzy_test_db.izzy_test_collection"
DB_NAME, COLLECTION_NAME = NAMESPACE.split(".")

我們想要使用AzureOpenAIEmbeddings,因此我們需要設定我們的 Azure OpenAI API 金鑰以及其他環境變數。

# Set up the OpenAI Environment Variables

os.environ["AZURE_OPENAI_API_KEY"] = "YOUR_AZURE_OPENAI_API_KEY"
os.environ["AZURE_OPENAI_ENDPOINT"] = "YOUR_AZURE_OPENAI_ENDPOINT"
os.environ["AZURE_OPENAI_API_VERSION"] = "2023-05-15"
os.environ["OPENAI_EMBEDDINGS_MODEL_NAME"] = "text-embedding-ada-002" # the model name

現在,我們需要將文件載入到集合中,建立索引,然後針對索引執行我們的查詢以檢索相符的項目。

如果您對某些參數有疑問,請參閱文件

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores.azure_cosmos_db import (
AzureCosmosDBVectorSearch,
CosmosDBSimilarityType,
CosmosDBVectorSearchType,
)
from langchain_openai import AzureOpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

SOURCE_FILE_NAME = "../../how_to/state_of_the_union.txt"

loader = TextLoader(SOURCE_FILE_NAME)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# OpenAI Settings
model_deployment = os.getenv(
"OPENAI_EMBEDDINGS_DEPLOYMENT", "smart-agent-embedding-ada"
)
model_name = os.getenv("OPENAI_EMBEDDINGS_MODEL_NAME", "text-embedding-ada-002")


openai_embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
model=model_name, chunk_size=1
)
docs[0]
Document(metadata={'source': '../../how_to/state_of_the_union.txt'}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.')
from pymongo import MongoClient

# INDEX_NAME = "izzy-test-index-2"
# NAMESPACE = "izzy_test_db.izzy_test_collection"
# DB_NAME, COLLECTION_NAME = NAMESPACE.split(".")

client: MongoClient = MongoClient(CONNECTION_STRING)
collection = client[DB_NAME][COLLECTION_NAME]

model_deployment = os.getenv(
"OPENAI_EMBEDDINGS_DEPLOYMENT", "smart-agent-embedding-ada"
)
model_name = os.getenv("OPENAI_EMBEDDINGS_MODEL_NAME", "text-embedding-ada-002")

vectorstore = AzureCosmosDBVectorSearch.from_documents(
docs,
openai_embeddings,
collection=collection,
index_name=INDEX_NAME,
)

# Read more about these variables in detail here. https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/vector-search
num_lists = 100
dimensions = 1536
similarity_algorithm = CosmosDBSimilarityType.COS
kind = CosmosDBVectorSearchType.VECTOR_IVF
m = 16
ef_construction = 64
ef_search = 40
score_threshold = 0.1

vectorstore.create_index(
num_lists, dimensions, similarity_algorithm, kind, m, ef_construction
)

"""
# DiskANN vectorstore
maxDegree = 40
dimensions = 1536
similarity_algorithm = CosmosDBSimilarityType.COS
kind = CosmosDBVectorSearchType.VECTOR_DISKANN
lBuild = 20

vectorstore.create_index(
dimensions=dimensions,
similarity=similarity_algorithm,
kind=kind ,
max_degree=maxDegree,
l_build=lBuild,
)

# -----------------------------------------------------------

# HNSW vectorstore
dimensions = 1536
similarity_algorithm = CosmosDBSimilarityType.COS
kind = CosmosDBVectorSearchType.VECTOR_HNSW
m = 16
ef_construction = 64

vectorstore.create_index(
dimensions=dimensions,
similarity=similarity_algorithm,
kind=kind ,
m=m,
ef_construction=ef_construction,
)
"""
'\n# DiskANN vectorstore\nmaxDegree = 40\ndimensions = 1536\nsimilarity_algorithm = CosmosDBSimilarityType.COS\nkind = CosmosDBVectorSearchType.VECTOR_DISKANN\nlBuild = 20\n\nvectorstore.create_index(\n            dimensions=dimensions,\n            similarity=similarity_algorithm,\n            kind=kind ,\n            max_degree=maxDegree,\n            l_build=lBuild,\n        )\n\n# -----------------------------------------------------------\n\n# HNSW vectorstore\ndimensions = 1536\nsimilarity_algorithm = CosmosDBSimilarityType.COS\nkind = CosmosDBVectorSearchType.VECTOR_HNSW\nm = 16\nef_construction = 64\n\nvectorstore.create_index(\n            dimensions=dimensions,\n            similarity=similarity_algorithm,\n            kind=kind ,\n            m=m,\n            ef_construction=ef_construction,\n        )\n'
# perform a similarity search between the embedding of the query and the embeddings of the documents
query = "What did the president say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

一旦文件被載入且索引已建立,您現在可以直接實例化向量儲存,並針對索引執行查詢。

vectorstore = AzureCosmosDBVectorSearch.from_connection_string(
CONNECTION_STRING, NAMESPACE, openai_embeddings, index_name=INDEX_NAME
)

# perform a similarity search between a query and the ingested documents
query = "What did the president say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(query)

print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
vectorstore = AzureCosmosDBVectorSearch(
collection, openai_embeddings, index_name=INDEX_NAME
)

# perform a similarity search between a query and the ingested documents
query = "What did the president say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(query)

print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

篩選的向量搜尋(預覽)

適用於 MongoDB 的 Azure Cosmos DB 支援使用 $lt、$lte、$eq、$neq、$gte、$gt、$in、$nin 和 $regex 進行預先篩選。 若要使用此功能,請在 Azure 訂閱的「預覽功能」索引標籤中啟用「篩選向量搜尋」。 在此處了解更多關於預覽功能的資訊。

# create a filter index
vectorstore.create_filter_index(
property_to_filter="metadata.source", index_name="filter_index"
)
{'raw': {'defaultShard': {'numIndexesBefore': 3,
'numIndexesAfter': 4,
'createdCollectionAutomatically': False,
'ok': 1}},
'ok': 1}
query = "What did the president say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(
query, pre_filter={"metadata.source": {"$ne": "filter content"}}
)
len(docs)
4
docs = vectorstore.similarity_search(
query,
pre_filter={"metadata.source": {"$ne": "../../how_to/state_of_the_union.txt"}},
)
len(docs)
0

此頁面是否對您有所幫助?