騰訊雲向量資料庫
騰訊雲向量資料庫 是一款全託管、自主研發的企業級分散式資料庫服務,專為儲存、檢索和分析多維向量資料而設計。該資料庫支援多種索引類型和相似度計算方法。單一索引最多可支援 10 億向量規模,並可支援數百萬 QPS 和毫秒級查詢延遲。騰訊雲向量資料庫不僅可以為大型模型提供外部知識庫,以提高大型模型回應的準確性,還可以廣泛應用於推薦系統、NLP 服務、電腦視覺和智慧客服等 AI 領域。
此筆記本展示如何使用與騰訊向量資料庫相關的功能。
若要執行,您應該擁有一個資料庫執行個體。
基本用法
!pip3 install tcvectordb langchain-community
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.fake import FakeEmbeddings
from langchain_community.vectorstores import TencentVectorDB
from langchain_community.vectorstores.tencentvectordb import ConnectionParams
from langchain_text_splitters import CharacterTextSplitter
載入文件,將它們分割成塊。
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
我們支援兩種嵌入文件的方式
- 使用任何與 Langchain Embeddings 相容的 Embeddings 模型。
- 指定騰訊 VectorStore DB 的 Embedding 模型名稱,選項為
bge-base-zh
,維度:768m3e-base
,維度:768text2vec-large-chinese
,維度:1024e5-large-v2
,維度:1024multilingual-e5-base
,維度:768
以下程式碼顯示了兩種嵌入文件的方式,您可以透過註解另一種方式來選擇其中一種
## you can use a Langchain Embeddings model, like OpenAIEmbeddings:
# from langchain_community.embeddings.openai import OpenAIEmbeddings
#
# embeddings = OpenAIEmbeddings()
# t_vdb_embedding = None
## Or you can use a Tencent Embedding model, like `bge-base-zh`:
t_vdb_embedding = "bge-base-zh" # bge-base-zh is the default model
embeddings = None
API 參考:OpenAIEmbeddings
現在我們可以建立一個 TencentVectorDB 執行個體,您必須至少提供 embeddings
或 t_vdb_embedding
參數之一。如果同時提供兩者,將使用 embeddings
參數
conn_params = ConnectionParams(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
username="root",
timeout=20,
)
vector_db = TencentVectorDB.from_documents(
docs, embeddings, connection_params=conn_params, t_vdb_embedding=t_vdb_embedding
)
query = "What did the president say about Ketanji Brown Jackson"
docs = vector_db.similarity_search(query)
docs[0].page_content
'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'
vector_db = TencentVectorDB(embeddings, conn_params)
vector_db.add_texts(["Ankush went to Princeton"])
query = "Where did Ankush go to college?"
docs = vector_db.max_marginal_relevance_search(query)
docs[0].page_content
'Ankush went to Princeton'
Metadata 和篩選
騰訊 VectorDB 支援 metadata 和篩選。您可以將 metadata 新增至文件,並根據 metadata 篩選搜尋結果。
現在我們將建立一個具有 metadata 的新 TencentVectorDB 集合,並示範如何根據 metadata 篩選搜尋結果
from langchain_community.vectorstores.tencentvectordb import (
META_FIELD_TYPE_STRING,
META_FIELD_TYPE_UINT64,
ConnectionParams,
MetaField,
TencentVectorDB,
)
from langchain_core.documents import Document
meta_fields = [
MetaField(name="year", data_type=META_FIELD_TYPE_UINT64, index=True),
MetaField(name="rating", data_type=META_FIELD_TYPE_STRING, index=False),
MetaField(name="genre", data_type=META_FIELD_TYPE_STRING, index=True),
MetaField(name="director", data_type=META_FIELD_TYPE_STRING, index=True),
]
docs = [
Document(
page_content="The Shawshank Redemption is a 1994 American drama film written and directed by Frank Darabont.",
metadata={
"year": 1994,
"rating": "9.3",
"genre": "drama",
"director": "Frank Darabont",
},
),
Document(
page_content="The Godfather is a 1972 American crime film directed by Francis Ford Coppola.",
metadata={
"year": 1972,
"rating": "9.2",
"genre": "crime",
"director": "Francis Ford Coppola",
},
),
Document(
page_content="The Dark Knight is a 2008 superhero film directed by Christopher Nolan.",
metadata={
"year": 2008,
"rating": "9.0",
"genre": "superhero",
"director": "Christopher Nolan",
},
),
Document(
page_content="Inception is a 2010 science fiction action film written and directed by Christopher Nolan.",
metadata={
"year": 2010,
"rating": "8.8",
"genre": "science fiction",
"director": "Christopher Nolan",
},
),
]
vector_db = TencentVectorDB.from_documents(
docs,
None,
connection_params=ConnectionParams(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
username="root",
timeout=20,
),
collection_name="movies",
meta_fields=meta_fields,
)
query = "film about dream by Christopher Nolan"
# you can use the tencentvectordb filtering syntax with the `expr` parameter:
result = vector_db.similarity_search(query, expr='director="Christopher Nolan"')
# you can either use the langchain filtering syntax with the `filter` parameter:
# result = vector_db.similarity_search(query, filter='eq("director", "Christopher Nolan")')
result
[Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
Document(page_content='Inception is a 2010 science fiction action film written and directed by Christopher Nolan.', metadata={'year': 2010, 'rating': '8.8', 'genre': 'science fiction', 'director': 'Christopher Nolan'})]