騰訊雲 VectorDB
騰訊雲 VectorDB 是一款全託管、自主研發的企業級分散式資料庫服務,專為儲存、檢索和分析多維向量資料而設計。 該資料庫支援多種索引類型和相似性計算方法。 單個索引最多可支援 10 億個向量規模,並可支援數百萬 QPS 和毫秒級查詢延遲。 騰訊雲向量資料庫不僅可以為大型模型提供外部知識庫,以提高大型模型的回應準確性,還可以廣泛應用於推薦系統、NLP 服務、電腦視覺和智慧客服等人工智慧領域。
此筆記本展示如何使用與騰訊向量資料庫相關的功能。
要執行此程式碼,您應該擁有一個資料庫執行個體。
基本用法
!pip3 install tcvectordb langchain-community
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.fake import FakeEmbeddings
from langchain_community.vectorstores import TencentVectorDB
from langchain_community.vectorstores.tencentvectordb import ConnectionParams
from langchain_text_splitters import CharacterTextSplitter
載入文件,將它們分割成區塊。
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
我們支援兩種嵌入文件的方式
- 使用任何與 Langchain Embeddings 相容的嵌入模型。
- 指定騰訊 VectorStore DB 的嵌入模型名稱,選項包括
bge-base-zh
,維度:768m3e-base
,維度:768text2vec-large-chinese
,維度:1024e5-large-v2
,維度:1024multilingual-e5-base
,維度:768
以下程式碼展示了嵌入文件的兩種方式,您可以透過註解掉其中一種來選擇一種
## you can use a Langchain Embeddings model, like OpenAIEmbeddings:
# from langchain_community.embeddings.openai import OpenAIEmbeddings
#
# embeddings = OpenAIEmbeddings()
# t_vdb_embedding = None
## Or you can use a Tencent Embedding model, like `bge-base-zh`:
t_vdb_embedding = "bge-base-zh" # bge-base-zh is the default model
embeddings = None
API 參考:OpenAIEmbeddings
現在我們可以建立一個 TencentVectorDB 實例,您必須提供至少一個 embeddings
或 t_vdb_embedding
參數。 如果同時提供兩者,將使用 embeddings
參數
conn_params = ConnectionParams(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
username="root",
timeout=20,
)
vector_db = TencentVectorDB.from_documents(
docs, embeddings, connection_params=conn_params, t_vdb_embedding=t_vdb_embedding
)
query = "What did the president say about Ketanji Brown Jackson"
docs = vector_db.similarity_search(query)
docs[0].page_content
'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'
vector_db = TencentVectorDB(embeddings, conn_params)
vector_db.add_texts(["Ankush went to Princeton"])
query = "Where did Ankush go to college?"
docs = vector_db.max_marginal_relevance_search(query)
docs[0].page_content
'Ankush went to Princeton'
中繼資料和篩選
騰訊 VectorDB 支援中繼資料和篩選。 您可以將中繼資料新增至文件,並根據中繼資料篩選搜尋結果。
現在我們將建立一個帶有中繼資料的新 TencentVectorDB 集合,並示範如何根據中繼資料篩選搜尋結果
from langchain_community.vectorstores.tencentvectordb import (
META_FIELD_TYPE_STRING,
META_FIELD_TYPE_UINT64,
ConnectionParams,
MetaField,
TencentVectorDB,
)
from langchain_core.documents import Document
meta_fields = [
MetaField(name="year", data_type=META_FIELD_TYPE_UINT64, index=True),
MetaField(name="rating", data_type=META_FIELD_TYPE_STRING, index=False),
MetaField(name="genre", data_type=META_FIELD_TYPE_STRING, index=True),
MetaField(name="director", data_type=META_FIELD_TYPE_STRING, index=True),
]
docs = [
Document(
page_content="The Shawshank Redemption is a 1994 American drama film written and directed by Frank Darabont.",
metadata={
"year": 1994,
"rating": "9.3",
"genre": "drama",
"director": "Frank Darabont",
},
),
Document(
page_content="The Godfather is a 1972 American crime film directed by Francis Ford Coppola.",
metadata={
"year": 1972,
"rating": "9.2",
"genre": "crime",
"director": "Francis Ford Coppola",
},
),
Document(
page_content="The Dark Knight is a 2008 superhero film directed by Christopher Nolan.",
metadata={
"year": 2008,
"rating": "9.0",
"genre": "superhero",
"director": "Christopher Nolan",
},
),
Document(
page_content="Inception is a 2010 science fiction action film written and directed by Christopher Nolan.",
metadata={
"year": 2010,
"rating": "8.8",
"genre": "science fiction",
"director": "Christopher Nolan",
},
),
]
vector_db = TencentVectorDB.from_documents(
docs,
None,
connection_params=ConnectionParams(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
username="root",
timeout=20,
),
collection_name="movies",
meta_fields=meta_fields,
)
query = "film about dream by Christopher Nolan"
# you can use the tencentvectordb filtering syntax with the `expr` parameter:
result = vector_db.similarity_search(query, expr='director="Christopher Nolan"')
# you can either use the langchain filtering syntax with the `filter` parameter:
# result = vector_db.similarity_search(query, filter='eq("director", "Christopher Nolan")')
result
[Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
Document(page_content='Inception is a 2010 science fiction action film written and directed by Christopher Nolan.', metadata={'year': 2010, 'rating': '8.8', 'genre': 'science fiction', 'director': 'Christopher Nolan'})]