
Google Vertex AI Vector Search

This notebook shows how to use functionality related to the Google Cloud Vertex AI Vector Search vector database.

Google Vertex AI Vector Search, formerly known as Vertex AI Matching Engine, provides the industry's leading high-scale, low-latency vector database. These vector databases are commonly referred to as vector similarity-matching or approximate nearest neighbor (ANN) services.

Note: The LangChain API expects an endpoint and a deployed index to already exist. Creating an index can take up to one hour.

To see how to create an index, refer to the section Create an index and deploy it to an endpoint.
If you already have a deployed index, skip to Create vector store from texts.
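If you are starting from a fresh environment, install the Vertex AI SDK and the LangChain integration first (package names assumed current; pin versions as needed):

!pip install --upgrade google-cloud-aiplatform langchain-google-vertexai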

Create an index and deploy it to an endpoint

  • This section demonstrates how to create a new index and deploy it to an endpoint
# TODO : Set values as per your requirements
# Project and Storage Constants
PROJECT_ID = "<my_project_id>"
REGION = "<my_region>"
BUCKET = "<my_gcs_bucket>"
BUCKET_URI = f"gs://{BUCKET}"

# The number of dimensions for the textembedding-gecko@003 embedding model is 768
# If a different embedding model is used, this value will likely need to change.
DIMENSIONS = 768

# Index Constants
DISPLAY_NAME = "<my_matching_engine_index_id>"
DEPLOYED_INDEX_ID = "<my_matching_engine_endpoint_id>"
# Create a bucket.
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Use VertexAIEmbeddings as the embedding model

from google.cloud import aiplatform
from langchain_google_vertexai import VertexAIEmbeddings
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)
embedding_model = VertexAIEmbeddings(model_name="textembedding-gecko@003")
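As a quick sanity check, you can embed a single string and confirm the vector length matches the DIMENSIONS value set above (a minimal sketch using LangChain's standard embed_query method):

# Expecting 768 dimensions for textembedding-gecko@003
print(len(embedding_model.embed_query("hello world")))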

Create an empty index

Note: When creating an index, you must specify an "index_update_method" of either "BATCH_UPDATE" or "STREAM_UPDATE".

A batch index is for when you want to update your index in batches, with data that has been stored over a period of time, such as systems that are processed weekly or monthly. A streaming index is for when you want the index updated as new data is added to your datastore, for instance, if you run a bookstore and want to show new inventory online as soon as possible. Which type you choose is important, since the setup and requirements are different.

See the official documentation for more details on configuring your index.

# NOTE : This operation can take up to 30 seconds
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    index_update_method="STREAM_UPDATE",  # allowed values: BATCH_UPDATE, STREAM_UPDATE
)
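For contrast, a batch-updated index is created with the same call, changing only the update method (a sketch; my_batch_index and the display-name suffix are illustrative, not part of the original walkthrough):

my_batch_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=f"{DISPLAY_NAME}-batch",  # illustrative name
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    index_update_method="BATCH_UPDATE",
)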

Create an endpoint

# Create an endpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"{DISPLAY_NAME}-endpoint", public_endpoint_enabled=True
)

Deploy the index to the endpoint

# NOTE : This operation can take up to 20 minutes
my_index_endpoint = my_index_endpoint.deploy_index(
    index=my_index, deployed_index_id=DEPLOYED_INDEX_ID
)

my_index_endpoint.deployed_indexes
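To confirm the deployment, you can also inspect the endpoint resource itself (a quick check; public_endpoint_domain_name assumes the public endpoint enabled above):

print(my_index_endpoint.resource_name)
print(my_index_endpoint.public_endpoint_domain_name)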

Create vector store from texts

Note: If you have an existing index and endpoint, you can load them with the following code:

# TODO : replace 1234567890123456789 with your actual index ID
my_index = aiplatform.MatchingEngineIndex("1234567890123456789")

# TODO : replace 1234567890123456789 with your actual endpoint ID
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint("1234567890123456789")
from langchain_google_vertexai import (
    VectorSearchVectorStore,
    VectorSearchVectorStoreDatastore,
)


Create a simple vector store (without filters)

# Input texts
texts = [
    "The cat sat on",
    "the mat.",
    "I like to",
    "eat pizza for",
    "dinner.",
    "The sun sets",
    "in the west.",
]

# Create a Vector Store
vector_store = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    gcs_bucket_name=BUCKET,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
    stream_update=True,
)

# Add vectors and mapped text chunks to your vector store
vector_store.add_texts(texts=texts)

Optional: You can also create the vectors and store the chunks in a Datastore.

# NOTE : This operation can take up to 20 minutes
vector_store = VectorSearchVectorStoreDatastore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
    stream_update=True,
)

vector_store.add_texts(texts=texts, is_complete_overwrite=True)
# Try running a similarity search
vector_store.similarity_search("pizza")
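If you also want the raw scores, the store exposes the standard similarity_search_with_score method (a sketch; how to interpret the score depends on the distance measure chosen for the index):

# Returns (Document, score) pairs
for doc, score in vector_store.similarity_search_with_score("pizza", k=2):
    print(score, doc.page_content)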

Create vector store with metadata filters

# Input text with metadata
record_data = [
    {
        "description": "A versatile pair of dark-wash denim jeans. "
        "Made from durable cotton with a classic straight-leg cut, these jeans"
        " transition easily from casual days to dressier occasions.",
        "price": 65.00,
        "color": "blue",
        "season": ["fall", "winter", "spring"],
    },
    {
        "description": "A lightweight linen button-down shirt in a crisp white."
        " Perfect for keeping cool with breathable fabric and a relaxed fit.",
        "price": 34.99,
        "color": "white",
        "season": ["summer", "spring"],
    },
    {
        "description": "A soft, chunky knit sweater in a vibrant forest green. "
        "The oversized fit and cozy wool blend make this ideal for staying warm "
        "when the temperature drops.",
        "price": 89.99,
        "color": "green",
        "season": ["fall", "winter"],
    },
    {
        "description": "A classic crewneck t-shirt in a soft, heathered blue. "
        "Made from comfortable cotton jersey, this t-shirt is a wardrobe essential "
        "that works for every season.",
        "price": 19.99,
        "color": "blue",
        "season": ["fall", "winter", "summer", "spring"],
    },
    {
        "description": "A flowing midi-skirt in a delicate floral print. "
        "Lightweight and airy, this skirt adds a touch of feminine style "
        "to warmer days.",
        "price": 45.00,
        "color": "white",
        "season": ["spring", "summer"],
    },
]
# Parse and prepare input data

texts = []
metadatas = []
for record in record_data:
    record = record.copy()
    page_content = record.pop("description")
    texts.append(page_content)
    if isinstance(page_content, str):
        metadata = {**record}
        metadatas.append(metadata)
# Inspect metadatas
metadatas
# NOTE : This operation can take more than 20 minutes
vector_store = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    gcs_bucket_name=BUCKET,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
)

vector_store.add_texts(texts=texts, metadatas=metadatas, is_complete_overwrite=True)
from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (
    Namespace,
    NumericNamespace,
)
# Try running a simple similarity search

# Below code should return 5 results
vector_store.similarity_search("shirt", k=5)
# Try running a similarity search with text filter
filters = [Namespace(name="season", allow_tokens=["spring"])]

# Below code should return 4 results now
vector_store.similarity_search("shirt", k=5, filter=filters)
# Try running a similarity search with combination of text and numeric filter
filters = [Namespace(name="season", allow_tokens=["spring"])]
numeric_filters = [NumericNamespace(name="price", value_float=40.0, op="LESS")]

# Below code should return 2 results now
vector_store.similarity_search(
    "shirt", k=5, filter=filters, numeric_filter=numeric_filters
)
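Namespace also accepts deny_tokens for exclusions; a sketch assuming the same metadata schema as above:

# Spring items that are not blue
filters = [
    Namespace(name="season", allow_tokens=["spring"]),
    Namespace(name="color", deny_tokens=["blue"]),
]
vector_store.similarity_search("shirt", k=5, filter=filters)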

Use the vector store as a retriever

# Initialize the vector_store as retriever
retriever = vector_store.as_retriever()
# perform simple similarity search on retriever
retriever.invoke("What are my options in breathable fabric?")
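Search parameters can also be passed when the retriever is created, instead of mutating search_kwargs afterwards (standard LangChain API):

retriever = vector_store.as_retriever(search_kwargs={"k": 3})
retriever.invoke("What are my options in breathable fabric?")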
# Try running a similarity search with text filter
filters = [Namespace(name="season", allow_tokens=["spring"])]

retriever.search_kwargs = {"filter": filters}

# perform similarity search with filters on retriever
retriever.invoke("What are my options in breathable fabric?")
# Try running a similarity search with combination of text and numeric filter
filters = [Namespace(name="season", allow_tokens=["spring"])]
numeric_filters = [NumericNamespace(name="price", value_float=40.0, op="LESS")]


retriever.search_kwargs = {"filter": filters, "numeric_filter": numeric_filters}

retriever.invoke("What are my options in breathable fabric?")

Use filters with the retriever in a question-answering chain

from langchain_google_vertexai import VertexAI

llm = VertexAI(model_name="gemini-pro")
from langchain.chains import RetrievalQA

filters = [Namespace(name="season", allow_tokens=["spring"])]
numeric_filters = [NumericNamespace(name="price", value_float=40.0, op="LESS")]

retriever.search_kwargs = {"k": 2, "filter": filters, "numeric_filter": numeric_filters}

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

question = "What are my options in breathable fabric?"
response = retrieval_qa({"query": question})
print(f"{response['result']}")
print("REFERENCES")
print(f"{response['source_documents']}")
API Reference: RetrievalQA
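To see which records grounded the answer, you can iterate over the returned source documents:

for doc in response["source_documents"]:
    print(doc.metadata)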

Read, chunk, vectorize, and index a PDF

!pip install pypdf
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = PyPDFLoader("https://arxiv.org/pdf/1706.03762.pdf")
pages = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    # Split pages into 1000-character chunks with a small overlap
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
doc_splits = text_splitter.split_documents(pages)
texts = [doc.page_content for doc in doc_splits]
metadatas = [doc.metadata for doc in doc_splits]
texts[0]
# Inspect the metadata of the first chunk
metadatas[0]
vector_store = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    gcs_bucket_name=BUCKET,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
)

vector_store.add_texts(texts=texts, metadatas=metadatas, is_complete_overwrite=True)
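Once the chunks are indexed, you can query the paper directly (a sketch; results depend on what your index currently contains):

results = vector_store.similarity_search("What is multi-head attention?", k=3)
for doc in results:
    print(doc.page_content[:200])
    print("---")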
