Milvus Hybrid Search Retriever

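Milvus is an open-source vector database built to power embedding similarity search and AI applications. It makes unstructured data search more accessible and provides a consistent user experience regardless of the deployment environment.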

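This guide helps you get started with the Milvus Hybrid Search retriever, which combines the strengths of dense and sparse vector search. For detailed documentation of all MilvusCollectionHybridSearchRetriever features and configurations, head to the API reference.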

See also the Milvus Multi-Vector Search documentation.

Integration details

Retriever	Self-hosted	Cloud offering	Package
MilvusCollectionHybridSearchRetriever	✅	❌	langchain_milvus

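Setup

To get automated tracing of individual queries, you can also set your LangSmith API key by uncommenting the lines below:

# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"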

Installation

This retriever lives in the langchain-milvus package. This guide requires the following dependencies:

%pip install --upgrade --quiet "pymilvus[model]" langchain-milvus langchain-openai

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_milvus.retrievers import MilvusCollectionHybridSearchRetriever
from langchain_milvus.utils.sparse import BM25SparseEmbedding
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    WeightedRanker,
    connections,
)

Start the Milvus service

Refer to the Milvus documentation to start the Milvus service.

Once Milvus is running, specify your Milvus connection URI.

CONNECTION_URI = "http://127.0.0.1:19530"

Prepare the OpenAI API key

Refer to the OpenAI documentation to obtain your OpenAI API key, then set it as an environment variable.

export OPENAI_API_KEY=<your_api_key>
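Before running the rest of the guide, it can help to confirm from Python that the variable is actually visible to the process. The following is a minimal sketch; `require_env` is a hypothetical helper introduced here for illustration and is not part of any library used in this guide.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Check for the key without printing its value.
try:
    require_env("OPENAI_API_KEY")
    print("OPENAI_API_KEY is set")
except RuntimeError as err:
    print(err)
```

Failing early with a named variable is usually easier to debug than an authentication error raised later by the embedding call.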

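Prepare dense and sparse embedding functions

Let's make up 10 fictional novel descriptions. In production, this would typically be a much larger body of text.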

texts = [
    "In 'The Whispering Walls' by Ava Moreno, a young journalist named Sophia uncovers a decades-old conspiracy hidden within the crumbling walls of an ancient mansion, where the whispers of the past threaten to destroy her own sanity.",
    "In 'The Last Refuge' by Ethan Blackwood, a group of survivors must band together to escape a post-apocalyptic wasteland, where the last remnants of humanity cling to life in a desperate bid for survival.",
    "In 'The Memory Thief' by Lila Rose, a charismatic thief with the ability to steal and manipulate memories is hired by a mysterious client to pull off a daring heist, but soon finds themselves trapped in a web of deceit and betrayal.",
    "In 'The City of Echoes' by Julian Saint Clair, a brilliant detective must navigate a labyrinthine metropolis where time is currency, and the rich can live forever, but at a terrible cost to the poor.",
    "In 'The Starlight Serenade' by Ruby Flynn, a shy astronomer discovers a mysterious melody emanating from a distant star, which leads her on a journey to uncover the secrets of the universe and her own heart.",
    "In 'The Shadow Weaver' by Piper Redding, a young orphan discovers she has the ability to weave powerful illusions, but soon finds herself at the center of a deadly game of cat and mouse between rival factions vying for control of the mystical arts.",
    "In 'The Lost Expedition' by Caspian Grey, a team of explorers ventures into the heart of the Amazon rainforest in search of a lost city, but soon finds themselves hunted by a ruthless treasure hunter and the treacherous jungle itself.",
    "In 'The Clockwork Kingdom' by Augusta Wynter, a brilliant inventor discovers a hidden world of clockwork machines and ancient magic, where a rebellion is brewing against the tyrannical ruler of the land.",
    "In 'The Phantom Pilgrim' by Rowan Welles, a charismatic smuggler is hired by a mysterious organization to transport a valuable artifact across a war-torn continent, but soon finds themselves pursued by deadly assassins and rival factions.",
    "In 'The Dreamwalker's Journey' by Lyra Snow, a young dreamwalker discovers she has the ability to enter people's dreams, but soon finds herself trapped in a surreal world of nightmares and illusions, where the boundaries between reality and fantasy blur.",
]

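We will use OpenAI embeddings to generate the dense vectors and the BM25 algorithm to generate the sparse vectors.

Initialize the dense embedding function and get its dimension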

dense_embedding_func = OpenAIEmbeddings()
dense_dim = len(dense_embedding_func.embed_query(texts[1]))
dense_dim

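Initialize the sparse embedding function.

Note that the sparse embedding output is a sparse vector: a mapping from the indices of the keywords in the input text to their weights.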

sparse_embedding_func = BM25SparseEmbedding(corpus=texts)
sparse_embedding_func.embed_query(texts[1])

{0: 0.4270424944042204,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.2237754316221157,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331,
1.845826690498331}
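To make the index-to-weight mapping concrete, here is a simplified sketch that builds a sparse query vector from a toy corpus using a BM25-style inverse document frequency. It is an illustration only and does not reproduce the internals of BM25SparseEmbedding; the whitespace tokenizer, the helper names, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer, used only for this illustration.
    return text.lower().split()


def build_vocab(corpus):
    # Assign each distinct token an integer index in order of first appearance.
    vocab = {}
    for doc in corpus:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def idf_by_index(corpus, vocab):
    # BM25-style IDF: tokens found in fewer documents get larger weights.
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {
        vocab[token]: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for token, df in doc_freq.items()
    }


def sparse_query_vector(text, vocab, idf):
    # Keep only tokens known to the vocabulary; map index -> weight.
    indices = sorted({vocab[t] for t in tokenize(text) if t in vocab})
    return {i: idf[i] for i in indices}


toy_corpus = ["the lost city", "the hidden city", "the clockwork kingdom"]
vocab = build_vocab(toy_corpus)
idf = idf_by_index(toy_corpus, vocab)
vector = sparse_query_vector("the lost kingdom", vocab, idf)
print(vector)
```

In this toy setup the token "the" appears in every document and therefore receives a much smaller weight than "lost" or "kingdom", which each appear in only one.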

Create the Milvus collection and load data

Connect to Milvus using the connection URI

connections.connect(uri=CONNECTION_URI)

Define the field names and their data types

pk_field = "doc_id"
dense_field = "dense_vector"
sparse_field = "sparse_vector"
text_field = "text"
fields = [
    FieldSchema(
        name=pk_field,
        dtype=DataType.VARCHAR,
        is_primary=True,
        auto_id=True,
        max_length=100,
    ),
    FieldSchema(name=dense_field, dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
    FieldSchema(name=sparse_field, dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name=text_field, dtype=DataType.VARCHAR, max_length=65_535),
]

Create a collection with the defined schema

schema = CollectionSchema(fields=fields, enable_dynamic_field=False)
collection = Collection(
    name="IntroductionToTheNovels", schema=schema, consistency_level="Strong"
)

Define indexes for the dense and sparse vector fields

dense_index = {"index_type": "FLAT", "metric_type": "IP"}
collection.create_index(dense_field, dense_index)
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
collection.create_index(sparse_field, sparse_index)
collection.flush()
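Both indexes use the inner product (IP) metric. For sparse vectors stored as index-to-weight mappings, the inner product only involves the indices the two vectors share. The sketch below shows that computation in plain Python; `sparse_inner_product` is a hypothetical helper written for illustration and is not how Milvus implements the metric internally.

```python
def sparse_inner_product(a, b):
    """Inner product of two sparse vectors given as {index: weight} dicts."""
    # Iterate over the smaller dict so the work scales with the shorter vector.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[index] for index, weight in a.items() if index in b)


query_vec = {0: 0.5, 3: 2.0}
doc_vec = {3: 1.5, 7: 1.0}
score = sparse_inner_product(query_vec, doc_vec)
print(score)  # only index 3 is shared: 2.0 * 1.5 = 3.0
```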

Insert the entities into the collection and load the collection

entities = []
for text in texts:
    entity = {
        dense_field: dense_embedding_func.embed_documents([text])[0],
        sparse_field: sparse_embedding_func.embed_documents([text])[0],
        text_field: text,
    }
    entities.append(entity)
collection.insert(entities)
collection.load()

Instantiation

Now we can instantiate the retriever, defining the search parameters for the sparse and dense fields:

sparse_search_params = {"metric_type": "IP"}
dense_search_params = {"metric_type": "IP", "params": {}}
retriever = MilvusCollectionHybridSearchRetriever(
    collection=collection,
    rerank=WeightedRanker(0.5, 0.5),
    anns_fields=[dense_field, sparse_field],
    field_embeddings=[dense_embedding_func, sparse_embedding_func],
    field_search_params=[dense_search_params, sparse_search_params],
    top_k=3,
    text_field=text_field,
)

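The retriever is given one dense and one sparse embedding function, one per vector field, so each query runs a hybrid search over both fields of the collection. WeightedRanker(0.5, 0.5) reranks the combined hits with equal weight on each field, and top_k=3 limits the result to the three highest-ranked documents.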
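As a rough mental model of weighted reranking, the sketch below merges two per-field score lists into one ranking. It is a simplified illustration, not the Milvus implementation: the min-max normalization is an assumption chosen to put both score lists on a comparable scale, Milvus may normalize scores differently, and `weighted_fusion` and the sample scores are hypothetical.

```python
def normalize(scores):
    # Min-max normalization (an assumption for this sketch).
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(dense_scores, sparse_scores, weights=(0.5, 0.5), top_k=3):
    """Combine two {doc_id: score} dicts into a single weighted ranking."""
    dense = normalize(dense_scores)
    sparse = normalize(sparse_scores)
    combined = {
        doc_id: weights[0] * dense.get(doc_id, 0.0)
        + weights[1] * sparse.get(doc_id, 0.0)
        for doc_id in set(dense) | set(sparse)
    }
    # Highest combined score first, truncated to top_k results.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


dense_hits = {"a": 0.9, "b": 0.8, "c": 0.1}
sparse_hits = {"b": 4.0, "c": 3.0, "d": 1.0}
ranked = weighted_fusion(dense_hits, sparse_hits)
print([doc_id for doc_id, _ in ranked])  # ['b', 'a', 'c']
```

Document "b" ranks first because it scores well in both fields, while "a" appears only in the dense results. Shifting the weights toward one field would favor semantic matches or keyword matches accordingly.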

Usage

retriever.invoke("What are the story about ventures?")

[Document(page_content="In 'The Lost Expedition' by Caspian Grey, a team of explorers ventures into the heart of the Amazon rainforest in search of a lost city, but soon finds themselves hunted by a ruthless treasure hunter and the treacherous jungle itself.", metadata={'doc_id': '449281835035545843'}),
 Document(page_content="In 'The Phantom Pilgrim' by Rowan Welles, a charismatic smuggler is hired by a mysterious organization to transport a valuable artifact across a war-torn continent, but soon finds themselves pursued by deadly assassins and rival factions.", metadata={'doc_id': '449281835035545845'}),
 Document(page_content="In 'The Dreamwalker's Journey' by Lyra Snow, a young dreamwalker discovers she has the ability to enter people's dreams, but soon finds herself trapped in a surreal world of nightmares and illusions, where the boundaries between reality and fantasy blur.", metadata={'doc_id': '449281835035545846'})]

Use within a chain

Initialize ChatOpenAI and define a prompt template

llm = ChatOpenAI()

PROMPT_TEMPLATE = """
Human: You are an AI assistant, and provides answers to questions by using fact based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.

<context>
{context}
</context>

<question>
{question}
</question>

Assistant:"""

prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)

Define a function for formatting the retrieved documents

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
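To see what the formatter produces without calling the retriever, it can be run on stand-in objects. In this sketch, `FakeDocument` is a hypothetical class that exposes only the `page_content` attribute that `format_docs` reads; it is not a LangChain type.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    # Hypothetical stand-in with the single attribute format_docs needs.
    page_content: str


def format_docs(docs):
    # Join document texts with a blank line between them.
    return "\n\n".join(doc.page_content for doc in docs)


docs = [FakeDocument("First passage."), FakeDocument("Second passage.")]
context = format_docs(docs)
print(context)
```

The blank line between passages keeps them visually separate inside the `<context>` section of the prompt.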

Define a chain using the retriever and the other components

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
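The data flow of this chain can be pictured as plain function calls: the question feeds both the retrieval branch and the pass-through branch, the two results fill the prompt, and the model output is returned as a string. The sketch below is a simplified mental model under that assumption, not how LangChain executes runnables; `run_rag` and all the stand-in functions are hypothetical.

```python
from typing import Callable, List


def run_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    format_docs: Callable[[List[str]], str],
    render_prompt: Callable[[str, str], str],
    call_llm: Callable[[str], str],
) -> str:
    # "context" branch: question -> retriever -> formatter.
    context = format_docs(retrieve(question))
    # "question" branch: the question is passed through unchanged.
    prompt_text = render_prompt(context, question)
    return call_llm(prompt_text)


# Stand-ins so the sketch runs without Milvus or OpenAI.
def fake_retrieve(question):
    return ["doc one", "doc two"]


def join_docs(docs):
    return "\n\n".join(docs)


def fake_prompt(context, question):
    return f"<context>\n{context}\n</context>\n<question>\n{question}\n</question>"


def echo_llm(prompt_text):
    return prompt_text.upper()


answer = run_rag("Who wrote it?", fake_retrieve, join_docs, fake_prompt, echo_llm)
print(answer)
```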

Run a query using the defined chain

rag_chain.invoke("What novels has Lila written and what are their contents?")

"Lila Rose has written 'The Memory Thief,' which follows a charismatic thief with the ability to steal and manipulate memories as they navigate a daring heist and a web of deceit and betrayal."

Drop the collection

collection.drop()

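API reference

For detailed documentation of all MilvusCollectionHybridSearchRetriever features and configurations, head to the API reference.

Related

Retriever conceptual guide
Retriever how-to guides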
