Milvus

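Milvus is a database that stores, indexes, and manages the massive embedding vectors generated by deep neural networks and other machine learning (ML) models.

This notebook shows how to use functionality related to the Milvus vector database.

Setup

To use this integration, install langchain-milvus with pip install -qU langchain-milvus.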

pip install -qU langchain_milvus

Note: you may need to restart the kernel to use updated packages.

Credentials

No credentials are needed to use the Milvus vector store.

Initialization

Select an embedding model

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

Milvus Lite

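The easiest way to prototype is to use Milvus Lite, where everything is stored in a local vector database file. Only the Flat index can be used with Milvus Lite.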

from langchain_milvus import Milvus

URI = "./milvus_example.db"

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
    index_params={"index_type": "FLAT", "metric_type": "L2"},
)

API Reference: Milvus
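To make the index settings above concrete: a FLAT index compares the query against every stored vector, and the "L2" metric ranks results by Euclidean distance. The sketch below imitates that behavior in plain Python on toy 2-D vectors. The names squared_l2, flat_l2_search, and toy_vectors are hypothetical and are not part of Milvus or langchain-milvus; the sketch reports squared distances, which is an assumption made for simplicity and does not change the ranking.

```python
from typing import Dict, List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every stored vector (no index shortcuts) and return the k nearest."""
    scored = [(vec_id, squared_l2(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means closer
    return scored[:k]


# Toy data standing in for stored embeddings (hypothetical values).
toy_vectors = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
nearest = flat_l2_search([0.9, 1.2], toy_vectors, k=2)
print(nearest)  # "b" is listed first, then "a"
```

Because every vector is scored, results are exact, but the cost grows linearly with the number of stored vectors, which is why the other index types become attractive for large collections.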

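Milvus Server

If you have a large amount of data (for example, more than a million vectors), we recommend setting up a more performant Milvus server on Docker or Kubernetes.

The Milvus server supports a variety of indexes. Choosing an index that fits your workload can significantly improve retrieval quality and speed.

As an illustration, consider Milvus Standalone. To start the Docker container, you can run the following commands: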

!curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh

!bash standalone_embed.sh start

Password:

Here we create a Milvus database:

from pymilvus import Collection, MilvusException, connections, db, utility

conn = connections.connect(host="127.0.0.1", port=19530)

# Check if the database exists
db_name = "milvus_demo"
try:
    existing_databases = db.list_database()
    if db_name in existing_databases:
        print(f"Database '{db_name}' already exists.")

        # Use the database context
        db.using_database(db_name)

        # Drop all collections in the database
        collections = utility.list_collections()
        for collection_name in collections:
            collection = Collection(name=collection_name)
            collection.drop()
            print(f"Collection '{collection_name}' has been dropped.")

        db.drop_database(db_name)
        print(f"Database '{db_name}' has been deleted.")
    else:
        print(f"Database '{db_name}' does not exist.")
        database = db.create_database(db_name)
        print(f"Database '{db_name}' created successfully.")
except MilvusException as e:
    print(f"An error occurred: {e}")

Database 'milvus_demo' does not exist.
Database 'milvus_demo' created successfully.

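Note the change in the URI below. Once the instance is initialized, navigate to http://127.0.0.1:9091/webui to view the local web UI.

Here is an example of how to create a vector store instance with the Milvus database service: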

from langchain_milvus import BM25BuiltInFunction, Milvus

URI = "http://127.0.0.1:19530"

vectorstore = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI, "token": "root:Milvus", "db_name": "milvus_demo"},
    index_params={"index_type": "FLAT", "metric_type": "L2"},
    consistency_level="Strong",
    drop_old=False,  # set to True if seeking to drop the collection with that name if it exists
)

API Reference: BM25BuiltInFunction | Milvus

If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud, respectively.
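The three deployment options differ only in what goes into connection_args. The sketch below collects them in one small helper so the differences are easy to compare. make_connection_args is a hypothetical helper, not part of langchain-milvus, and the Zilliz Cloud endpoint and API key are placeholders that you would replace with your own values. The local server address assumes a server listening on the default port without TLS.

```python
from typing import Dict, Optional


def make_connection_args(
    uri: str, token: Optional[str] = None, db_name: Optional[str] = None
) -> Dict[str, str]:
    """Build a connection_args dict, leaving out settings that were not given."""
    args = {"uri": uri}
    if token:
        args["token"] = token
    if db_name:
        args["db_name"] = db_name
    return args


# Milvus Lite: a local file path, no token.
lite_args = make_connection_args("./milvus_example.db")

# Self-hosted server (assumed: local, default port, no TLS).
server_args = make_connection_args(
    "http://127.0.0.1:19530", token="root:Milvus", db_name="milvus_demo"
)

# Zilliz Cloud: both values are placeholders, not real credentials.
cloud_args = make_connection_args(
    "https://<your-public-endpoint>", token="<your-api-key>"
)

print(lite_args, server_args, cloud_args, sep="\n")
```

Any of these dictionaries can then be passed as connection_args when constructing the Milvus vector store.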

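Compartmentalize the data with Milvus Collections

You can store unrelated documents in different collections within the same Milvus instance.

Here is how you can create a new collection: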

from langchain_core.documents import Document

vector_store_saved = Milvus.from_documents(
    [Document(page_content="foo!")],
    embeddings,
    collection_name="langchain_example",
    connection_args={"uri": URI},
)

API Reference: Document

And here is how you retrieve that stored collection:

vector_store_loaded = Milvus(
    embeddings,
    connection_args={"uri": URI},
    collection_name="langchain_example",
)

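Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting items.

Add items to vector store

We can add items to the vector store by using the add_documents function.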

from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

API Reference: Document

Delete items from vector store

vector_store.delete(ids=[uuids[-1]])

(insert count: 0, delete count: 1, upsert count: 0, timestamp: 0, success count: 0, err count: 0, cost: 0)

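Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

Query directly

Similarity search

A simple similarity search with filtering on metadata can be performed as follows: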

results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    expr='source == "tweet"',
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Building an exciting new project with LangChain - come check it out! [{'pk': '9905001c-a4a3-455e-ab94-72d0ed11b476', 'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'pk': '1206d237-ee3a-484f-baf2-b5ac38eeb314', 'source': 'tweet'}]
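The expr argument used above is a boolean filter expression written as a string. When filters come from application code, it can be convenient to build that string from a dictionary. The sketch below is one way to do so for equality and membership conditions only. build_expr is a hypothetical helper, not part of langchain-milvus, and it assumes that field names are trusted identifiers.

```python
import json
from typing import Dict, List, Union

FilterValue = Union[str, int, float, List[str]]


def build_expr(filters: Dict[str, FilterValue]) -> str:
    """Join equality and membership conditions with `and`."""
    clauses = []
    for field, value in filters.items():
        if isinstance(value, list):
            # Membership test, e.g. source in ["tweet", "news"]
            clauses.append(f"{field} in {json.dumps(value)}")
        else:
            # json.dumps wraps strings in double quotes and escapes them
            clauses.append(f"{field} == {json.dumps(value)}")
    return " and ".join(clauses)


print(build_expr({"source": "tweet"}))
# → source == "tweet"
print(build_expr({"source": ["tweet", "news"]}))
# → source in ["tweet", "news"]
```

The resulting string can be passed as expr to similarity_search in the same way as the hand-written filter above.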

Similarity search with score

You can also search with score:

results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=1, expr='source == "news"'
)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

* [SIM=21192.628906] bar [{'pk': '2', 'source': 'https://example.com'}]

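For a full list of the search options available when using the Milvus vector store, see the API reference.

Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.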

retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 1})
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})

[Document(metadata={'pk': 'eacc7256-d7fa-4036-b1f7-83d7a4bee0c5', 'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]
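The retriever above uses search_type="mmr", which stands for maximal marginal relevance: results are picked one at a time, trading relevance to the query against similarity to results already chosen. The sketch below shows the idea in plain Python with cosine similarity on toy vectors. It is an illustration of the general technique, not the langchain-milvus implementation; mmr, cosine, and the sample vectors are hypothetical, and the lambda_mult name simply mirrors common usage.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of k candidates chosen greedily by MMR."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def mmr_score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to anything already picked.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.11], [0.6, 0.8]]  # first two are near-duplicates

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # → [0, 1] (relevance only)
print(mmr(query, candidates, k=2, lambda_mult=0.3))  # → [0, 2] (favors diversity)
```

With lambda_mult close to 1 the selection behaves like a plain similarity search; lower values push near-duplicate results out of the top k.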

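Hybrid Search

The most common hybrid search scenario is dense + sparse hybrid search, where candidates are retrieved using both semantic vector similarity and exact keyword matching. The results from these methods are merged, reranked, and passed to an LLM to generate the final answer. This approach balances precision and semantic understanding, which makes it effective across a wide range of query scenarios.

Full-text search

Since Milvus 2.5, full-text search is natively supported through the Sparse-BM25 approach, which represents the BM25 algorithm as sparse vectors. Milvus accepts raw text as input and automatically converts it into sparse vectors stored in a specified field, so you do not need to generate sparse embeddings manually.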
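To illustrate what "BM25 as sparse vectors" can mean, the sketch below turns each document into a dictionary of term weights using the standard BM25 formula, then scores a query as a dot product against those dictionaries. This is a simplified, client-side illustration only: Milvus performs the equivalent work on the server, and its tokenization, parameters, and storage format may differ. The tokenize function, the k1 and b defaults, and the sample corpus are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_sparse_vectors(
    corpus: List[str], k1: float = 1.2, b: float = 0.75
) -> List[Dict[str, float]]:
    """Return one {term: BM25 weight} dictionary per document."""
    docs = [tokenize(text) for text in corpus]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        vec = {}
        for term, tf in Counter(doc).items():
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            vec[term] = idf * tf * (k1 + 1) / length_norm
        vectors.append(vec)
    return vectors


def score(query: str, doc_vector: Dict[str, float]) -> float:
    """Dot product of a binary query vector with a document's sparse vector."""
    return sum(doc_vector.get(term, 0.0) for term in set(tokenize(query)))


corpus = [
    "the stock market is down today",
    "robbers broke into the city bank",
    "the weather forecast for tomorrow is cloudy",
]
vectors = bm25_sparse_vectors(corpus)
scores = [score("bank robbers", vec) for vec in vectors]
print(scores)  # only the second document gets a non-zero score
```

Each vector holds entries only for the terms that occur in its document, which is what makes the representation sparse.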

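For full-text search, the Milvus VectorStore accepts a builtin_function parameter, through which you can pass in an instance of BM25BuiltInFunction. This differs from semantic search, which usually passes dense embeddings to the VectorStore.

Here is a simple example of hybrid search in Milvus, using OpenAI dense embeddings for semantic search and BM25 for full-text search: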

from langchain_milvus import BM25BuiltInFunction, Milvus
from langchain_openai import OpenAIEmbeddings

vectorstore = Milvus.from_documents(
    documents=documents,
    embedding=OpenAIEmbeddings(),
    builtin_function=BM25BuiltInFunction(),
    # `dense` is for OpenAI embeddings, `sparse` is the output field of BM25 function
    vector_field=["dense", "sparse"],
    connection_args={
        "uri": URI,
    },
    consistency_level="Strong",
    drop_old=True,
)

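API Reference: BM25BuiltInFunction | Milvus | OpenAIEmbeddings

When you use BM25BuiltInFunction, note that full-text search is available in Milvus Standalone and Milvus Distributed but not in Milvus Lite, although it is on the roadmap for future inclusion. It will also be available in Zilliz Cloud (fully managed Milvus) soon. Please contact support@zilliz.com for more information.

In the code above, we define an instance of BM25BuiltInFunction and pass it to the Milvus object. BM25BuiltInFunction is a lightweight wrapper class for Function in Milvus. We can use it together with OpenAIEmbeddings to initialize a dense + sparse hybrid search Milvus vector store instance.

BM25BuiltInFunction does not require the client to pass a corpus or perform any training. Everything is processed automatically on the Milvus server side, so users do not need to manage any vocabulary or corpus. In addition, users can customize the analyzer to implement custom text processing in BM25.

Rerank the candidates

After the first stage of retrieval, we need to rerank the candidates to get a better result. You can refer to Reranking for more information.

Here is an example of weighted reranking: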

query = "What are the novels Lila has written and what are their contents?"

vectorstore.similarity_search(
    query, k=1, ranker_type="weighted", ranker_params={"weights": [0.6, 0.4]}
)
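The weights [0.6, 0.4] above give the dense results 60% of the influence and the sparse results 40%. The sketch below shows the general idea of weighted score fusion in plain Python. It is a simplified illustration, not the ranker Milvus uses: the min-max normalization, the function names, and the sample scores are all assumptions, and the sketch treats higher scores as better (distance-style scores such as L2 would need to be inverted first).

```python
from typing import Dict, List, Sequence, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to the range [0, 1] so the two lists are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_rerank(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    weights: Sequence[float] = (0.6, 0.4),
) -> List[Tuple[str, float]]:
    """Fuse two candidate lists; a missing score counts as 0."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    fused = {
        doc_id: weights[0] * dense_n.get(doc_id, 0.0)
        + weights[1] * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical first-stage scores (higher is better in both lists).
dense_scores = {"doc_a": 0.9, "doc_b": 0.7, "doc_c": 0.1}
sparse_scores = {"doc_b": 12.0, "doc_c": 9.0, "doc_d": 2.0}

ranking = weighted_rerank(dense_scores, sparse_scores)
print([doc_id for doc_id, _ in ranking])
# → ['doc_b', 'doc_a', 'doc_c', 'doc_d']
```

Here doc_b wins because it scores well in both lists, even though doc_a has the best dense score on its own.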

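For more information about full-text search and hybrid search, please refer to Using Full-Text Search with LangChain and Milvus and Hybrid Retrieval with LangChain and Milvus.

Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:

Per-User Retrieval

When building a retrieval app, you often have to build it with multiple users in mind. This means that you may be storing data not just for one user but for many different users, and they should not be able to see each other's data.

Milvus recommends using partition_key to implement multi-tenancy. Here is an example:

The partition key feature is not available in Milvus Lite. To use it, you need to start a Milvus server, as described above.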

from langchain_core.documents import Document

docs = [
    Document(page_content="i worked at kensho", metadata={"namespace": "harrison"}),
    Document(page_content="i worked at facebook", metadata={"namespace": "ankush"}),
]
vectorstore = Milvus.from_documents(
    docs,
    embeddings,
    connection_args={"uri": URI},
    drop_old=True,
    partition_key_field="namespace",  # Use the "namespace" field as the partition key
)

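API Reference: Document

To conduct a search using the partition key, include either of the following in the boolean expression of the search request:

search_kwargs={"expr": '<partition_key> == "xxxx"'}

search_kwargs={"expr": '<partition_key> in ["xxx", "xxx"]'}

Replace <partition_key> with the name of the field that is designated as the partition key.

Milvus switches to a partition based on the specified partition key, filters entities according to the partition key, and searches among the filtered entities.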

# This will only get documents for Ankush
vectorstore.as_retriever(search_kwargs={"expr": 'namespace == "ankush"'}).invoke(
    "where did i work?"
)

[Document(page_content='i worked at facebook', metadata={'namespace': 'ankush'})]

# This will only get documents for Harrison
vectorstore.as_retriever(search_kwargs={"expr": 'namespace == "harrison"'}).invoke(
    "where did i work?"
)

[Document(page_content='i worked at kensho', metadata={'namespace': 'harrison'})]
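The partition-key behavior described above (route to a partition, filter by key, then search) can be pictured with a small in-memory model. The sketch below is a conceptual illustration only and does not reflect how Milvus is implemented internally: ToyPartitionedStore, partition_for, the SHA-256 hashing, and the partition count of 16 are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List

NUM_PARTITIONS = 16  # assumed value for this sketch


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key value to a partition using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class ToyPartitionedStore:
    """In-memory stand-in for a collection with a partition key field."""

    def __init__(self) -> None:
        self.partitions: Dict[int, List[dict]] = defaultdict(list)

    def add(self, text: str, namespace: str) -> None:
        row = {"text": text, "namespace": namespace}
        self.partitions[partition_for(namespace)].append(row)

    def search(self, namespace: str) -> List[str]:
        # Only one partition is scanned. The equality check is still needed
        # because several key values can hash to the same partition.
        rows = self.partitions.get(partition_for(namespace), [])
        return [row["text"] for row in rows if row["namespace"] == namespace]


store = ToyPartitionedStore()
store.add("i worked at kensho", "harrison")
store.add("i worked at facebook", "ankush")

print(store.search("ankush"))    # → ['i worked at facebook']
print(store.search("harrison"))  # → ['i worked at kensho']
```

The point of the model is that a tenant's query touches only the partition its key maps to, and the key filter keeps tenants that share a partition from seeing each other's rows.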

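API reference

For detailed documentation of all Milvus vector store features and configurations, head to the API reference: https://langchain-python.dev.org.tw/api_reference/milvus/vectorstores/langchain_milvus.vectorstores.milvus.Milvus.html

Related

Vector store conceptual guide
Vector store how-to guides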
