Elasticsearch

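Elasticsearch is a distributed, RESTful search and analytics engine capable of performing both vector and lexical search. It is built on top of the Apache Lucene library.

This notebook shows how to use functionality related to the Elasticsearch vector store.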

Setup

To use Elasticsearch vector search, you must install the langchain-elasticsearch package.

%pip install -qU langchain-elasticsearch

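Credentials

There are two main ways to set up an Elasticsearch instance for use:

Elastic Cloud: Elastic Cloud is a managed Elasticsearch service. Sign up for a free trial.

Local install Elasticsearch: Get started with Elasticsearch by running it locally. The easiest way is to use the official Elasticsearch Docker image. See the Elasticsearch Docker documentation for more information.

To connect to an Elasticsearch instance that does not require login credentials (for example, a Docker instance started with security disabled), pass the Elasticsearch URL and index name along with the embedding object to the constructor.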

Running Elasticsearch via Docker

Example: Run a single-node Elasticsearch instance with security disabled. This is not recommended for production use.

!docker run -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -e "xpack.security.http.ssl.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.12.1

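Running with Authentication

For production, we recommend running with security enabled. To connect with login credentials, you can use the parameter es_api_key, or es_user together with es_password.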
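As a minimal sketch of choosing between the two authentication options, the helper below assembles constructor keyword arguments from environment variables. The helper name and the variable names ES_URL, ES_API_KEY, ES_USER, and ES_PASSWORD are assumptions made for this illustration and are not defined by langchain-elasticsearch; only es_url, es_api_key, es_user, and es_password come from the text above.

```python
import os


def build_es_auth_kwargs(env=None):
    """Return ElasticsearchStore keyword arguments for one auth method.

    Hypothetical helper: the environment variable names are illustrative.
    """
    env = os.environ if env is None else env
    kwargs = {"es_url": env.get("ES_URL", "http://localhost:9200")}

    api_key = env.get("ES_API_KEY")
    user = env.get("ES_USER")
    password = env.get("ES_PASSWORD")

    if api_key:
        # Prefer the API key when one is provided.
        kwargs["es_api_key"] = api_key
    elif user and password:
        # Fall back to basic authentication.
        kwargs["es_user"] = user
        kwargs["es_password"] = password
    # Otherwise only the URL is returned (unsecured local instance).
    return kwargs


example = build_es_auth_kwargs(
    {"ES_URL": "https://localhost:9200", "ES_USER": "elastic", "ES_PASSWORD": "changeme"}
)
print(example)
```

The resulting dictionary could then be unpacked into the constructor once an embeddings object is available, for example ElasticsearchStore(index_name="langchain_index", embedding=embeddings, **build_es_auth_kwargs()).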

Select an embedding model

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_elasticsearch import ElasticsearchStore

elastic_vector_search = ElasticsearchStore(
    es_url="https://127.0.0.1:9200",
    index_name="langchain_index",
    embedding=embeddings,
    es_user="elastic",
    es_password="changeme",
)

API Reference: ElasticsearchStore

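How to obtain a password for the default "elastic" user?

To obtain your Elastic Cloud password for the default "elastic" user:

Log in to the Elastic Cloud console at https://cloud.elastic.co
Go to "Security" > "Users"
Locate the "elastic" user and click "Edit"
Click "Reset password"
Follow the prompts to reset the password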

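How to obtain an API key?

To obtain an API key:

Log in to the Elastic Cloud console at https://cloud.elastic.co
Open Kibana and go to Stack Management > API Keys
Click "Create API key"
Enter a name for the API key and click "Create"
Copy the API key and paste it into the es_api_key parameter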

Elastic Cloud

To connect to an Elasticsearch instance on Elastic Cloud, you can use either the es_cloud_id parameter or es_url.

elastic_vector_search = ElasticsearchStore(
    es_cloud_id="<cloud_id>",
    index_name="test_index",
    embedding=embeddings,
    es_user="elastic",
    es_password="changeme",
)

If you want best-in-class automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

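Initialization

Elasticsearch is running locally on localhost:9200 with Docker. For more details on how to connect to Elasticsearch on Elastic Cloud, see the Running with Authentication and Elastic Cloud sections above.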

from langchain_elasticsearch import ElasticsearchStore

vector_store = ElasticsearchStore(
    "langchain-demo", embedding=embeddings, es_url="http://localhost:9200"
)

API Reference: ElasticsearchStore

Manage vector store

Add items to vector store

from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

API Reference: Document

['21cca03c-9089-42d2-b41c-3d156be2b519',
 'a6ceb967-b552-4802-bb06-c0e95fce386e',
 '3a35fac4-e5f0-493b-bee0-9143b41aedae',
 '176da099-66b1-4d6a-811b-dfdfe0808d30',
 'ecfa1a30-3c97-408b-80c0-5c43d68bf5ff',
 'c0f08baa-e70b-4f83-b387-c6e0a0f36f73',
 '489b2c9c-1925-43e1-bcf0-0fa94cf1cbc4',
 '408c6503-9ba4-49fd-b1cc-95584cd914c5',
 '5248c899-16d5-4377-a9e9-736ca443ad4f',
 'ca182769-c4fc-4e25-8f0a-8dd0a525955c']

Delete items from vector store

vector_store.delete(ids=[uuids[-1]])

True

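Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent. These examples also show how to use filtering when searching.

Query directly

Similarity search

A simple similarity search with filtering on metadata can be performed as follows: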

results = vector_store.similarity_search(
    query="LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter=[{"term": {"metadata.source.keyword": "tweet"}}],
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
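The filter argument takes a list of Elasticsearch Query DSL clauses, so other clause types such as terms can be used in the same place. The sketch below builds such clauses with two hypothetical helpers (term_filter and terms_filter are names invented here). It assumes, as in the example above, that metadata fields are indexed under metadata.<field> with a .keyword sub-field.

```python
def term_filter(field, value):
    """Exact-match clause on one metadata field (hypothetical helper)."""
    return {"term": {f"metadata.{field}.keyword": value}}


def terms_filter(field, values):
    """Clause matching any of several values for one metadata field."""
    return {"terms": {f"metadata.{field}.keyword": list(values)}}


# Match documents whose source is either "tweet" or "news".
source_filter = [terms_filter("source", ["tweet", "news"])]
print(source_filter)

# Same clause as the one used in the similarity search above.
print([term_filter("source", "tweet")])
```

Either list could be passed as filter=... to similarity_search in place of the hand-written clause.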

Similarity search with score

If you want to run a similarity search and receive the corresponding scores, you can run:

results = vector_store.similarity_search_with_score(
    query="Will it be hot tomorrow",
    k=1,
    filter=[{"term": {"metadata.source.keyword": "news"}}],
)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")

* [SIM=0.765887] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]
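To help interpret the score: for cosine similarity, Elasticsearch kNN search documents the score as (1 + cosine) / 2, which maps the cosine range [-1, 1] onto [0, 1]. The sketch below reproduces that mapping in plain Python. It assumes the index uses cosine similarity, which may not hold for every strategy or index configuration, and the toy vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def es_cosine_score(a, b):
    """Map cosine similarity from [-1, 1] to a score in [0, 1]."""
    return (1 + cosine(a, b)) / 2


# Made-up toy vectors for illustration only.
query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(es_cosine_score(query_vec, same_direction))  # vectors point the same way
print(es_cosine_score(query_vec, orthogonal))      # vectors are unrelated
```

Under that assumption, the score of 0.765887 shown above would correspond to a cosine similarity of roughly 0.53.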

Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.2}
)
retriever.invoke("Stealing from the bank is a crime")

[Document(metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.'),
 Document(metadata={'source': 'news'}, page_content='The stock market is down 500 points today due to fears of a recession.'),
 Document(metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this review to find out.'),
 Document(metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain - come check it out!')]
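Conceptually, the similarity_score_threshold search type keeps only results whose relevance score meets the threshold. A rough sketch of that filtering step follows, using made-up (text, score) pairs rather than real search output. The function name is hypothetical.

```python
def filter_by_threshold(scored_results, score_threshold):
    """Keep results whose score is at least score_threshold."""
    return [
        (text, score)
        for text, score in scored_results
        if score >= score_threshold
    ]


# Made-up scores for illustration only.
scored = [
    ("bank robbery report", 0.83),
    ("stock market update", 0.41),
    ("pancake breakfast", 0.12),
]

kept = filter_by_threshold(scored, score_threshold=0.2)
print(kept)
```

Raising score_threshold narrows the result set, while a low value such as 0.2 lets loosely related documents through, as in the output above.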

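Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the RAG tutorials and how-to guides in the LangChain documentation.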

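FAQ

Question: I'm getting timeout errors when indexing documents into Elasticsearch. How do I fix this?

One possible cause is that your documents take longer to index into Elasticsearch. ElasticsearchStore uses the Elasticsearch bulk API, which has a few defaults that you can adjust to reduce the chance of timeout errors.

Adjusting them is also a good idea when you are using SparseVectorRetrievalStrategy.

The defaults are:

chunk_size: 500
max_chunk_bytes: 100MB

To adjust these, pass chunk_size and max_chunk_bytes inside the bulk_kwargs argument of the ElasticsearchStore add_texts method.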

    vector_store.add_texts(
        texts,
        bulk_kwargs={
            "chunk_size": 50,
            "max_chunk_bytes": 200000000
        }
    )
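To illustrate what these two settings control, the sketch below groups documents into batches, starting a new batch whenever adding a document would exceed either the document count (chunk_size) or the byte budget (max_chunk_bytes). It is a simplified illustration written for this page, not the Elasticsearch client's implementation, and the function name is hypothetical.

```python
def chunk_actions(docs, chunk_size=500, max_chunk_bytes=100 * 1024 * 1024):
    """Yield lists of docs bounded by count and by total UTF-8 size."""
    batch, batch_bytes = [], 0
    for doc in docs:
        doc_bytes = len(doc.encode("utf-8"))
        too_many = len(batch) >= chunk_size
        too_big = len(batch) > 0 and batch_bytes + doc_bytes > max_chunk_bytes
        if too_many or too_big:
            # Flush the current batch before adding this document.
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch


texts = ["a" * 40, "b" * 40, "c" * 40, "d" * 70]

# Count limit: at most two documents per batch.
print([len(b) for b in chunk_actions(texts, chunk_size=2)])

# Byte limit: at most 100 bytes per batch.
print([len(b) for b in chunk_actions(texts, max_chunk_bytes=100)])
```

Smaller batches mean more requests, but each request finishes sooner, which is why lowering chunk_size can reduce timeouts.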

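Upgrading to ElasticsearchStore

If you are already using Elasticsearch in your LangChain-based project, you may be using the old implementations, ElasticVectorSearch and ElasticKNNSearch, which are now deprecated. We have introduced a new implementation called ElasticsearchStore, which is more flexible and easier to use. This notebook will guide you through the process of upgrading to the new implementation.

What's new?

The new implementation is a single class called ElasticsearchStore, which can be used for approximate dense vector, exact dense vector, sparse vector (ELSER), BM25 retrieval, and hybrid retrieval, each selected via a strategy.

I am using ElasticKNNSearch

Old implementation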

from langchain_community.vectorstores.elastic_vector_search import ElasticKNNSearch

db = ElasticKNNSearch(
  elasticsearch_url="https://127.0.0.1:9200",
  index_name="test_index",
  embedding=embeddings
)

New implementation

from langchain_elasticsearch import ElasticsearchStore, DenseVectorStrategy

db = ElasticsearchStore(
  es_url="https://127.0.0.1:9200",
  index_name="test_index",
  embedding=embeddings,
  # if you use the model_id
  # strategy=DenseVectorStrategy(model_id="test_model")
  # if you use hybrid search
  # strategy=DenseVectorStrategy(hybrid=True)
)

API Reference: ElasticsearchStore | DenseVectorStrategy

I am using ElasticVectorSearch

Old implementation

from langchain_community.vectorstores.elastic_vector_search import ElasticVectorSearch

db = ElasticVectorSearch(
  elasticsearch_url="https://127.0.0.1:9200",
  index_name="test_index",
  embedding=embeddings
)

API Reference: ElasticVectorSearch

New implementation

from langchain_elasticsearch import ElasticsearchStore, DenseVectorScriptScoreStrategy

db = ElasticsearchStore(
  es_url="https://127.0.0.1:9200",
  index_name="test_index",
  embedding=embeddings,
  strategy=DenseVectorScriptScoreStrategy()
)

API Reference: ElasticsearchStore | DenseVectorScriptScoreStrategy

# Clean up test indices; indices that do not exist are ignored.
db.client.indices.delete(
    index="test-metadata,test-elser,test-basic",
    ignore_unavailable=True,
    allow_no_indices=True,
)

API reference

For detailed documentation of all ElasticsearchStore features and configurations, head to the API reference: https://langchain-python.dev.org.tw/api_reference/elasticsearch/vectorstores/langchain_elasticsearch.vectorstores.ElasticsearchStore.html

Vector store conceptual guide
Vector store how-to guides
