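OpenSearch
OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications, licensed under Apache 2.0.
OpenSearch is a distributed search and analytics engine based on Apache Lucene.
This notebook shows how to use functionality related to the OpenSearch database.
To run it, you need a running OpenSearch instance: see here for a simple Docker installation.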
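By default, similarity_search performs an approximate k-NN search, which uses one of several algorithms (for example lucene, nmslib, or faiss) that are recommended for large datasets. To perform a brute-force search instead, two other search methods are available: Script Scoring and Painless Scripting. Check this for more details.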
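To make the contrast with approximate search concrete, here is a minimal pure-Python sketch of what a brute-force (exact) k-NN search does: score every stored vector against the query and keep the top k. The function names and toy vectors are illustrative assumptions and are not part of OpenSearch or LangChain.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Score every vector against the query and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (hypothetical): three 2-dimensional "embeddings".
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top = brute_force_knn([1.0, 0.0], toy_vectors, k=2)
print(top)
```

Because every vector is scored, the cost grows linearly with the number of documents, which is why the approximate algorithms are the recommended choice for large datasets.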
Installation
Install the Python client.
%pip install --upgrade --quiet opensearch-py langchain-community langchain-openai langchain-text-splitters
We want to use OpenAIEmbeddings, so we need an OpenAI API key.
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
similarity_search using Approximate k-NN
similarity_search using Approximate k-NN Search with Custom Parameters
docsearch = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="https://127.0.0.1:9200"
)
# If using the default Docker installation, use this instantiation instead:
# docsearch = OpenSearchVectorSearch.from_documents(
#     docs,
#     embeddings,
#     opensearch_url="https://127.0.0.1:9200",
#     http_auth=("admin", "admin"),
#     use_ssl=False,
#     verify_certs=False,
#     ssl_assert_hostname=False,
#     ssl_show_warn=False,
# )
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query, k=10)
print(docs[0].page_content)
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="https://127.0.0.1:9200",
    engine="faiss",
    space_type="innerproduct",
    ef_construction=256,
    m=48,
)
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)
print(docs[0].page_content)
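The parameters engine, space_type, ef_construction, and m configure how the approximate k-NN index is built. As a rough sketch, they correspond to the method section of an OpenSearch knn_vector field mapping like the one below. The helper name, the field name "vector_field", and the dimension 1536 are assumptions made for illustration; the exact index body that LangChain sends may differ.

```python
from typing import Any, Dict


def sketch_knn_index_body(
    dim: int,
    engine: str,
    space_type: str,
    ef_construction: int,
    m: int,
    vector_field: str = "vector_field",  # assumed field name
) -> Dict[str, Any]:
    """Build an index body with an HNSW-backed knn_vector field (illustrative only)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dim,
                    "method": {
                        "name": "hnsw",
                        "space_type": space_type,
                        "engine": engine,
                        # Larger values generally trade indexing time and
                        # memory for better recall.
                        "parameters": {"ef_construction": ef_construction, "m": m},
                    },
                }
            }
        },
    }


# dim=1536 is an assumption about the embedding model's output size.
body = sketch_knn_index_body(
    dim=1536, engine="faiss", space_type="innerproduct", ef_construction=256, m=48
)
print(body["mappings"]["properties"]["vector_field"]["method"])
```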
similarity_search using Script Scoring
similarity_search using Script Scoring with Custom Parameters
docsearch = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="https://127.0.0.1:9200", is_appx_search=False
)
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(
    "What did the president say about Ketanji Brown Jackson",
    k=1,
    search_type="script_scoring",
)
print(docs[0].page_content)
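For orientation, an exact k-NN search with a scoring script is expressed in the OpenSearch query DSL as a script_score query that uses the knn_score script. The sketch below builds such a request body as a plain dictionary. The helper name, the default field name "vector_field", and the toy query vector are assumptions; the body LangChain actually generates may differ in detail.

```python
from typing import Any, Dict, Optional, Sequence


def sketch_script_score_query(
    query_vector: Sequence[float],
    k: int,
    space_type: str = "l2",
    vector_field: str = "vector_field",  # assumed field name
    pre_filter: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a script_score request body for exact k-NN (illustrative only)."""
    return {
        "size": k,
        "query": {
            "script_score": {
                # Only documents matching this inner query are scored.
                "query": pre_filter or {"match_all": {}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": vector_field,
                        "query_value": list(query_vector),
                        "space_type": space_type,
                    },
                },
            }
        },
    }


request_body = sketch_script_score_query([0.1, 0.2, 0.3], k=1)
print(request_body["query"]["script_score"]["script"]["params"])
```

Every document that passes the inner query is scored, so narrowing that inner query is the main lever for keeping this search type fast.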
similarity_search using Painless Scripting
similarity_search using Painless Scripting with Custom Parameters
docsearch = OpenSearchVectorSearch.from_documents(
    docs, embeddings, opensearch_url="https://127.0.0.1:9200", is_appx_search=False
)
filter = {"bool": {"filter": {"term": {"text": "smuggling"}}}}
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(
    "What did the president say about Ketanji Brown Jackson",
    search_type="painless_scripting",
    space_type="cosineSimilarity",
    pre_filter=filter,
)
print(docs[0].page_content)
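Maximum marginal relevance search (MMR)
If you want to look up similar documents but also want the results to be diverse, MMR is the method to consider. Maximal marginal relevance optimizes for similarity to the query and for diversity among the selected documents.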
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10, lambda_param=0.5)
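The trade-off that lambda_param controls can be seen in a small pure-Python sketch of greedy MMR selection. This is an illustrative implementation under stated assumptions (cosine similarity, toy 2-dimensional vectors, hypothetical function names), not the code LangChain runs. In this sketch the candidate list plays the role of the fetch_k documents retrieved first, from which k are then selected.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int,
    lambda_param: float = 0.5,
) -> List[int]:
    """Greedily pick k candidate indices, balancing relevance against redundancy."""
    query_sims = [cosine(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_param * query_sims[i] - (1.0 - lambda_param) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): candidates 0 and 1 are near-duplicates, 2 is different.
toy_query = [1.0, 0.0]
toy_candidates = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
print(mmr_select(toy_query, toy_candidates, k=2, lambda_param=0.5))
print(mmr_select(toy_query, toy_candidates, k=2, lambda_param=1.0))
```

With lambda_param=1.0 the selection is driven by relevance alone, so the two near-duplicates are returned; lowering it penalizes candidates that resemble documents already chosen.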
Using a preexisting OpenSearch instance
It is also possible to use a preexisting OpenSearch instance whose documents already have vectors.
# this is just an example, you would need to change these values to point to another opensearch instance
docsearch = OpenSearchVectorSearch(
    index_name="index-*",
    embedding_function=embeddings,
    opensearch_url="https://127.0.0.1:9200",
)
# you can specify custom field names to match the fields you're using to store your embedding, document text value, and metadata
docs = docsearch.similarity_search(
    "Who was asking about getting lunch today?",
    search_type="script_scoring",
    space_type="cosinesimil",
    vector_field="message_embedding",
    text_field="message",
    metadata_field="message_metadata",
)
Using AOSS (Amazon OpenSearch Service Serverless)
This is an example of AOSS with the faiss engine and efficient_filter.
We need to install several Python packages.
%pip install --upgrade --quiet boto3 requests requests-aws4auth
import boto3
from opensearchpy import RequestsHttpConnection
from requests_aws4auth import AWS4Auth
service = "aoss" # must set the service as 'aoss'
region = "us-east-2"
credentials = boto3.Session(
    aws_access_key_id="xxxxxx", aws_secret_access_key="xxxxx"
).get_credentials()
awsauth = AWS4Auth("xxxxx", "xxxxxx", region, service, session_token=credentials.token)
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="host url",
    http_auth=awsauth,
    timeout=300,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    index_name="test-index-using-aoss",
    engine="faiss",
)
docs = docsearch.similarity_search(
    "What is feature selection",
    efficient_filter=filter,
    k=200,
)
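Note that the filter variable passed as efficient_filter above is the dictionary defined in the Painless Scripting example. As a hedged sketch of what efficient filtering means at the query level, OpenSearch lets a filter clause sit inside the knn query itself, so the filter is applied during the vector search instead of afterwards. The helper name, the field name "vector_field", and the toy vector below are assumptions; the request LangChain builds may differ.

```python
from typing import Any, Dict, Sequence


def sketch_efficient_filter_query(
    query_vector: Sequence[float],
    k: int,
    efficient_filter: Dict[str, Any],
    vector_field: str = "vector_field",  # assumed field name
) -> Dict[str, Any]:
    """Build a knn request body with an embedded filter clause (illustrative only)."""
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": list(query_vector),
                    "k": k,
                    # The filter is evaluated as part of the k-NN search.
                    "filter": efficient_filter,
                }
            }
        },
    }


example_filter = {"bool": {"filter": {"term": {"text": "smuggling"}}}}
knn_body = sketch_efficient_filter_query([0.1, 0.2, 0.3], k=200, efficient_filter=example_filter)
print(knn_body["query"]["knn"]["vector_field"]["filter"])
```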
Using AOS (Amazon OpenSearch Service)
%pip install --upgrade --quiet boto3 requests-aws4auth
# This is just an example to show how to use Amazon OpenSearch Service, you need to set proper values.
import boto3
from opensearchpy import RequestsHttpConnection
from requests_aws4auth import AWS4Auth
service = "es" # must set the service as 'es'
region = "us-east-2"
credentials = boto3.Session(
    aws_access_key_id="xxxxxx", aws_secret_access_key="xxxxx"
).get_credentials()
awsauth = AWS4Auth("xxxxx", "xxxxxx", region, service, session_token=credentials.token)
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="host url",
    http_auth=awsauth,
    timeout=300,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    index_name="test-index",
)
docs = docsearch.similarity_search(
    "What is feature selection",
    k=200,
)