跳到主要內容
Open In ColabOpen on GitHub

scikit-learn

scikit-learn 是一個開源的機器學習演算法集合,包括 k 近鄰演算法 的一些實作。SKLearnVectorStore 包裝了這個實作,並增加了將向量資料庫持久化為 json、bson (二進制 json) 或 Apache Parquet 格式的可能性。

這個筆記本展示了如何使用 SKLearnVectorStore 向量資料庫。

您需要使用 `pip install -qU langchain-community` 安裝 `langchain-community` 才能使用此整合

%pip install --upgrade --quiet  scikit-learn

# # if you plan to use bson serialization, install also:
%pip install --upgrade --quiet bson

# # if you plan to use parquet serialization, install also:
%pip install --upgrade --quiet pandas pyarrow

要使用 OpenAI embeddings,您需要一個 OpenAI 金鑰。您可以從 https://platform.openai.com/account/api-keys 取得,或隨時使用任何其他 embeddings。

import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI key:")

基本用法

載入範例文件語料庫

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()

建立 SKLearnVectorStore,為文件語料庫建立索引,並執行範例查詢

import tempfile

persist_path = os.path.join(tempfile.gettempdir(), "union.parquet")

vector_store = SKLearnVectorStore.from_documents(
documents=docs,
embedding=embeddings,
persist_path=persist_path, # persist_path and serializer are optional
serializer="parquet",
)

query = "What did the president say about Ketanji Brown Jackson"
docs = vector_store.similarity_search(query)
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

儲存和載入向量資料庫

vector_store.persist()
print("Vector store was persisted to", persist_path)
Vector store was persisted to /var/folders/6r/wc15p6m13nl_nl_n_xfqpc5c0000gp/T/union.parquet
vector_store2 = SKLearnVectorStore(
embedding=embeddings, persist_path=persist_path, serializer="parquet"
)
print("A new instance of vector store was loaded from", persist_path)
A new instance of vector store was loaded from /var/folders/6r/wc15p6m13nl_nl_n_xfqpc5c0000gp/T/union.parquet
docs = vector_store2.similarity_search(query)
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

清理

os.remove(persist_path)

此頁面有幫助嗎?