MyScale

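MyScale is a cloud-based database optimized for AI applications and solutions, built on the open-source ClickHouse.

This notebook shows how to use functionality related to the MyScale vector database.

Setting up environments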

%pip install --upgrade --quiet  clickhouse-connect langchain-community

We want to use OpenAIEmbeddings, so we have to get an OpenAI API key.

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
if "OPENAI_API_BASE" not in os.environ:
    os.environ["OPENAI_API_BASE"] = getpass.getpass("OpenAI Base:")
if "MYSCALE_HOST" not in os.environ:
    os.environ["MYSCALE_HOST"] = getpass.getpass("MyScale Host:")
if "MYSCALE_PORT" not in os.environ:
    os.environ["MYSCALE_PORT"] = getpass.getpass("MyScale Port:")
if "MYSCALE_USERNAME" not in os.environ:
    os.environ["MYSCALE_USERNAME"] = getpass.getpass("MyScale Username:")
if "MYSCALE_PASSWORD" not in os.environ:
    os.environ["MYSCALE_PASSWORD"] = getpass.getpass("MyScale Password:")

There are two ways to set up parameters for the MyScale index.

Environment Variables

Before you run the app, set the environment variables with export: export MYSCALE_HOST='<your-endpoints-url>' MYSCALE_PORT=<your-endpoints-port> MYSCALE_USERNAME=<your-username> MYSCALE_PASSWORD=<your-password> ...

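You can easily find your account, password, and other information on our SaaS. For details, please refer to this document.

Every attribute under MyScaleSettings can be set with the prefix MYSCALE_, and the names are case-insensitive.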
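As a rough illustration of the prefix rule above, the sketch below collects every variable whose name starts with MYSCALE_ (in any letter case) into a dictionary of lower-case setting names. The helper name read_prefixed_settings is hypothetical and is not part of LangChain or MyScale; it only mimics the lookup behaviour described here, and the values in example_env are placeholders.

```python
def read_prefixed_settings(environ, prefix="MYSCALE_"):
    """Collect settings whose variable names start with `prefix`, ignoring case.

    Hypothetical helper for illustration only; MyScaleSettings performs
    its own lookup internally.
    """
    settings = {}
    for name, value in environ.items():
        if name.upper().startswith(prefix.upper()):
            # Strip the prefix and normalise the attribute name to lower case.
            settings[name[len(prefix):].lower()] = value
    return settings


# Placeholder values; replace them with the details of your own cluster.
example_env = {
    "MYSCALE_HOST": "<your-endpoints-url>",
    "myscale_port": "8443",
    "PATH": "/usr/bin",
}
print(read_prefixed_settings(example_env))
```

In a real session you would pass os.environ instead of example_env; unrelated variables such as PATH are simply ignored.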

Create a MyScaleSettings object with parameters

from langchain_community.vectorstores import MyScale, MyScaleSettings
config = MyScaleSettings(host="<your-backend-url>", port=8443, ...)
index = MyScale(embedding_function, config)
index.add_documents(...)

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import MyScale
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

API Reference: TextLoader | MyScale | OpenAIEmbeddings | CharacterTextSplitter

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

API Reference: TextLoader

for d in docs:
    d.metadata = {"some": "metadata"}
docsearch = MyScale.from_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)

Inserting data...: 100%|██████████| 42/42 [00:15<00:00,  2.66it/s]

print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Get connection info and data schema

print(str(docsearch))

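Filtering

You have direct access to the MyScale SQL WHERE statement, so you can write a WHERE clause following standard SQL.

**NOTE**: Be aware of SQL injection. This interface must not be called directly by end users.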
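One way to respect this warning is to never paste raw user text into where_str, and instead build the clause from values that have been validated first. The sketch below works under stated assumptions: build_doc_id_filter is a hypothetical helper (not a LangChain API), "metadata" stands in for the value of docsearch.metadata_column, and doc_id is the metadata key used in the examples of this notebook.

```python
import re

# Only plain SQL identifiers are accepted for the column and key names.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_doc_id_filter(meta_column, key, upper_bound):
    """Return a WHERE clause such as `metadata.doc_id < 10`.

    Hypothetical helper: it rejects anything that is not a plain identifier
    or an integer, so user input never reaches the SQL text unchecked.
    """
    for name in (meta_column, key):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    # int() raises ValueError for inputs like "10; DROP TABLE x".
    bound = int(upper_bound)
    return f"{meta_column}.{key} < {bound}"


print(build_doc_id_filter("metadata", "doc_id", "10"))
```

The returned string can then be passed as where_str. Whitelisting like this is deliberately narrow; a broader filter language would need its own careful validation.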

If you customized column_map in your settings, you can search with a filter like this:

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import MyScale

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

for i, d in enumerate(docs):
    d.metadata = {"doc_id": i}

docsearch = MyScale.from_documents(docs, embeddings)

API Reference: TextLoader | MyScale

Inserting data...: 100%|██████████| 42/42 [00:15<00:00,  2.68it/s]

Similarity search with score

The returned distance score is the cosine distance. Therefore, a lower score is better.

meta = docsearch.metadata_column
output = docsearch.similarity_search_with_relevance_scores(
    "What did the president say about Ketanji Brown Jackson?",
    k=4,
    where_str=f"{meta}.doc_id<10",
)
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + "...")

0.229655921459198 {'doc_id': 0} Madam Speaker, Madam...
0.24506962299346924 {'doc_id': 8} And so many families...
0.24786919355392456 {'doc_id': 1} Groups of citizens b...
0.24875116348266602 {'doc_id': 6} And I’m taking robus...
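To make the "lower is better" ordering concrete, here is a minimal sketch of cosine distance computed by hand on toy vectors. The vectors are made up for illustration, and the function is a plain-Python stand-in, not the routine MyScale runs internally.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
close = [1.0, 0.1]  # points in almost the same direction as the query
far = [0.0, 1.0]    # orthogonal to the query

# The closer vector gets the smaller distance, so it would rank first.
print(cosine_distance(query, close) < cosine_distance(query, far))
```

A distance of 0 means the two vectors point in the same direction, while orthogonal vectors have a distance of 1.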

Deleting your data

You can either drop the table with the .drop() method or partially delete your data with the .delete() method.

# use directly a `where_str` to delete
docsearch.delete(where_str=f"{docsearch.metadata_column}.doc_id < 5")
meta = docsearch.metadata_column
output = docsearch.similarity_search_with_relevance_scores(
    "What did the president say about Ketanji Brown Jackson?",
    k=4,
    where_str=f"{meta}.doc_id<10",
)
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + "...")

0.24506962299346924 {'doc_id': 8} And so many families...
0.24875116348266602 {'doc_id': 6} And I’m taking robus...
0.26027143001556396 {'doc_id': 7} We see the unity amo...
0.26390212774276733 {'doc_id': 9} And unlike the $2 Tr...

docsearch.drop()

Related

Vector store conceptual guide
Vector store how-to guides
