SAP HANA Cloud Vector Engine
SAP HANA Cloud Vector Engine is a vector store fully integrated into the SAP HANA Cloud database.
You need to install the langchain-community package to use this integration:
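%pip install -qU langchain-community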
Setup
Installation of the HANA database driver.
# Pip install necessary package
%pip install --upgrade --quiet hdbcli
For OpenAIEmbeddings, we use the OpenAI API key from the environment.
import os
# Use OPENAI_API_KEY env variable
# os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
Create a database connection to a HANA Cloud instance.
from dotenv import load_dotenv
from hdbcli import dbapi
load_dotenv()
# Use connection settings from the environment
connection = dbapi.connect(
address=os.environ.get("HANA_DB_ADDRESS"),
port=os.environ.get("HANA_DB_PORT"),
user=os.environ.get("HANA_DB_USER"),
password=os.environ.get("HANA_DB_PASSWORD"),
autocommit=True,
sslValidateCertificate=False,
)
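Optionally, you can verify that the connection works; a minimal check, assuming hdbcli's isconnected() helper:
# Should print True once the connection to the HANA Cloud instance is established
print(connection.isconnected())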
Example
Load the sample document "state_of_the_union.txt" and create chunks from it.
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores.hanavector import HanaDB
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
text_documents = TextLoader("../../how_to/state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
text_chunks = text_splitter.split_documents(text_documents)
print(f"Number of document chunks: {len(text_chunks)}")
embeddings = OpenAIEmbeddings()
Number of document chunks: 88
Create a LangChain VectorStore interface for the HANA database and specify the table (collection) to use for accessing the vector embeddings
db = HanaDB(
embedding=embeddings, connection=connection, table_name="STATE_OF_THE_UNION"
)
Add the loaded document chunks to the table. For this example, we delete any content that might already be in the table from previous runs.
# Delete already existing documents from the table
db.delete(filter={})
# add the loaded document chunks
db.add_documents(text_chunks)
[]
Perform a query to get the two best-matching document chunks from the ones added in the previous step. By default, "Cosine Similarity" is used for the search.
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query, k=2)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.
While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.
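If you also want to inspect the similarity values, the LangChain-standard similarity_search_with_score method should work here as well; a minimal sketch (the score scale depends on the chosen distance strategy):
# Retrieve documents together with their similarity scores
docs_and_scores = db.similarity_search_with_score(query, k=2)
for doc, score in docs_and_scores:
    print("-" * 80)
    print(f"Score: {score}")
    print(doc.page_content)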
Query the same content with "Euclidean Distance". The results should be the same as with "Cosine Similarity".
from langchain_community.vectorstores.utils import DistanceStrategy
db = HanaDB(
embedding=embeddings,
connection=connection,
distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
table_name="STATE_OF_THE_UNION",
)
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query, k=2)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.
While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.
Maximal Marginal Relevance Search (MMR)
Maximal marginal relevance optimizes for similarity to the query and for diversity among the selected documents. The first 20 (fetch_k) items are retrieved from the DB; the MMR algorithm then finds the best 2 (k) matches.
docs = db.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.
Creating an HNSW Vector Index
A vector index can significantly speed up top-k nearest-neighbor queries on vectors. Users can create a Hierarchical Navigable Small World (HNSW) vector index using the create_hnsw_index function.
For more information about creating an index at the database level, please refer to the official documentation.
# HanaDB instance uses cosine similarity as default:
db_cosine = HanaDB(
embedding=embeddings, connection=connection, table_name="STATE_OF_THE_UNION"
)
# Attempting to create the HNSW index with default parameters
db_cosine.create_hnsw_index() # If no other parameters are specified, the default values will be used
# Default values: m=64, ef_construction=128, ef_search=200
# The default index name will be: STATE_OF_THE_UNION_COSINE_SIMILARITY_IDX (verify this naming pattern in HanaDB class)
# Creating a HanaDB instance with L2 distance as the similarity function and defined values
db_l2 = HanaDB(
embedding=embeddings,
connection=connection,
table_name="STATE_OF_THE_UNION",
distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE, # Specify L2 distance
)
# This will create an index based on L2 distance strategy.
db_l2.create_hnsw_index(
index_name="STATE_OF_THE_UNION_L2_index",
m=100, # Max number of neighbors per graph node (valid range: 4 to 1000)
ef_construction=200, # Max number of candidates during graph construction (valid range: 1 to 100000)
ef_search=500, # Min number of candidates during the search (valid range: 1 to 100000)
)
# Use L2 index to perform MMR
docs = db_l2.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.
Key points:
- Similarity function: The similarity function for the index is cosine similarity by default. If you want to use a different similarity function (e.g., L2 distance), you need to specify it when initializing the HanaDB instance.
- Default parameters: In the create_hnsw_index function, if the user does not provide custom values for parameters like m, ef_construction, or ef_search, the default values (e.g., m=64, ef_construction=128, ef_search=200) are used automatically. These values ensure that the index is created with reasonable performance without requiring user intervention.
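Once an index exists, the search API itself does not change; whether HANA answers a query from the HNSW index is decided inside the database. A quick sketch reusing the instances from above:
# Same search call as before; with the index in place, HANA can serve it from the HNSW index
docs = db_cosine.similarity_search(query, k=2)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)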
Basic Vectorstore Operations
db = HanaDB(
connection=connection, embedding=embeddings, table_name="LANGCHAIN_DEMO_BASIC"
)
# Delete already existing documents from the table
db.delete(filter={})
True
We can add simple text documents to the existing table.
docs = [Document(page_content="Some text"), Document(page_content="Other docs")]
db.add_documents(docs)
[]
Add documents with metadata.
docs = [
Document(
page_content="foo",
metadata={"start": 100, "end": 150, "doc_name": "foo.txt", "quality": "bad"},
),
Document(
page_content="bar",
metadata={"start": 200, "end": 250, "doc_name": "bar.txt", "quality": "good"},
),
]
db.add_documents(docs)
[]
Query documents with specific metadata.
docs = db.similarity_search("foobar", k=2, filter={"quality": "bad"})
# With filtering on "quality"=="bad", only one document should be returned
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
    print(doc.metadata)
--------------------------------------------------------------------------------
foo
{'start': 100, 'end': 150, 'doc_name': 'foo.txt', 'quality': 'bad'}
Delete documents with specific metadata.
db.delete(filter={"quality": "bad"})
# Now the similarity search with the same filter will return no results
docs = db.similarity_search("foobar", k=2, filter={"quality": "bad"})
print(len(docs))
0
Advanced Filtering
In addition to the basic value-based filtering capabilities, more advanced filtering can be used. The table below shows the available filter operators.
Operator | Semantics |
---|---|
$eq | Equality (==) |
$ne | Inequality (!=) |
$lt | Less than (<) |
$lte | Less than or equal (<=) |
$gt | Greater than (>) |
$gte | Greater than or equal (>=) |
$in | Contained in a set of given values (in) |
$nin | Not contained in a set of given values (not in) |
$between | Between the range of two boundary values |
$like | Text equality based on the "LIKE" semantics in SQL (using "%" as wildcard) |
$and | Logical "and", supporting 2 or more operands |
$or | Logical "or", supporting 2 or more operands |
# Prepare some test documents
docs = [
Document(
page_content="First",
metadata={"name": "adam", "is_active": True, "id": 1, "height": 10.0},
),
Document(
page_content="Second",
metadata={"name": "bob", "is_active": False, "id": 2, "height": 5.7},
),
Document(
page_content="Third",
metadata={"name": "jane", "is_active": True, "id": 3, "height": 2.4},
),
]
db = HanaDB(
connection=connection,
embedding=embeddings,
table_name="LANGCHAIN_DEMO_ADVANCED_FILTER",
)
# Delete already existing documents from the table
db.delete(filter={})
db.add_documents(docs)
# Helper function for printing filter results
def print_filter_result(result):
    if len(result) == 0:
        print("<empty result>")
    for doc in result:
        print(doc.metadata)
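The explicit $eq operator from the table above is not shown in the examples below; it behaves like the plain key/value form. A small sketch:
advanced_filter = {"name": {"$eq": "adam"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))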
Filtering with $ne, $gt, $gte, $lt, $lte
advanced_filter = {"id": {"$ne": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$gt": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$gte": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$lt": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"id": {"$lte": 1}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'id': {'$ne': 1}}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Filter: {'id': {'$gt': 1}}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Filter: {'id': {'$gte': 1}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Filter: {'id': {'$lt': 1}}
<empty result>
Filter: {'id': {'$lte': 1}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
Filtering with $between, $in, $nin
advanced_filter = {"id": {"$between": (1, 2)}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$in": ["adam", "bob"]}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$nin": ["adam", "bob"]}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'id': {'$between': (1, 2)}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$in': ['adam', 'bob']}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'name': {'$nin': ['adam', 'bob']}}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Text filtering with $like
advanced_filter = {"name": {"$like": "a%"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"name": {"$like": "%a%"}}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'name': {'$like': 'a%'}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
Filter: {'name': {'$like': '%a%'}}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Combined filtering with $and, $or
advanced_filter = {"$or": [{"id": 1}, {"name": "bob"}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"$and": [{"id": 1}, {"id": 2}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
advanced_filter = {"$or": [{"id": 1}, {"id": 2}, {"id": 3}]}
print(f"Filter: {advanced_filter}")
print_filter_result(db.similarity_search("just testing", k=5, filter=advanced_filter))
Filter: {'$or': [{'id': 1}, {'name': 'bob'}]}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
Filter: {'$and': [{'id': 1}, {'id': 2}]}
<empty result>
Filter: {'$or': [{'id': 1}, {'id': 2}, {'id': 3}]}
{'name': 'adam', 'is_active': True, 'id': 1, 'height': 10.0}
{'name': 'bob', 'is_active': False, 'id': 2, 'height': 5.7}
{'name': 'jane', 'is_active': True, 'id': 3, 'height': 2.4}
Using a VectorStore as a retriever in chains for retrieval augmented generation (RAG)
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
# Access the vector DB with a new table
db = HanaDB(
connection=connection,
embedding=embeddings,
table_name="LANGCHAIN_DEMO_RETRIEVAL_CHAIN",
)
# Delete already existing entries from the table
db.delete(filter={})
# add the loaded document chunks from the "State Of The Union" file
db.add_documents(text_chunks)
# Create a retriever instance of the vector store
retriever = db.as_retriever()
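The retriever can also be used on its own. With recent LangChain versions it is a Runnable, so invoke should return the matching document chunks directly; a small sketch:
# Stand-alone retrieval outside of any chain; LangChain retrievers return 4 documents by default
relevant_docs = retriever.invoke("What did the president say about Ketanji Brown Jackson")
print(f"Number of retrieved document chunks: {len(relevant_docs)}")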
Define the prompt.
from langchain_core.prompts import PromptTemplate
prompt_template = """
You are an expert in state of the union topics. You are provided multiple context items that are related to the prompt you have to answer.
Use the following pieces of context to answer the question at the end.
'''
{context}
'''
Question: {question}
"""
PROMPT = PromptTemplate(
template=prompt_template, input_variables=["context", "question"]
)
chain_type_kwargs = {"prompt": PROMPT}
Create the ConversationalRetrievalChain, which handles the chat history and the retrieval of similar document chunks to be added to the prompt.
from langchain.chains import ConversationalRetrievalChain
llm = ChatOpenAI(model="gpt-3.5-turbo")
memory = ConversationBufferMemory(
memory_key="chat_history", output_key="answer", return_messages=True
)
qa_chain = ConversationalRetrievalChain.from_llm(
llm,
db.as_retriever(search_kwargs={"k": 5}),
return_source_documents=True,
memory=memory,
verbose=False,
combine_docs_chain_kwargs={"prompt": PROMPT},
)
Ask the first question (and verify how many text chunks have been used).
question = "What about Mexico and Guatemala?"
result = qa_chain.invoke({"question": question})
print("Answer from LLM:")
print("================")
print(result["answer"])
source_docs = result["source_documents"]
print("================")
print(f"Number of used source document chunks: {len(source_docs)}")
Answer from LLM:
================
The United States has set up joint patrols with Mexico and Guatemala to catch more human traffickers. This collaboration is part of the efforts to address immigration issues and secure the borders in the region.
================
Number of used source document chunks: 5
Examine the chunks used in the chain in detail. Check whether the best-ranked chunk contains the information about "Mexico and Guatemala" mentioned in the question.
for doc in source_docs:
    print("-" * 80)
    print(doc.page_content)
    print(doc.metadata)
Ask another question on the same conversational chain. The answer should relate to the previous answer given.
question = "What about other countries?"
result = qa_chain.invoke({"question": question})
print("Answer from LLM:")
print("================")
print(result["answer"])
Answer from LLM:
================
Mexico and Guatemala are involved in joint patrols to catch human traffickers.
Standard tables vs. "custom" tables with vector data
By default, the table for the embeddings is created with 3 columns:
- A column VEC_TEXT, which contains the text of the Document
- A column VEC_META, which contains the metadata of the Document
- A column VEC_VECTOR, which contains the embeddings vector of the Document's text
# Access the vector DB with a new table
db = HanaDB(
connection=connection, embedding=embeddings, table_name="LANGCHAIN_DEMO_NEW_TABLE"
)
# Delete already existing entries from the table
db.delete(filter={})
# Add a simple document with some metadata
docs = [
Document(
page_content="A simple document",
metadata={"start": 100, "end": 150, "doc_name": "simple.txt"},
)
]
db.add_documents(docs)
[]
Show the columns in table "LANGCHAIN_DEMO_NEW_TABLE"
cur = connection.cursor()
cur.execute(
"SELECT COLUMN_NAME, DATA_TYPE_NAME FROM SYS.TABLE_COLUMNS WHERE SCHEMA_NAME = CURRENT_SCHEMA AND TABLE_NAME = 'LANGCHAIN_DEMO_NEW_TABLE'"
)
rows = cur.fetchall()
for row in rows:
    print(row)
cur.close()
('VEC_META', 'NCLOB')
('VEC_TEXT', 'NCLOB')
('VEC_VECTOR', 'REAL_VECTOR')
Show the values of the inserted document in the three columns
cur = connection.cursor()
cur.execute(
"SELECT VEC_TEXT, VEC_META, TO_NVARCHAR(VEC_VECTOR) FROM LANGCHAIN_DEMO_NEW_TABLE LIMIT 1"
)
rows = cur.fetchall()
print(rows[0][0]) # The text
print(rows[0][1]) # The metadata
print(rows[0][2]) # The vector
cur.close()
Custom tables must contain at least three columns that match the semantics of a standard table:
- A column of type NCLOB or NVARCHAR for the text/context of the embeddings
- A column of type NCLOB or NVARCHAR for the metadata
- A column of type REAL_VECTOR for the embedding vector
The table can contain additional columns. When new documents are inserted into the table, these additional columns must allow NULL values.
# Create a new table "MY_OWN_TABLE_ADD" with three "standard" columns and one additional column
my_own_table_name = "MY_OWN_TABLE_ADD"
cur = connection.cursor()
cur.execute(
(
f"CREATE TABLE {my_own_table_name} ("
"SOME_OTHER_COLUMN NVARCHAR(42), "
"MY_TEXT NVARCHAR(2048), "
"MY_METADATA NVARCHAR(1024), "
"MY_VECTOR REAL_VECTOR )"
)
)
# Create a HanaDB instance with the own table
db = HanaDB(
connection=connection,
embedding=embeddings,
table_name=my_own_table_name,
content_column="MY_TEXT",
metadata_column="MY_METADATA",
vector_column="MY_VECTOR",
)
# Add a simple document with some metadata
docs = [
Document(
page_content="Some other text",
metadata={"start": 400, "end": 450, "doc_name": "other.txt"},
)
]
db.add_documents(docs)
# Check if data has been inserted into our own table
cur.execute(f"SELECT * FROM {my_own_table_name} LIMIT 1")
rows = cur.fetchall()
print(rows[0][0])  # Value of column "SOME_OTHER_COLUMN". Should be NULL/None
print(rows[0][1]) # The text
print(rows[0][2]) # The metadata
print(rows[0][3]) # The vector
cur.close()
None
Some other text
{"start": 400, "end": 450, "doc_name": "other.txt"}
<memory at 0x7f5edcb18d00>
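The raw REAL_VECTOR value is returned as a binary object, which is why a memoryview is printed above. To see it in readable form, cast it in SQL as in the standard-table example; a small sketch:
# Cast the binary vector to NVARCHAR for a human-readable representation
cur = connection.cursor()
cur.execute(f"SELECT TO_NVARCHAR(MY_VECTOR) FROM {my_own_table_name} LIMIT 1")
print(cur.fetchone()[0])
cur.close()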
Add another document and perform a similarity search on the custom table.
docs = [
Document(
page_content="Some more text",
metadata={"start": 800, "end": 950, "doc_name": "more.txt"},
)
]
db.add_documents(docs)
query = "What's up?"
docs = db.similarity_search(query, k=2)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
Some other text
--------------------------------------------------------------------------------
Some more text
Filter Performance Optimization Using Custom Columns
To allow flexible metadata values, all metadata is stored as JSON in the metadata column by default. If some of the used metadata keys and value types are known, they can be stored in additional columns instead: create the target table with the key names as column names and pass them to the HanaDB constructor via the specific_metadata_columns list. Metadata keys that match those names are copied into the special columns during insert, and filters on keys contained in the specific_metadata_columns list use the special columns instead of the metadata JSON column.
# Create a new table "PERFORMANT_CUSTOMTEXT_FILTER" with three "standard" columns and one additional column
my_own_table_name = "PERFORMANT_CUSTOMTEXT_FILTER"
cur = connection.cursor()
cur.execute(
(
f"CREATE TABLE {my_own_table_name} ("
"CUSTOMTEXT NVARCHAR(500), "
"MY_TEXT NVARCHAR(2048), "
"MY_METADATA NVARCHAR(1024), "
"MY_VECTOR REAL_VECTOR )"
)
)
# Create a HanaDB instance with the own table
db = HanaDB(
connection=connection,
embedding=embeddings,
table_name=my_own_table_name,
content_column="MY_TEXT",
metadata_column="MY_METADATA",
vector_column="MY_VECTOR",
specific_metadata_columns=["CUSTOMTEXT"],
)
# Add a simple document with some metadata
docs = [
Document(
page_content="Some other text",
metadata={
"start": 400,
"end": 450,
"doc_name": "other.txt",
"CUSTOMTEXT": "Filters on this value are very performant",
},
)
]
db.add_documents(docs)
# Check if data has been inserted into our own table
cur.execute(f"SELECT * FROM {my_own_table_name} LIMIT 1")
rows = cur.fetchall()
print(
rows[0][0]
) # Value of column "CUSTOMTEXT". Should be "Filters on this value are very performant"
print(rows[0][1]) # The text
print(
    rows[0][2]
)  # The metadata; the "CUSTOMTEXT" value is additionally extracted into its own column
print(rows[0][3]) # The vector
cur.close()
Filters on this value are very performant
Some other text
{"start": 400, "end": 450, "doc_name": "other.txt", "CUSTOMTEXT": "Filters on this value are very performant"}
<memory at 0x7f5edcb193c0>
The special columns are completely transparent to the rest of the LangChain interface. Everything works just as it did before, only more performant.
docs = [
Document(
page_content="Some more text",
metadata={
"start": 800,
"end": 950,
"doc_name": "more.txt",
"CUSTOMTEXT": "Another customtext value",
},
)
]
db.add_documents(docs)
advanced_filter = {"CUSTOMTEXT": {"$like": "%value%"}}
query = "What's up?"
docs = db.similarity_search(query, k=2, filter=advanced_filter)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
--------------------------------------------------------------------------------
Some other text
--------------------------------------------------------------------------------
Some more text