Oracle AI Vector Search: Vector Store
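Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads and lets you query data based on semantics rather than keywords. One of its biggest benefits is that semantic search on unstructured data can be combined with relational search on business data in a single system. This is not only powerful but also more efficient, because you do not need to add a specialized vector database, which eliminates the pain of data fragmentation across multiple systems.
In addition, your vectors can benefit from all of Oracle Database's most powerful features, such as the following: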
- Partitioning Support
- Real Application Clusters scalability
- Exadata smart scans
- Shard processing across geographically distributed databases
- Transactions
- Parallel SQL
- Disaster recovery
- Security
- Oracle Machine Learning
- Oracle Graph Database
- Oracle Spatial and Graph
- Oracle Blockchain
- JSON
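If you are just starting with Oracle Database, consider exploring the free Oracle Database 23ai, which provides a good introduction to setting up a database environment. When working with the database, it is generally advisable to avoid using the system user by default; instead, create your own user for better security and customization. For detailed steps on user creation, refer to our end-to-end guide, which also shows how to set up a user in Oracle. Understanding user privileges is also crucial for managing database security effectively; the official Oracle guide covers administering user accounts and security in more detail.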
Prerequisites for using Langchain with Oracle AI Vector Search
To use this integration, install langchain-community with pip install -qU langchain-community.
Also install the Oracle Python client driver so that Langchain can work with Oracle AI Vector Search.
# pip install oracledb
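Connecting to Oracle AI Vector Search
The following sample code shows how to connect to Oracle Database. By default, python-oracledb runs in "Thin" mode, which connects directly to Oracle Database and does not need Oracle Client libraries. Some additional functionality is available when python-oracledb uses those libraries; in that case python-oracledb is said to be in "Thick" mode. Both modes have comprehensive functionality supporting the Python Database API v2.0 Specification. The python-oracledb guide discusses the features supported in each mode. If you are unable to use Thin mode, you may need to switch to Thick mode.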
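import oracledb

# Replace these placeholders with your own credentials and connect string
username = "username"
password = "password"
dsn = "ipaddress:port/orclpdb1"

try:
    connection = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")
except Exception as e:
    # Report the underlying error so a failed connection can be diagnosed
    print(f"Connection failed: {e}")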
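The hard-coded credentials above are placeholders. The following is a minimal sketch, not part of the integration itself, of how the connection settings might be assembled without embedding secrets in the script. It assumes the `host:port/service_name` connect-string form seen in the example DSN; the environment variable names (`ORACLE_USER`, `ORACLE_PASSWORD`, `ORACLE_HOST`, `ORACLE_PORT`, `ORACLE_SERVICE`) and both helper functions are hypothetical.

```python
import os


def build_dsn(host: str, port: int, service_name: str) -> str:
    """Return a connect string of the form 'host:port/service_name'."""
    if not host or not service_name:
        raise ValueError("host and service_name must be non-empty")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"{host}:{port}/{service_name}"


def load_connection_settings(env=os.environ) -> dict:
    """Collect connect() keyword arguments from hypothetical environment variables."""
    return {
        "user": env.get("ORACLE_USER", "username"),
        "password": env.get("ORACLE_PASSWORD", "password"),
        "dsn": build_dsn(
            env.get("ORACLE_HOST", "localhost"),
            int(env.get("ORACLE_PORT", "1521")),
            env.get("ORACLE_SERVICE", "orclpdb1"),
        ),
    }


# A plain dict stands in for the process environment in this demonstration
settings = load_connection_settings({"ORACLE_HOST": "dbhost", "ORACLE_PORT": "1522"})
print(settings["dsn"])
```

The resulting dictionary has the same keys as the keyword arguments used in the connection example, so it could be passed as `oracledb.connect(**settings)`.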
Import the dependencies required to use Oracle AI Vector Search
from langchain_community.vectorstores import oraclevs
from langchain_community.vectorstores.oraclevs import OracleVS
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
Load documents
# Define a list of documents (the examples below are four passages from the Oracle Concepts Manual)
documents_json_list = [
{
"id": "cncpt_15.5.3.2.2_P4",
"text": "If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.",
"link": "https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/logical-storage-structures.html#GUID-5387D7B2-C0CA-4C1E-811B-C7EB9B636442",
},
{
"id": "cncpt_15.5.5_P1",
"text": "A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.",
"link": "https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/logical-storage-structures.html#GUID-D02B2220-E6F5-40D9-AFB5-BC69BCEF6CD4",
},
{
"id": "cncpt_22.3.4.3.1_P2",
"text": "The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.",
"link": "https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/concepts-for-database-developers.html#GUID-3C50EAB8-FC39-4BB3-B680-4EACCE49E866",
},
{
"id": "cncpt_22.3.4.3.1_P3",
"text": "The LOB segment stores data in pieces called chunks. A chunk is a logically contiguous set of data blocks and is the smallest unit of allocation for a LOB. A row in the table stores a pointer called a LOB locator, which points to the LOB index. When the table is queried, the database uses the LOB index to quickly locate the LOB chunks.",
"link": "https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/concepts-for-database-developers.html#GUID-3C50EAB8-FC39-4BB3-B680-4EACCE49E866",
},
]
# Create Langchain Documents
documents_langchain = []

for doc in documents_json_list:
    metadata = {"id": doc["id"], "link": doc["link"]}
    doc_langchain = Document(page_content=doc["text"], metadata=metadata)
    documents_langchain.append(doc_langchain)
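Create vector stores with different distance metrics using AI Vector Search
First, we create three vector stores, each using a different distance function. Because we have not created indexes on them yet, this step only creates the tables. Later, we will use these vector stores to create HNSW indexes. To learn more about the index types that Oracle AI Vector Search supports, see the Oracle AI Vector Search guide.
You can connect to Oracle Database manually and see three tables: Documents_DOT, Documents_COSINE, and Documents_EUCLIDEAN.
We then create three more tables, Documents_DOT_IVF, Documents_COSINE_IVF, and Documents_EUCLIDEAN_IVF, which will be used to create IVF indexes on the tables instead of HNSW indexes.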
# Ingest documents into Oracle Vector Store using different distance strategies
# When using our API calls, start by initializing your vector store with a subset of your documents
# through from_documents(), then incrementally add more documents using add_texts().
# This approach prevents system overload and ensures efficient document processing.
model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vector_store_dot = OracleVS.from_documents(
documents_langchain,
model,
client=connection,
table_name="Documents_DOT",
distance_strategy=DistanceStrategy.DOT_PRODUCT,
)
vector_store_max = OracleVS.from_documents(
documents_langchain,
model,
client=connection,
table_name="Documents_COSINE",
distance_strategy=DistanceStrategy.COSINE,
)
vector_store_euclidean = OracleVS.from_documents(
documents_langchain,
model,
client=connection,
table_name="Documents_EUCLIDEAN",
distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
)
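To make the three distance strategies used above more concrete, here is a small pure-Python sketch of the underlying formulas. It is only an illustration of the mathematics under the usual textbook definitions; it does not show how Oracle AI Vector Search computes or ranks by these measures internally, and the function names are chosen for this example.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger means more aligned."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(dot_product(query, same_direction))         # 2.0
print(cosine_distance(query, same_direction))     # 0.0
print(cosine_distance(query, orthogonal))         # 1.0
print(euclidean_distance(query, same_direction))  # 1.0
```

Note that the cosine distance ignores vector length (the query and `same_direction` are at distance 0), while the dot product and Euclidean distance do not, which is one reason the same documents can rank differently under different strategies.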
# Ingest the same documents into a second set of tables that will later receive IVF indexes
vector_store_dot_ivf = OracleVS.from_documents(
documents_langchain,
model,
client=connection,
table_name="Documents_DOT_IVF",
distance_strategy=DistanceStrategy.DOT_PRODUCT,
)
vector_store_max_ivf = OracleVS.from_documents(
documents_langchain,
model,
client=connection,
table_name="Documents_COSINE_IVF",
distance_strategy=DistanceStrategy.COSINE,
)
vector_store_euclidean_ivf = OracleVS.from_documents(
documents_langchain,
model,
client=connection,
table_name="Documents_EUCLIDEAN_IVF",
distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
)
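The code comments above recommend initializing a vector store with a subset of documents through from_documents() and then adding the rest incrementally with add_texts(). A minimal sketch of the batching part, assuming a hypothetical helper named `chunked` and made-up batch sizes (the right sizes depend on your data and system):

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


texts = [f"passage {n}" for n in range(7)]

# The first slice would seed the store via OracleVS.from_documents(...),
# after wrapping each text in a Document as shown earlier.
initial, remaining = texts[:3], texts[3:]

# Each later batch would be passed to vector_store.add_texts(batch, batch_metadata).
batches = list(chunked(remaining, 2))
print(len(initial), [len(b) for b in batches])  # 3 [2, 2]
```

Keeping the batches small bounds how many embeddings are computed and inserted per call, which is the point of the incremental approach described in the comments.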
Demonstrating add and delete operations for texts, along with basic similarity search
def manage_texts(vector_stores):
    """
    Adds texts to each vector store, demonstrates error handling for duplicate additions,
    and performs deletion of texts. Also runs a basic similarity search on each vector store.

    Args:
    - vector_stores (list): A list of OracleVS instances.
    """
    texts = ["Rohan", "Shailendra"]
    metadata = [
        {"id": "100", "link": "Document Example Test 1"},
        {"id": "101", "link": "Document Example Test 2"},
    ]

    for i, vs in enumerate(vector_stores, start=1):
        # Adding texts
        try:
            vs.add_texts(texts, metadata)
            print(f"\n\n\nAdd texts complete for vector store {i}\n\n\n")
        except Exception as ex:
            print(f"\n\n\nExpected error on duplicate add for vector store {i}\n\n\n")

        # Deleting texts using the value of 'id'
        vs.delete([metadata[0]["id"]])
        print(f"\n\n\nDelete texts complete for vector store {i}\n\n\n")

        # Similarity search
        results = vs.similarity_search("How are LOBS stored in Oracle Database", 2)
        print(f"\n\n\nSimilarity search results for vector store {i}: {results}\n\n\n")
vector_store_list = [
vector_store_dot,
vector_store_max,
vector_store_euclidean,
vector_store_dot_ivf,
vector_store_max_ivf,
vector_store_euclidean_ivf,
]
manage_texts(vector_store_list)
Demonstrating index creation with specific parameters for each distance strategy
def create_search_indices(connection):
    """
    Creates search indices for the vector stores, each with specific parameters tailored to their distance strategy.
    """
    # Index for DOT_PRODUCT strategy
    # Notice we are creating a HNSW index with default parameters
    # This will default to creating a HNSW index with 8 Parallel Workers and use the Default Accuracy used by Oracle AI Vector Search
    oraclevs.create_index(
        connection,
        vector_store_dot,
        params={"idx_name": "hnsw_idx1", "idx_type": "HNSW"},
    )

    # Index for COSINE strategy with specific parameters
    # Notice we are creating a HNSW index with parallel 16 and Target Accuracy Specification as 97 percent
    oraclevs.create_index(
        connection,
        vector_store_max,
        params={
            "idx_name": "hnsw_idx2",
            "idx_type": "HNSW",
            "accuracy": 97,
            "parallel": 16,
        },
    )

    # Index for EUCLIDEAN_DISTANCE strategy with specific parameters
    # Notice we are creating a HNSW index by specifying Power User Parameters which are neighbors = 64 and efConstruction = 100
    oraclevs.create_index(
        connection,
        vector_store_euclidean,
        params={
            "idx_name": "hnsw_idx3",
            "idx_type": "HNSW",
            "neighbors": 64,
            "efConstruction": 100,
        },
    )

    # Index for DOT_PRODUCT strategy
    # Notice we are creating an IVF index with default parameters
    # This will default to creating an IVF index with 8 Parallel Workers and use the Default Accuracy used by Oracle AI Vector Search
    oraclevs.create_index(
        connection,
        vector_store_dot_ivf,
        params={
            "idx_name": "ivf_idx1",
            "idx_type": "IVF",
        },
    )

    # Index for COSINE strategy with specific parameters
    # Notice we are creating an IVF index with parallel 32 and Target Accuracy Specification as 90 percent
    oraclevs.create_index(
        connection,
        vector_store_max_ivf,
        params={
            "idx_name": "ivf_idx2",
            "idx_type": "IVF",
            "accuracy": 90,
            "parallel": 32,
        },
    )

    # Index for EUCLIDEAN_DISTANCE strategy with specific parameters
    # Notice we are creating an IVF index by specifying the Power User Parameter neighbor_part = 64
    oraclevs.create_index(
        connection,
        vector_store_euclidean_ivf,
        params={"idx_name": "ivf_idx3", "idx_type": "IVF", "neighbor_part": 64},
    )

    print("Index creation complete.")
create_search_indices(connection)
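The six calls above pass hand-written params dictionaries. As a sketch only, a small helper could assemble and sanity-check such a dictionary before it reaches oraclevs.create_index. The helper `build_index_params` is hypothetical, and the grouping of keys (`neighbors` and `efConstruction` for HNSW, `neighbor_part` for IVF, `accuracy` and `parallel` for both) is an assumption inferred solely from the examples on this page, not from the library's own validation rules.

```python
# Key groupings inferred from the examples above (an assumption, not the library's rules)
HNSW_ONLY = {"neighbors", "efConstruction"}
IVF_ONLY = {"neighbor_part"}
COMMON = {"accuracy", "parallel"}


def build_index_params(idx_name, idx_type, **options):
    """Return a params dict shaped like the ones passed to create_index above."""
    idx_type = idx_type.upper()
    if idx_type == "HNSW":
        allowed = COMMON | HNSW_ONLY
    elif idx_type == "IVF":
        allowed = COMMON | IVF_ONLY
    else:
        raise ValueError(f"unsupported idx_type: {idx_type}")

    unknown = set(options) - allowed
    if unknown:
        raise ValueError(f"{sorted(unknown)} not valid for {idx_type}")

    accuracy = options.get("accuracy")
    if accuracy is not None and not 0 < accuracy <= 100:
        raise ValueError("accuracy is a percentage and must be in (0, 100]")

    return {"idx_name": idx_name, "idx_type": idx_type, **options}


print(build_index_params("hnsw_idx2", "HNSW", accuracy=97, parallel=16))
print(build_index_params("ivf_idx3", "IVF", neighbor_part=64))
```

Catching a misplaced key locally (for example `neighbors` on an IVF index) gives a clearer error than waiting for the database to reject the index definition.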
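Demonstrating advanced searches on all six vector stores, with and without attribute filtering. With filtering, we select only document ID 101 and nothing else.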
# Conduct advanced searches after creating the indices
def conduct_advanced_searches(vector_stores):
    query = "How are LOBS stored in Oracle Database"
    # Constructing a filter for direct comparison against document metadata
    # This filter aims to include documents whose metadata 'id' is exactly '101'
    filter_criteria = {"id": ["101"]}  # Direct comparison filter

    for i, vs in enumerate(vector_stores, start=1):
        print(f"\n--- Vector Store {i} Advanced Searches ---")
        # Similarity search without a filter
        print("\nSimilarity search results without filter:")
        print(vs.similarity_search(query, 2))

        # Similarity search with a filter
        print("\nSimilarity search results with filter:")
        print(vs.similarity_search(query, 2, filter=filter_criteria))

        # Similarity search with relevance score
        print("\nSimilarity search with relevance score:")
        print(vs.similarity_search_with_score(query, 2))

        # Similarity search with relevance score with filter
        print("\nSimilarity search with relevance score with filter:")
        print(vs.similarity_search_with_score(query, 2, filter=filter_criteria))

        # Max marginal relevance search
        print("\nMax marginal relevance search results:")
        print(vs.max_marginal_relevance_search(query, 2, fetch_k=20, lambda_mult=0.5))

        # Max marginal relevance search with filter
        print("\nMax marginal relevance search results with filter:")
        print(
            vs.max_marginal_relevance_search(
                query, 2, fetch_k=20, lambda_mult=0.5, filter=filter_criteria
            )
        )
conduct_advanced_searches(vector_store_list)
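The max marginal relevance (MMR) calls above take fetch_k and lambda_mult arguments. The sketch below shows the general MMR idea in pure Python: from a pool of fetched candidates (the role fetch_k plays), repeatedly pick the one that best balances relevance to the query against similarity to what has already been selected, with lambda_mult weighting the two. This is a generic illustration using cosine similarity, not the OracleVS implementation; `mmr_select` and the toy vectors are made up for this example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in remaining:
            relevance = cosine_similarity(query_vec, candidate_vecs[idx])
            # Highest similarity to anything already picked (0 if nothing picked yet)
            redundancy = max(
                (cosine_similarity(candidate_vecs[idx], candidate_vecs[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.3],   # most relevant
    [1.0, 0.32],  # near-duplicate of the first
    [1.0, -0.5],  # slightly less relevant, but different from the first
]

print(mmr_select(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
```

With lambda_mult=1.0 the selection reduces to plain relevance ranking and returns the near-duplicate; with lambda_mult=0.5 the redundancy penalty makes the more diverse third candidate win the second slot.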
End-to-end demo
Please refer to our complete demo guide, the Oracle AI Vector Search End-to-End Demo Guide, to build an end-to-end RAG pipeline with the help of Oracle AI Vector Search.