Oracle AI Vector Search: Document Processing

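Oracle AI Vector Search is designed for artificial intelligence (AI) workloads and lets you query data based on semantics rather than keywords. One of its biggest benefits is that semantic search on unstructured data can be combined with relational search on business data in a single system. This is not only powerful but also more efficient, because you do not need to add a specialized vector database, which eliminates the pain of data fragmentation across multiple systems.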

In addition, your vectors can benefit from all of Oracle Database's most powerful features.

This guide demonstrates how to use the document processing capabilities of Oracle AI Vector Search, using OracleDocLoader to load documents and OracleTextSplitter to chunk them.

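If you are just starting with Oracle Database, consider exploring the free Oracle 23 AI, which provides a good introduction to setting up a database environment. When working with the database, it is generally advisable to avoid using the system user by default; instead, create your own user for better security and customization. For detailed steps on user creation, see our end-to-end guide, which also shows how to set up a user in Oracle. Understanding user privileges is also essential for managing database security effectively. You can learn more about this topic in the official Oracle guide on managing user accounts and security.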

Prerequisites

Install the Oracle Python client driver, python-oracledb, to use LangChain with Oracle AI Vector Search.

# pip install oracledb

Connect to Oracle Database

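The following sample code shows how to connect to Oracle Database. By default, python-oracledb runs in "Thin" mode, which connects directly to Oracle Database and does not need the Oracle Client libraries. Some additional functionality is available when python-oracledb uses those libraries; in that case it is said to be in "Thick" mode. Both modes offer comprehensive functionality and support the Python Database API v2.0 specification. See the python-oracledb guide that discusses the features supported in each mode. If you are unable to use Thin mode, you may need to switch to Thick mode.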

import sys

import oracledb

# please update with your username, password, hostname and service_name
username = "<username>"
password = "<password>"
dsn = "<hostname>/<service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")
except Exception as e:
    # Include the driver's error message so the failure can be diagnosed
    print(f"Connection failed: {e}")
    sys.exit(1)
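If Thin mode is not sufficient, the switch to Thick mode might look like the sketch below. The helper names `build_dsn` and `connect`, and the `use_thick_mode` and `lib_dir` parameters, are illustrative assumptions and not part of this guide; `oracledb.init_oracle_client()` is the python-oracledb call that enables Thick mode, and it has to run before the first connection is created. Port 1521 is assumed because it is the usual default listener port.

```python
def build_dsn(hostname, service_name, port=1521):
    """Build an Easy Connect string of the form host:port/service_name.

    Hypothetical helper; 1521 is assumed as the default listener port.
    """
    return f"{hostname}:{port}/{service_name}"


def connect(username, password, dsn, use_thick_mode=False, lib_dir=None):
    """Connect in Thin mode by default, or enable Thick mode first.

    Hypothetical helper. lib_dir is the Oracle Client library directory;
    leaving it as None lets the driver use the system library search path.
    """
    # Imported here so that this sketch can be defined even where the
    # driver is not installed yet.
    import oracledb

    if use_thick_mode:
        # Must be called before any connection is created in this process.
        oracledb.init_oracle_client(lib_dir=lib_dir)
    return oracledb.connect(user=username, password=password, dsn=dsn)


# Example: the DSN for a placeholder host and service name
print(build_dsn("<hostname>", "<service_name>"))
```

Calling `connect(username, password, dsn, use_thick_mode=True)` would then load the Oracle Client libraries before connecting, assuming they are installed on the machine.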

Now, let's create a table and insert some sample documents to test with.

try:
    cursor = conn.cursor()

    drop_table_sql = """drop table if exists demo_tab"""
    cursor.execute(drop_table_sql)

    create_table_sql = """create table demo_tab (id number, data clob)"""
    cursor.execute(create_table_sql)

    insert_row_sql = """insert into demo_tab values (:1, :2)"""
    rows_to_insert = [
        (
            1,
            "If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.",
        ),
        (
            2,
            "A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.",
        ),
        (
            3,
            "The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.",
        ),
    ]
    cursor.executemany(insert_row_sql, rows_to_insert)

    conn.commit()

    print("Table created and populated.")
    cursor.close()
except Exception as e:
    # Include the driver's error message so the failure can be diagnosed
    print(f"Table creation failed: {e}")
    cursor.close()
    conn.close()
    sys.exit(1)
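Before moving on, it can help to confirm that the rows were actually committed. The following is a minimal sketch of such a check; `count_rows` is a hypothetical helper, not part of python-oracledb or LangChain. It uses only the cursor methods defined by the Python Database API v2.0, so it should work with the `conn` object created earlier.

```python
def count_rows(conn, table_name="demo_tab"):
    """Return the number of rows in table_name.

    Hypothetical helper that works with any DB API 2.0 connection.
    """
    # Table names cannot be passed as bind variables, so only plain
    # identifiers are accepted before the name is placed into the SQL text.
    if not table_name.isidentifier():
        raise ValueError(f"Unexpected table name: {table_name!r}")

    cursor = conn.cursor()
    try:
        cursor.execute(f"select count(*) from {table_name}")
        (count,) = cursor.fetchone()
        return count
    finally:
        # Release the cursor whether or not the query succeeded
        cursor.close()
```

With the sample data above, `count_rows(conn)` is expected to return the number of inserted rows, which is three.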

Load Documents

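Users can load documents from Oracle Database, from a file system, or from both, by setting the loader parameters appropriately. For full details on these parameters, see the Oracle AI Vector Search guide.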

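A significant advantage of OracleDocLoader is that it can process more than 150 different file formats, which removes the need for separate loaders for different document types. For the complete list of supported formats, see Oracle Text Supported Document Formats.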

The following sample code snippet shows how to use OracleDocLoader:

from langchain_community.document_loaders.oracleai import OracleDocLoader
from langchain_core.documents import Document

"""
# loading a local file
loader_params = {}
loader_params["file"] = "<file>"

# loading from a local directory
loader_params = {}
loader_params["dir"] = "<directory>"
"""

# loading from Oracle Database table
loader_params = {
    "owner": "<owner>",
    "tablename": "demo_tab",
    "colname": "data",
}

""" load the docs """
loader = OracleDocLoader(conn=conn, params=loader_params)
docs = loader.load()

""" verify """
print(f"Number of docs loaded: {len(docs)}")
# print(f"Document-0: {docs[0].page_content}") # content

API Reference: OracleDocLoader | Document
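The three parameter shapes used above (a single file, a directory, or a database table column) can be wrapped in a small builder so that an incomplete dictionary is caught before it reaches the loader. This is a sketch only: `make_loader_params` is a hypothetical helper, and its rule of one source per dictionary is a choice made here for clarity, not a documented constraint of OracleDocLoader. The dictionary keys are the ones shown in the example above.

```python
def make_loader_params(file=None, directory=None, table=None):
    """Build a params dict for exactly one document source.

    Hypothetical helper. table is an (owner, tablename, colname) tuple.
    """
    given = [name for name, value in
             (("file", file), ("directory", directory), ("table", table))
             if value is not None]
    if len(given) != 1:
        raise ValueError(
            f"Specify exactly one of file, directory, or table; got {given}"
        )

    if file is not None:
        return {"file": file}
    if directory is not None:
        return {"dir": directory}

    # A table source needs all three parts to identify the column to read
    owner, tablename, colname = table
    return {"owner": owner, "tablename": tablename, "colname": colname}


# Example: parameters for the demo table created earlier
print(make_loader_params(table=("<owner>", "demo_tab", "data")))
```

To load from both the database and the file system under this sketch, one option is to build one dictionary per source, run a loader for each, and concatenate the resulting document lists.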

Split Documents

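Documents vary in size, from small to very large. Users often prefer to chunk documents into smaller pieces so that embeddings are easier to generate. A wide range of customization options is available for this splitting process. For full details on these parameters, see the Oracle AI Vector Search guide.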

The following sample code shows how to implement this:

from langchain_community.document_loaders.oracleai import OracleTextSplitter
from langchain_core.documents import Document

"""
# Some examples
# split by chars, max 500 chars
splitter_params = {"split": "chars", "max": 500, "normalize": "all"}

# split by words, max 100 words
splitter_params = {"split": "words", "max": 100, "normalize": "all"}

# split by sentence, max 20 sentences
splitter_params = {"split": "sentence", "max": 20, "normalize": "all"}
"""

# split by default parameters
splitter_params = {"normalize": "all"}

# get the splitter instance
splitter = OracleTextSplitter(conn=conn, params=splitter_params)

list_chunks = []
for doc in docs:
    chunks = splitter.split_text(doc.page_content)
    list_chunks.extend(chunks)

""" verify """
print(f"Number of Chunks: {len(list_chunks)}")
# print(f"Chunk-0: {list_chunks[0]}") # content

API Reference: OracleTextSplitter | Document
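The loop above flattens all chunks into one list, so the link between a chunk and its source document is lost. If that link matters later, for example when citing sources in a RAG pipeline, one possible approach is to record the indexes alongside each chunk. The sketch below is an assumption-laden illustration: `chunk_with_provenance` and the record keys are hypothetical names, and the stand-in splitter at the bottom only splits on newlines so that the example runs without a database.

```python
def chunk_with_provenance(texts, split_fn):
    """Split each text with split_fn and remember where each chunk came from.

    Hypothetical helper. split_fn is any callable that takes a string and
    returns a list of strings, such as the split_text method of a splitter.
    """
    records = []
    for doc_index, text in enumerate(texts):
        for chunk_index, chunk in enumerate(split_fn(text)):
            records.append(
                {"doc_index": doc_index, "chunk_index": chunk_index, "text": chunk}
            )
    return records


# Stand-in splitter for demonstration only; it is NOT how OracleTextSplitter chunks text
sample_texts = ["first line\nsecond line", "only line"]
for record in chunk_with_provenance(sample_texts, str.splitlines):
    print(record)

# With the objects from this guide, the call would be along these lines:
# records = chunk_with_provenance([d.page_content for d in docs], splitter.split_text)
```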

End-to-End Demo

See our complete demo guide, Oracle AI Vector Search End-to-End Demo Guide, to build an end-to-end RAG pipeline with the help of Oracle AI Vector Search.

Related

Document loader conceptual guide
Document loader how-to guides
