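Identity-enabled RAG using PebbloRetrievalQA
PebbloRetrievalQA is a retrieval chain with identity and semantic enforcement for question answering against a vector database.
This notebook covers how to retrieve documents using identity and semantic enforcement (denied topics and entities). For more details on Pebblo and its SafeRetriever feature, see the Pebblo documentation.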
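Steps:
- Loading documents: We will load documents with authorization and semantic metadata into an in-memory Qdrant vector store. This vector store will be used as the retriever in PebbloRetrievalQA.
Note: On the ingestion side, PebbloSafeLoader is the recommended counterpart for loading documents with authentication and semantic metadata. PebbloSafeLoader loads documents securely and efficiently while preserving the integrity of the metadata.
- Testing enforcement mechanisms: We will test identity enforcement and semantic enforcement separately. For each use case, we will define a specific "ask" function with the required context (auth_context and semantic_context) and then pose our questions.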
Setup
Dependencies
This walkthrough uses an OpenAI LLM, OpenAI embeddings, and a Qdrant vector store.
%pip install --upgrade --quiet langchain langchain_core langchain-community langchain-openai qdrant_client
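Identity-aware data ingestion
Here we use Qdrant as the vector database; however, you can use any of the supported vector databases.
The PebbloRetrievalQA chain supports the following vector databases:
- Qdrant
- Pinecone
- Postgres (using the pgvector extension)
Load the vector database with authorization and semantic information in the metadata
In this step, we capture the authorization and semantic information of the source document in the authorized_identities, pebblo_semantic_topics, and pebblo_semantic_entities fields within the metadata of the VectorDB entry for each chunk.
Note: To use the PebbloRetrievalQA chain, you must always place authorization and semantic metadata in the specified fields. Each of these fields must contain a list of strings.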
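Because the chain depends on these three fields holding lists of strings, it can help to check the metadata before ingestion. The following is a minimal sketch in plain Python, not part of Pebblo or LangChain: the helper name find_metadata_problems is hypothetical, and treating a missing field as a problem is our own (stricter) assumption.

```python
from typing import Any, Dict, List

# Field names taken from the text above.
REQUIRED_LIST_FIELDS = (
    "authorized_identities",
    "pebblo_semantic_topics",
    "pebblo_semantic_entities",
)


def find_metadata_problems(metadata: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the metadata looks usable.

    Hypothetical helper for illustration only.
    """
    problems = []
    for field in REQUIRED_LIST_FIELDS:
        value = metadata.get(field)
        if value is None:
            # Assumption: we flag missing fields, which is stricter than the note.
            problems.append(f"{field}: missing")
        elif not isinstance(value, list):
            problems.append(f"{field}: expected a list, got {type(value).__name__}")
        elif not all(isinstance(item, str) for item in value):
            problems.append(f"{field}: every item must be a string")
    return problems


good = {
    "authorized_identities": ["finance-team", "exec-leadership"],
    "pebblo_semantic_topics": ["financial-report"],
    "pebblo_semantic_entities": ["us-bank-account-number"],
}
# A bare string instead of a list, a non-string item, and a missing field.
bad = {"authorized_identities": "finance-team", "pebblo_semantic_topics": [1]}

print(find_metadata_problems(good))  # []
for problem in find_metadata_problems(bad):
    print(problem)
```

Running a check like this on each chunk's metadata before calling the vector store catches the common mistake of storing a single string where a list is expected.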
from langchain_community.vectorstores.qdrant import Qdrant
from langchain_core.documents import Document
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_openai.llms import OpenAI
llm = OpenAI()
embeddings = OpenAIEmbeddings()
collection_name = "pebblo-identity-and-semantic-rag"
page_content = """
**ACME Corp Financial Report**
**Overview:**
ACME Corp, a leading player in the merger and acquisition industry, presents its financial report for the fiscal year ending December 31, 2020.
Despite a challenging economic landscape, ACME Corp demonstrated robust performance and strategic growth.
**Financial Highlights:**
Revenue soared to $50 million, marking a 15% increase from the previous year, driven by successful deal closures and expansion into new markets.
Net profit reached $12 million, showcasing a healthy margin of 24%.
**Key Metrics:**
Total assets surged to $80 million, reflecting a 20% growth, highlighting ACME Corp's strong financial position and asset base.
Additionally, the company maintained a conservative debt-to-equity ratio of 0.5, ensuring sustainable financial stability.
**Future Outlook:**
ACME Corp remains optimistic about the future, with plans to capitalize on emerging opportunities in the global M&A landscape.
The company is committed to delivering value to shareholders while maintaining ethical business practices.
**Bank Account Details:**
For inquiries or transactions, please refer to ACME Corp's US bank account:
Account Number: 123456789012
Bank Name: Fictitious Bank of America
"""
documents = [
    Document(
        **{
            "page_content": page_content,
            "metadata": {
                "pebblo_semantic_topics": ["financial-report"],
                "pebblo_semantic_entities": ["us-bank-account-number"],
                "authorized_identities": ["finance-team", "exec-leadership"],
                "page": 0,
                "source": "https://drive.google.com/file/d/xxxxxxxxxxxxx/view",
                "title": "ACME Corp Financial Report.pdf",
            },
        }
    )
]
vectordb = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name=collection_name,
)
print("Vectordb loaded.")
Vectordb loaded.
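Retrieval with identity enforcement
The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets supplied as context are retrieved only from documents the user is authorized to access. To achieve this, the Gen-AI application needs to provide an authorization context to the retrieval chain. This auth_context should be filled with the identity and authorization groups of the user accessing the Gen-AI app.
Here is sample code for PebbloRetrievalQA in which user_auth (the list of the user's authorizations, which may include their user ID and the groups they belong to) is taken from the user accessing the RAG application and passed in auth_context.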
from langchain_community.chains import PebbloRetrievalQA
from langchain_community.chains.pebblo_retrieval.models import AuthContext, ChainInput
# Initialize PebbloRetrievalQA chain
qa_chain = PebbloRetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(),
    app_name="pebblo-identity-rag",
    description="Identity Enforcement app using PebbloRetrievalQA",
    owner="ACME Corp",
)
def ask(question: str, auth_context: dict):
    """
    Ask a question to the PebbloRetrievalQA chain
    """
    auth_context_obj = AuthContext(**auth_context) if auth_context else None
    chain_input_obj = ChainInput(query=question, auth_context=auth_context_obj)
    return qa_chain.invoke(chain_input_obj.dict())
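To make the rule concrete, here is a minimal sketch in plain Python of the check that identity enforcement implies. It is an illustration only, not Pebblo's implementation: the helper names is_authorized and filter_by_identity are hypothetical, and the rule "any shared identity grants access" is an assumption consistent with this notebook, where a user holding only finance-team can read a chunk that lists two identities.

```python
from typing import Any, Dict, List


def is_authorized(chunk_metadata: Dict[str, Any], user_auth: List[str]) -> bool:
    """True when the user holds at least one identity listed on the chunk.

    Assumption: a chunk with no authorized_identities is visible to nobody.
    """
    allowed = chunk_metadata.get("authorized_identities") or []
    return bool(set(allowed) & set(user_auth))


def filter_by_identity(
    chunks: List[Dict[str, Any]], user_auth: List[str]
) -> List[Dict[str, Any]]:
    """Keep only the chunks the user may see."""
    return [chunk for chunk in chunks if is_authorized(chunk["metadata"], user_auth)]


# One chunk, tagged like the document ingested earlier in this notebook.
chunks = [
    {
        "page_content": "ACME Corp Financial Report",
        "metadata": {"authorized_identities": ["finance-team", "exec-leadership"]},
    }
]

print(len(filter_by_identity(chunks, ["finance-team"])))  # 1
print(len(filter_by_identity(chunks, ["eng-support"])))  # 0
```

When no chunk survives the filter, the LLM receives no context, which is why an unauthorized user gets a response such as "I don't know" rather than an error.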
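1. Questions by an authorized user
We ingested the data with the authorized identities ["finance-team", "exec-leadership"], so a user with the authorized identity/group finance-team should receive the correct answer.
auth = {
    "user_id": "finance-user@acme.org",
    "user_auth": [
        "finance-team",
    ],
}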
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, auth)
print(f"Question: {question}\n\nAnswer: {resp['result']}")
Question: Share the financial performance of ACME Corp for the year 2020
Answer:
Revenue: $50 million (15% increase from previous year)
Net profit: $12 million (24% margin)
Total assets: $80 million (20% growth)
Debt-to-equity ratio: 0.5
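2. Questions by an unauthorized user
Since the user's authorized identity/group eng-support is not among the authorized identities ["finance-team", "exec-leadership"], we should not receive an answer.
auth = {
    "user_id": "eng-user@acme.org",
    "user_auth": [
        "eng-support",
    ],
}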
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, auth)
print(f"Question: {question}\n\nAnswer: {resp['result']}")
Question: Share the financial performance of ACME Corp for the year 2020
Answer: I don't know.
3. Using PromptTemplate to provide additional instructions
You can use PromptTemplate to give the LLM additional instructions for generating a custom response.
from langchain_core.prompts import PromptTemplate
prompt_template = PromptTemplate.from_template(
"""
Answer the question using the provided context.
If no context is provided, just say "I'm sorry, but that information is unavailable, or Access to it is restricted.".
Question: {question}
"""
)
question = "Share the financial performance of ACME Corp for the year 2020"
prompt = prompt_template.format(question=question)
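To see what text actually reaches the chain, the sketch below reproduces the formatting step with plain str.format. This is a stand-in we use for illustration, on the assumption that a simple single-variable template like this one yields the same text as PromptTemplate.format; the constant name TEMPLATE is ours.

```python
# Same template text as above, held in a plain string (stand-in for PromptTemplate).
TEMPLATE = """
Answer the question using the provided context.
If no context is provided, just say "I'm sorry, but that information is unavailable, or Access to it is restricted.".
Question: {question}
"""

question = "Share the financial performance of ACME Corp for the year 2020"

# Substitute the placeholder; the result is one string holding instructions plus question.
prompt = TEMPLATE.format(question=question)
print(prompt)
```

Note that the ask function receives this whole string as its query, so the instructions travel together with the question rather than as a separate system prompt.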
3.1 Questions by an authorized user
auth = {
    "user_id": "finance-user@acme.org",
    "user_auth": [
        "finance-team",
    ],
}
resp = ask(prompt, auth)
print(f"Question: {question}\n\nAnswer: {resp['result']}")
Question: Share the financial performance of ACME Corp for the year 2020
Answer:
Revenue soared to $50 million, marking a 15% increase from the previous year, and net profit reached $12 million, showcasing a healthy margin of 24%. Total assets also grew by 20% to $80 million, and the company maintained a conservative debt-to-equity ratio of 0.5.
3.2 Questions by an unauthorized user
auth = {
    "user_id": "eng-user@acme.org",
    "user_auth": [
        "eng-support",
    ],
}
resp = ask(prompt, auth)
print(f"Question: {question}\n\nAnswer: {resp['result']}")
Question: Share the financial performance of ACME Corp for the year 2020
Answer:
I'm sorry, but that information is unavailable, or Access to it is restricted.
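Retrieval with semantic enforcement
The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets used as context are retrieved only from documents that comply with the provided semantic context. To achieve this, the Gen-AI application must provide a semantic context to the retrieval chain. This semantic_context should list the topics and entities to be denied for the user accessing the Gen-AI app.
Below is sample code for PebbloRetrievalQA with topics_to_deny and entities_to_deny. These are passed in the semantic_context of the chain input.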
from typing import List, Optional
from langchain_community.chains import PebbloRetrievalQA
from langchain_community.chains.pebblo_retrieval.models import (
    ChainInput,
    SemanticContext,
)
# Initialize PebbloRetrievalQA chain
qa_chain = PebbloRetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(),
    app_name="pebblo-semantic-rag",
    description="Semantic Enforcement app using PebbloRetrievalQA",
    owner="ACME Corp",
)
def ask(
    question: str,
    topics_to_deny: Optional[List[str]] = None,
    entities_to_deny: Optional[List[str]] = None,
):
    """
    Ask a question to the PebbloRetrievalQA chain
    """
    semantic_context = dict()
    if topics_to_deny:
        semantic_context["pebblo_semantic_topics"] = {"deny": topics_to_deny}
    if entities_to_deny:
        semantic_context["pebblo_semantic_entities"] = {"deny": entities_to_deny}
    semantic_context_obj = (
        SemanticContext(**semantic_context) if semantic_context else None
    )
    chain_input_obj = ChainInput(query=question, semantic_context=semantic_context_obj)
    return qa_chain.invoke(chain_input_obj.dict())
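The deny rule can be pictured with a small plain-Python sketch. It only illustrates the intended outcome and is not how Pebblo implements it: the helper name is_denied is hypothetical, and we assume a chunk is excluded as soon as any one of its topics or entities appears in the corresponding deny list.

```python
from typing import Any, Dict, List, Optional


def is_denied(
    chunk_metadata: Dict[str, Any],
    topics_to_deny: Optional[List[str]] = None,
    entities_to_deny: Optional[List[str]] = None,
) -> bool:
    """True when the chunk carries at least one denied topic or denied entity."""
    topics = set(chunk_metadata.get("pebblo_semantic_topics") or [])
    entities = set(chunk_metadata.get("pebblo_semantic_entities") or [])
    denied_topic = bool(topics & set(topics_to_deny or []))
    denied_entity = bool(entities & set(entities_to_deny or []))
    return denied_topic or denied_entity


# Semantic labels of the chunk ingested earlier in this notebook.
metadata = {
    "pebblo_semantic_topics": ["financial-report"],
    "pebblo_semantic_entities": ["us-bank-account-number"],
}

print(is_denied(metadata))  # False
print(is_denied(metadata, topics_to_deny=["financial-report"]))  # True
print(is_denied(metadata, entities_to_deny=["us-bank-account-number"]))  # True
```

The three calls mirror the three cases tested next: no enforcement, a denied topic, and a denied entity.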
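1. Without semantic enforcement
Since no semantic enforcement is applied, the system should return the answer without excluding any context because of the semantic labels attached to it.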
topic_to_deny = []
entities_to_deny = []
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)
print(
    f"Topics to deny: {topic_to_deny}\nEntities to deny: {entities_to_deny}\n"
    f"Question: {question}\nAnswer: {resp['result']}"
)
Topics to deny: []
Entities to deny: []
Question: Share the financial performance of ACME Corp for the year 2020
Answer:
Revenue for ACME Corp increased by 15% to $50 million in 2020, with a net profit of $12 million and a strong asset base of $80 million. The company also maintained a conservative debt-to-equity ratio of 0.5.
2. Deny the financial-report topic
The data was ingested with the topics ["financial-report"]. Therefore, an app that denies the financial-report topic should not receive an answer.
topic_to_deny = ["financial-report"]
entities_to_deny = []
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)
print(
    f"Topics to deny: {topic_to_deny}\nEntities to deny: {entities_to_deny}\n"
    f"Question: {question}\nAnswer: {resp['result']}"
)
Topics to deny: ['financial-report']
Entities to deny: []
Question: Share the financial performance of ACME Corp for the year 2020
Answer:
Unfortunately, I do not have access to the financial performance of ACME Corp for the year 2020.
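3. Deny the us-bank-account-number entity
The only ingested chunk is tagged with the entity us-bank-account-number. Since that entity is denied, the system should not return an answer.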
topic_to_deny = []
entities_to_deny = ["us-bank-account-number"]
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)
print(
    f"Topics to deny: {topic_to_deny}\nEntities to deny: {entities_to_deny}\n"
    f"Question: {question}\nAnswer: {resp['result']}"
)
Topics to deny: []
Entities to deny: ['us-bank-account-number']
Question: Share the financial performance of ACME Corp for the year 2020
Answer: I don't have information about ACME Corp's financial performance for 2020.