
Confident

DeepEval is a package for unit testing LLMs. Using Confident, everyone can build robust language models through faster iterations with both unit testing and integration testing. We provide support for each step in the iteration, from synthetic data creation to testing.

In this guide we will demonstrate how to test and measure LLM performance. We show how you can use our callback to measure performance and how you can define your own metrics and log them into our dashboard.

DeepEval also offers:

  • How to generate synthetic data
  • How to measure performance
  • A dashboard to monitor and review results over time

Installation and Setup

%pip install --upgrade --quiet  langchain langchain-openai langchain-community deepeval langchain-chroma

Getting API Credentials

To get the DeepEval API credentials, follow these steps:

  1. Go to https://app.confident-ai.com
  2. Click on "Organization"
  3. Copy the API Key.

When you log in, you will also be asked to set an implementation name. The implementation name is required and describes the type of implementation. (Think of what you want to call your project. We recommend making it descriptive.)

!deepeval login

Setting Up DeepEval

By default, you can use the DeepEvalCallbackHandler to set up the metrics you want to track. However, it has limited support for metrics at the moment (more to be added soon). It currently supports:

from deepeval.metrics.answer_relevancy import AnswerRelevancy

# Here we want to make sure the answer is minimally relevant
answer_relevancy_metric = AnswerRelevancy(minimum_score=0.5)
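
Before wiring the metric into a callback, it can help to sanity-check it on its own. The snippet below is only a sketch with made-up question/answer strings; it reuses the same measure() and is_successful() calls shown later in this guide, and exact behavior may vary between DeepEval versions.

# Standalone sanity check of the metric (example strings are made up)
sample_query = "What is the capital of France?"
sample_answer = "Paris is the capital of France."

answer_relevancy_metric.measure(sample_answer, sample_query)
print(answer_relevancy_metric.is_successful())  # True once the score clears minimum_score=0.5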

Get Started

To use the DeepEvalCallbackHandler, we need the implementation_name.

from langchain_community.callbacks.confident_callback import DeepEvalCallbackHandler

deepeval_callback = DeepEvalCallbackHandler(
    implementation_name="langchainQuickstart", metrics=[answer_relevancy_metric]
)
API Reference: DeepEvalCallbackHandler

Scenario 1: Feeding into an LLM

You can then feed it into your LLM, for example OpenAI.

from langchain_openai import OpenAI

llm = OpenAI(
    temperature=0,
    callbacks=[deepeval_callback],
    verbose=True,
    openai_api_key="<YOUR_API_KEY>",
)
output = llm.generate(
    [
        "What is the best evaluation tool out there? (no bias at all)",
    ]
)
API Reference: OpenAI
LLMResult(generations=[[Generation(text='\n\nQ: What did the fish say when he hit the wall? \nA: Dam.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nThe Moon \n\nThe moon is high in the midnight sky,\nSparkling like a star above.\nThe night so peaceful, so serene,\nFilling up the air with love.\n\nEver changing and renewing,\nA never-ending light of grace.\nThe moon remains a constant view,\nA reminder of life’s gentle pace.\n\nThrough time and space it guides us on,\nA never-fading beacon of hope.\nThe moon shines down on us all,\nAs it continues to rise and elope.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nQ. What did one magnet say to the other magnet?\nA. "I find you very attractive!"', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text="\n\nThe world is charged with the grandeur of God.\nIt will flame out, like shining from shook foil;\nIt gathers to a greatness, like the ooze of oil\nCrushed. Why do men then now not reck his rod?\n\nGenerations have trod, have trod, have trod;\nAnd all is seared with trade; bleared, smeared with toil;\nAnd wears man's smudge and shares man's smell: the soil\nIs bare now, nor can foot feel, being shod.\n\nAnd for all this, nature is never spent;\nThere lives the dearest freshness deep down things;\nAnd though the last lights off the black West went\nOh, morning, at the brown brink eastward, springs —\n\nBecause the Holy Ghost over the bent\nWorld broods with warm breast and with ah! bright wings.\n\n~Gerard Manley Hopkins", generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nQ: What did one ocean say to the other ocean?\nA: Nothing, they just waved.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text="\n\nA poem for you\n\nOn a field of green\n\nThe sky so blue\n\nA gentle breeze, the sun above\n\nA beautiful world, for us to love\n\nLife is a journey, full of surprise\n\nFull of joy and full of surprise\n\nBe brave and take small steps\n\nThe future will be revealed with depth\n\nIn the morning, when dawn arrives\n\nA fresh start, no reason to hide\n\nSomewhere down the road, there's a heart that beats\n\nBelieve in yourself, you'll always succeed.", generation_info={'finish_reason': 'stop', 'logprobs': None})]], llm_output={'token_usage': {'completion_tokens': 504, 'total_tokens': 528, 'prompt_tokens': 24}, 'model_name': 'text-davinci-003'})

You can then check whether the metric was successful by calling the is_successful() method.

answer_relevancy_metric.is_successful()
# returns True/False
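
Because is_successful() returns a plain boolean, it also slots naturally into an ordinary unit test. A minimal pytest-style sketch (the test function name is hypothetical, and it assumes the metric has already been populated by a run):

# Hypothetical pytest-style test around the metric result
def test_llm_answer_relevancy():
    assert answer_relevancy_metric.is_successful()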

Once the run has finished, you should be able to see our dashboard below.

Dashboard

Scenario 2: Tracking an LLM in a chain without callbacks

To track an LLM in a chain without callbacks, you can plug in at the end.

We can start by defining a simple chain as shown below.

import requests
from langchain.chains import RetrievalQA
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

text_file_url = "https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt"

openai_api_key = "sk-XXX"

with open("state_of_the_union.txt", "w") as f:
    response = requests.get(text_file_url)
    f.write(response.text)

loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=openai_api_key),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)

# Providing a new question-answering pipeline
query = "Who is the president?"
result = qa.run(query)

After defining the chain, you can then manually check the answer relevancy.

answer_relevancy_metric.measure(result, query)
answer_relevancy_metric.is_successful()
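
The same manual check can be repeated over a small batch of queries. The sketch below reuses the qa chain and metric defined above; the extra questions are only illustrative.

# Repeat the manual check over a few queries (example questions are made up)
queries = [
    "Who is the president?",
    "What did the president say about the economy?",
]
for q in queries:
    answer = qa.run(q)
    answer_relevancy_metric.measure(answer, q)
    print(q, "->", answer_relevancy_metric.is_successful())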

What's next?

You can create your own custom metrics here.
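
The exact base class and registration hooks depend on your DeepEval version, so treat the sketch below as an illustration only: it mirrors the measure()/is_successful() interface used throughout this page with a plain Python class and a hypothetical keyword-overlap score, not DeepEval's actual API.

# Illustration only: NOT DeepEval's real base class. It just mirrors the
# measure()/is_successful() interface used on this page.
class KeywordOverlapMetric:
    def __init__(self, minimum_score: float = 0.5):
        self.minimum_score = minimum_score
        self.score = 0.0

    def measure(self, output: str, query: str) -> float:
        # Hypothetical scoring: fraction of query words that also appear in the output.
        query_words = set(query.lower().split())
        output_words = set(output.lower().split())
        self.score = len(query_words & output_words) / max(len(query_words), 1)
        return self.score

    def is_successful(self) -> bool:
        return self.score >= self.minimum_score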

DeepEval also offers other features, such as the ability to automatically create unit tests and hallucination tests.

If you are interested, check out our GitHub repository: https://github.com/confident-ai/deepeval. We welcome any PRs and discussions on how to improve LLM performance.

