
How to cache chat model responses

Prerequisites

This guide assumes familiarity with the following concepts:

  • Chat models
  • LLMs

LangChain provides an optional caching layer for chat models. This is useful for two main reasons:

  • It can save you money by reducing the number of API calls you make to the LLM provider, if you are often requesting the same completion multiple times. This is especially useful during application development.
  • It can speed up your application by reducing the number of API calls you make to the LLM provider.

This guide will walk you through how to enable this in your apps.

pip install -qU "langchain[openai]"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")
from langchain_core.globals import set_llm_cache
API Reference: set_llm_cache
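
set_llm_cache installs a cache globally, so every chat model call in the process will consult it. If you later want to turn caching off again, or keep a global cache while opting a single model instance out, a minimal sketch looks like the following (assuming the cache keyword is forwarded to the underlying model by init_chat_model):

# Turn the global cache off again
set_llm_cache(None)

# Keep a global cache, but opt this particular model instance out of it
llm_uncached = init_chat_model("gpt-4o-mini", model_provider="openai", cache=False)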

In-memory cache

This is an ephemeral cache that stores model calls in memory. It will be wiped when your environment restarts, and is not shared across processes.

%%time
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
API Reference: InMemoryCache
CPU times: user 645 ms, sys: 214 ms, total: 859 ms
Wall time: 829 ms
AIMessage(content="Why don't scientists trust atoms?\n\nBecause they make up everything!", response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 11, 'total_tokens': 24}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}, id='run-b6836bdd-8c30-436b-828f-0ac5fc9ab50e-0')
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 822 µs, sys: 288 µs, total: 1.11 ms
Wall time: 1.06 ms
AIMessage(content="Why don't scientists trust atoms?\n\nBecause they make up everything!", response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 11, 'total_tokens': 24}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}, id='run-b6836bdd-8c30-436b-828f-0ac5fc9ab50e-0')

SQLite cache

This cache implementation uses a SQLite database to store responses, and will persist across process restarts.

!rm .langchain.db  # remove any existing cache database so we start from an empty cache
# We can do the same thing with a SQLite cache
from langchain_community.cache import SQLiteCache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))
API Reference: SQLiteCache
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 9.91 ms, sys: 7.68 ms, total: 17.6 ms
Wall time: 657 ms
AIMessage(content='Why did the scarecrow win an award? Because he was outstanding in his field!', response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 11, 'total_tokens': 28}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}, id='run-39d9e1e8-7766-4970-b1d8-f50213fd94c5-0')
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 52.2 ms, sys: 60.5 ms, total: 113 ms
Wall time: 127 ms
AIMessage(content='Why did the scarecrow win an award? Because he was outstanding in his field!', id='run-39d9e1e8-7766-4970-b1d8-f50213fd94c5-0')
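
Because the responses are stored on disk in .langchain.db, they survive process restarts: a brand-new Python process that points a SQLiteCache at the same file gets cache hits for prompts cached earlier. A minimal sketch, assuming the database file from the run above is still present:

# Run in a fresh Python process
from langchain.chat_models import init_chat_model
from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))
llm = init_chat_model("gpt-4o-mini", model_provider="openai")

# Served from the on-disk cache populated by the previous process; no API call is made
llm.invoke("Tell me a joke")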

Next steps

You've now learned how to cache model responses to save time and money.

Next, check out the other how-to guides on chat models in this section, like how to get a model to return structured output or how to create your own custom chat model.

