Llama.cpp
llama-cpp-python is a Python binding for llama.cpp.
It supports inference for many LLMs, which can be accessed on Hugging Face.
This notebook goes over how to run llama-cpp-python within LangChain.
NOTE: new versions of llama-cpp-python use GGUF model files (see here).
This is a breaking change.
To convert an existing GGML model to GGUF, you can run the following in llama.cpp:
python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 --input models/openorca-platypus2-13b.ggmlv3.q4_0.bin --output models/openorca-platypus2-13b.gguf.q4_0.bin
Installation
There are different options on how to install the llama-cpp package:
- CPU usage
- CPU + GPU (using one of many BLAS backends)
- Metal GPU (MacOS with Apple Silicon Chip)
CPU only installation
%pip install --upgrade --quiet llama-cpp-python
使用 OpenBLAS / cuBLAS / CLBlast 安裝
llama.cpp
支援多個 BLAS 後端以加快處理速度。使用 FORCE_CMAKE=1
環境變數強制使用 cmake,並為所需的 BLAS 後端安裝 pip 套件 (來源)。
使用 cuBLAS 後端的安裝範例
!CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python
IMPORTANT: If you have already installed the CPU only version of the package, you need to reinstall it from scratch. Consider the following command:
!CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
Installation with Metal
llama.cpp supports Apple silicon as a first-class citizen, optimized via ARM NEON, Accelerate and Metal frameworks. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package with Metal support (source).
Example installation with Metal support:
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
IMPORTANT: If you have already installed a CPU only version of the package, you need to reinstall it from scratch. Consider the following command:
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
Installation with Windows
It is stable to install the llama-cpp-python library by compiling from source. You can follow most of the instructions in the repository itself, but there are some Windows-specific instructions which might be useful.
Requirements to install llama-cpp-python:
- git
- python
- cmake
- Visual Studio Community (make sure you install this with the following settings)
  - Desktop development with C++
  - Python development
  - Linux embedded development with C++
- Clone the git repository recursively to also get the llama.cpp submodule:
git clone --recursive -j8 https://github.com/abetlen/llama-cpp-python.git
- Open a command prompt and set the following environment variables:
set FORCE_CMAKE=1
set CMAKE_ARGS=-DGGML_CUDA=OFF
If you have an NVIDIA GPU, make sure DGGML_CUDA is set to ON instead.
Compiling and installing
Now you can cd into the llama-cpp-python directory and install the package:
python -m pip install -e .
IMPORTANT: If you have already installed a CPU only version of the package, you need to reinstall it from scratch. Consider the following command:
!python -m pip install -e . --force-reinstall --no-cache-dir
Usage
Make sure you are following all instructions to install all necessary model files.
You don't need an API_TOKEN as you will run the LLM locally.
It is worth understanding which models are suitable to be used on the desired machine.
TheBloke's Hugging Face models have a "Provided files" section that exposes the RAM required to run models of different quantisation sizes and methods (eg: Llama2-7B-Chat-GGUF).
This github issue is also relevant for finding the right model for your machine.
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate
Consider using a template that suits your model! Check the model's page on Hugging Face etc. to get the correct prompting template.
template = """Question: {question}
Answer: Let's work this out in a step by step way to be sure we have the right answer."""
prompt = PromptTemplate.from_template(template)
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
CPU
Example using a LLaMA 2 7B model
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
temperature=0.75,
max_tokens=2000,
top_p=1,
callback_manager=callback_manager,
verbose=True, # Verbose is required to pass to the callback manager
)
question = """
Question: A rap battle between Stephen Colbert and John Oliver
"""
llm.invoke(question)
Stephen Colbert:
Yo, John, I heard you've been talkin' smack about me on your show.
Let me tell you somethin', pal, I'm the king of late-night TV
My satire is sharp as a razor, it cuts deeper than a knife
While you're just a british bloke tryin' to be funny with your accent and your wit.
John Oliver:
Oh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.
My show is the one that people actually watch and listen to, not just for the laughs but for the facts.
While you're busy talkin' trash, I'm out here bringing the truth to light.
Stephen Colbert:
Truth? Ha! You think your show is about truth? Please, it's all just a joke to you.
You're just a fancy-pants british guy tryin' to be funny with your news and your jokes.
While I'm the one who's really makin' a difference, with my sat
``````output
llama_print_timings: load time = 358.60 ms
llama_print_timings: sample time = 172.55 ms / 256 runs ( 0.67 ms per token, 1483.59 tokens per second)
llama_print_timings: prompt eval time = 613.36 ms / 16 tokens ( 38.33 ms per token, 26.09 tokens per second)
llama_print_timings: eval time = 10151.17 ms / 255 runs ( 39.81 ms per token, 25.12 tokens per second)
llama_print_timings: total time = 11332.41 ms
"\nStephen Colbert:\nYo, John, I heard you've been talkin' smack about me on your show.\nLet me tell you somethin', pal, I'm the king of late-night TV\nMy satire is sharp as a razor, it cuts deeper than a knife\nWhile you're just a british bloke tryin' to be funny with your accent and your wit.\nJohn Oliver:\nOh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.\nMy show is the one that people actually watch and listen to, not just for the laughs but for the facts.\nWhile you're busy talkin' trash, I'm out here bringing the truth to light.\nStephen Colbert:\nTruth? Ha! You think your show is about truth? Please, it's all just a joke to you.\nYou're just a fancy-pants british guy tryin' to be funny with your news and your jokes.\nWhile I'm the one who's really makin' a difference, with my sat"
Example using a LLaMA v1 model
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="./ggml-model-q4_0.bin", callback_manager=callback_manager, verbose=True
)
llm_chain = prompt | llm
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.invoke({"question": question})
1. First, find out when Justin Bieber was born.
2. We know that Justin Bieber was born on March 1, 1994.
3. Next, we need to look up when the Super Bowl was played in that year.
4. The Super Bowl was played on January 28, 1995.
5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.
``````output
llama_print_timings: load time = 434.15 ms
llama_print_timings: sample time = 41.81 ms / 121 runs ( 0.35 ms per token)
llama_print_timings: prompt eval time = 2523.78 ms / 48 tokens ( 52.58 ms per token)
llama_print_timings: eval time = 23971.57 ms / 121 runs ( 198.11 ms per token)
llama_print_timings: total time = 28945.95 ms
'\n\n1. First, find out when Justin Bieber was born.\n2. We know that Justin Bieber was born on March 1, 1994.\n3. Next, we need to look up when the Super Bowl was played in that year.\n4. The Super Bowl was played on January 28, 1995.\n5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.'
GPU
If the installation with a BLAS backend was correct, you will see a BLAS = 1 indicator in the model properties.
Two of the most important parameters for use with GPU are:
- n_gpu_layers - determines how many layers of the model are offloaded to your GPU.
- n_batch - how many tokens are processed in parallel.
Setting these parameters correctly will dramatically improve the evaluation speed (see the wrapper code for more details).
n_gpu_layers = -1 # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
callback_manager=callback_manager,
verbose=True, # Verbose is required to pass to the callback manager
)
llm_chain = prompt | llm
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.invoke({"question": question})
1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.
2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.
3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.
So, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl.
``````output
llama_print_timings: load time = 427.63 ms
llama_print_timings: sample time = 115.85 ms / 164 runs ( 0.71 ms per token, 1415.67 tokens per second)
llama_print_timings: prompt eval time = 427.53 ms / 45 tokens ( 9.50 ms per token, 105.26 tokens per second)
llama_print_timings: eval time = 4526.53 ms / 163 runs ( 27.77 ms per token, 36.01 tokens per second)
llama_print_timings: total time = 5293.77 ms
"\n\n1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.\n\n2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.\n\n3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.\n\nSo, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl."
Metal
If the installation with Metal was correct, you will see a NEON = 1 indicator in the model properties.
Two of the most important GPU parameters are:
- n_gpu_layers - determines how many layers of the model are offloaded to your Metal GPU.
- n_batch - how many tokens are processed in parallel; the default is 8, so set it to a bigger number.
- f16_kv - for some reason, Metal only supports True, otherwise you will get errors such as Asserting on type 0 GGML_ASSERT: .../ggml-metal.m:706: false && "not implemented"
Setting these parameters correctly will dramatically improve the evaluation speed (see the wrapper code for more details).
n_gpu_layers = 1 # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls
callback_manager=callback_manager,
verbose=True, # Verbose is required to pass to the callback manager
)
The console log will show the following to indicate that Metal was enabled properly:
ggml_metal_init: allocating
ggml_metal_init: using MPS
...
You can also check Activity Monitor by watching the GPU usage of the process; the CPU usage will drop dramatically after turning on n_gpu_layers=1.
For the first call to the LLM, performance may be slower due to model compilation on the Metal GPU.
Grammars
We can use grammars to constrain model outputs and sample tokens based on the rules defined in them.
To demonstrate this concept, we've included sample grammar files that will be used in the examples below.
Creating gbnf grammar files can be time-consuming, but if you have a use case where output schemas are important, there are two tools that can help:
- An online grammar generator app that converts TypeScript interface definitions into a gbnf file.
- A Python script for converting json schema to a gbnf file. You can, for example, create a pydantic object, generate its JSON schema using the .schema_json() method, and then use this script to convert it to a gbnf file (a rough sketch of this workflow follows the list).
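A minimal sketch of that second approach, assuming pydantic is installed; the Person model and file names are hypothetical, .schema_json() is the pydantic v1-style method mentioned above, and the exact invocation of llama.cpp's conversion script may differ (check its usage).
from pydantic import BaseModel

# Hypothetical schema describing the structure we want the model to produce
class Person(BaseModel):
    name: str
    age: int
    interests: list[str]

# Write the JSON schema to disk ...
with open("person_schema.json", "w") as f:
    f.write(Person.schema_json())

# ... then convert it to a grammar with llama.cpp's json_schema_to_grammar.py script, e.g.:
# python json_schema_to_grammar.py person_schema.json > person.gbnf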
In the first example, supply the path to the specified json.gbnf file in order to produce JSON:
n_gpu_layers = 1 # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls
callback_manager=callback_manager,
verbose=True, # Verbose is required to pass to the callback manager
grammar_path="/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/json.gbnf",
)
%%capture captured --no-stdout
result = llm.invoke("Describe a person in JSON format:")
{
"name": "John Doe",
"age": 34,
"": {
"title": "Software Developer",
"company": "Google"
},
"interests": [
"Sports",
"Music",
"Cooking"
],
"address": {
"street_number": 123,
"street_name": "Oak Street",
"city": "Mountain View",
"state": "California",
"postal_code": 94040
}}
``````output
llama_print_timings: load time = 357.51 ms
llama_print_timings: sample time = 1213.30 ms / 144 runs ( 8.43 ms per token, 118.68 tokens per second)
llama_print_timings: prompt eval time = 356.78 ms / 9 tokens ( 39.64 ms per token, 25.23 tokens per second)
llama_print_timings: eval time = 3947.16 ms / 143 runs ( 27.60 ms per token, 36.23 tokens per second)
llama_print_timings: total time = 5846.21 ms
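Since the grammar constrains generation to valid JSON, the captured output can be parsed directly. A minimal sketch, reusing the result variable from the cell above:
import json

# Parse the grammar-constrained output into a Python dict (hypothetical follow-up step)
person = json.loads(result)
print(person["name"])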
We can also supply list.gbnf to return a list:
n_gpu_layers = 1
n_batch = 512
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls
callback_manager=callback_manager,
verbose=True,
grammar_path="/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/list.gbnf",
)
%%capture captured --no-stdout
result = llm.invoke("List of top-3 my favourite books:")
["The Catcher in the Rye", "Wuthering Heights", "Anna Karenina"]
``````output
llama_print_timings: load time = 322.34 ms
llama_print_timings: sample time = 232.60 ms / 26 runs ( 8.95 ms per token, 111.78 tokens per second)
llama_print_timings: prompt eval time = 321.90 ms / 11 tokens ( 29.26 ms per token, 34.17 tokens per second)
llama_print_timings: eval time = 680.82 ms / 25 runs ( 27.23 ms per token, 36.72 tokens per second)
llama_print_timings: total time = 1295.27 ms