Llama.cpp
llama-cpp-python is a Python binding for llama.cpp.
It supports inference for many LLMs, which can be accessed on Hugging Face.
This notebook goes over how to run llama-cpp-python within LangChain.
Note: new versions of llama-cpp-python use GGUF model files (see here). This is a breaking change.
To convert existing GGML models to GGUF, you can run the following in llama.cpp:
python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 --input models/openorca-platypus2-13b.ggmlv3.q4_0.bin --output models/openorca-platypus2-13b.gguf.q4_0.bin
Installation
There are different options for installing the llama-cpp package:
- CPU usage
- CPU + GPU (using one of many BLAS backends)
- Metal GPU (macOS with Apple Silicon chip)
CPU only installation
%pip install --upgrade --quiet llama-cpp-python
Installation with OpenBLAS / cuBLAS / CLBlast
llama.cpp supports multiple BLAS backends for faster processing. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package for the desired BLAS backend (source).
Example installation with cuBLAS backend:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
IMPORTANT: If you have already installed the CPU-only version of the package, you need to reinstall it from scratch. Consider the following command:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
Installation with Metal
llama.cpp supports Apple Silicon as a first-class citizen, optimized via the ARM NEON, Accelerate and Metal frameworks. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package with Metal support (source).
Example installation with Metal support:
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
IMPORTANT: If you have already installed the CPU-only version of the package, you need to reinstall it from scratch. Consider the following command:
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
Installation with Windows
It is stable to install the llama-cpp-python library by compiling from source. You can follow most of the instructions in the repository itself, but there are some Windows-specific instructions which might be useful.
Requirements to install llama-cpp-python:
- git
- python
- cmake
- Visual Studio Community (make sure you install this with the following settings)
  - Desktop development with C++
  - Python development
  - Linux embedded development with C++
- Clone the git repository recursively to also get the llama.cpp submodule
git clone --recursive -j8 https://github.com/abetlen/llama-cpp-python.git
- Open up a command prompt and set the following environment variables.
set FORCE_CMAKE=1
set CMAKE_ARGS=-DLLAMA_CUBLAS=OFF
If you have an NVIDIA GPU, make sure DLLAMA_CUBLAS is set to ON.
Compiling and installing
Now you can cd into the llama-cpp-python directory and install the package:
python -m pip install -e .
IMPORTANT: If you have already installed the CPU-only version of the package, you need to reinstall it from scratch. Consider the following command:
!python -m pip install -e . --force-reinstall --no-cache-dir
Usage
Make sure you are following all instructions to install all necessary model files.
You don't need an API_TOKEN as you will run the LLM locally.
It is worth understanding which models are suitable to be used on the desired machine.
TheBloke's Hugging Face models have a Provided files section that exposes the RAM required to run models of different quantization sizes and methods (e.g. Llama2-7B-Chat-GGUF).
This github issue is also relevant for finding the right model for your machine.
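Once you have chosen a model, you need the GGUF file available locally. One way to fetch it is with the huggingface_hub client; this is not part of the original instructions, and the repo and filename below are purely illustrative, so pick a quantization that fits your RAM. A minimal sketch:
# Hypothetical download helper (assumption: huggingface_hub is installed and the
# illustrative repo/filename exist; substitute the model and quantization you chose).
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",  # illustrative repo
    filename="llama-2-7b-chat.Q4_K_M.gguf",  # illustrative quantization
)
print(model_path)  # pass this path as model_path to LlamaCpp below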
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate
Consider using a template that suits your model! Check the model's page on Hugging Face etc. to get the correct prompting template.
template = """Question: {question}
Answer: Let's work this out in a step by step way to be sure we have the right answer."""
prompt = PromptTemplate.from_template(template)
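For example, Llama 2 chat models generally expect the prompt to be wrapped in [INST] ... [/INST] tags with an optional system block. The template below is only a sketch of that format and is not from the original notebook, so verify it against the model card you are actually using:
from langchain_core.prompts import PromptTemplate

# Sketch of a Llama-2-chat style prompt (assumption: the model follows the [INST]
# format; check the model card for the exact template).
llama2_chat_template = """<s>[INST] <<SYS>>
You are a helpful assistant. Think step by step before answering.
<</SYS>>

{question} [/INST]"""

llama2_chat_prompt = PromptTemplate.from_template(llama2_chat_template)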
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
CPU
Example using a LLaMA 2 7B model
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
temperature=0.75,
max_tokens=2000,
top_p=1,
callback_manager=callback_manager,
verbose=True, # Verbose is required to pass to the callback manager
)
question = """
Question: A rap battle between Stephen Colbert and John Oliver
"""
llm.invoke(question)
Stephen Colbert:
Yo, John, I heard you've been talkin' smack about me on your show.
Let me tell you somethin', pal, I'm the king of late-night TV
My satire is sharp as a razor, it cuts deeper than a knife
While you're just a british bloke tryin' to be funny with your accent and your wit.
John Oliver:
Oh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.
My show is the one that people actually watch and listen to, not just for the laughs but for the facts.
While you're busy talkin' trash, I'm out here bringing the truth to light.
Stephen Colbert:
Truth? Ha! You think your show is about truth? Please, it's all just a joke to you.
You're just a fancy-pants british guy tryin' to be funny with your news and your jokes.
While I'm the one who's really makin' a difference, with my sat
``````output
llama_print_timings: load time = 358.60 ms
llama_print_timings: sample time = 172.55 ms / 256 runs ( 0.67 ms per token, 1483.59 tokens per second)
llama_print_timings: prompt eval time = 613.36 ms / 16 tokens ( 38.33 ms per token, 26.09 tokens per second)
llama_print_timings: eval time = 10151.17 ms / 255 runs ( 39.81 ms per token, 25.12 tokens per second)
llama_print_timings: total time = 11332.41 ms
"\nStephen Colbert:\nYo, John, I heard you've been talkin' smack about me on your show.\nLet me tell you somethin', pal, I'm the king of late-night TV\nMy satire is sharp as a razor, it cuts deeper than a knife\nWhile you're just a british bloke tryin' to be funny with your accent and your wit.\nJohn Oliver:\nOh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.\nMy show is the one that people actually watch and listen to, not just for the laughs but for the facts.\nWhile you're busy talkin' trash, I'm out here bringing the truth to light.\nStephen Colbert:\nTruth? Ha! You think your show is about truth? Please, it's all just a joke to you.\nYou're just a fancy-pants british guy tryin' to be funny with your news and your jokes.\nWhile I'm the one who's really makin' a difference, with my sat"
Example using a LLaMA v1 model
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="./ggml-model-q4_0.bin", callback_manager=callback_manager, verbose=True
)
llm_chain = prompt | llm
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.invoke({"question": question})
1. First, find out when Justin Bieber was born.
2. We know that Justin Bieber was born on March 1, 1994.
3. Next, we need to look up when the Super Bowl was played in that year.
4. The Super Bowl was played on January 28, 1995.
5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.
``````output
llama_print_timings: load time = 434.15 ms
llama_print_timings: sample time = 41.81 ms / 121 runs ( 0.35 ms per token)
llama_print_timings: prompt eval time = 2523.78 ms / 48 tokens ( 52.58 ms per token)
llama_print_timings: eval time = 23971.57 ms / 121 runs ( 198.11 ms per token)
llama_print_timings: total time = 28945.95 ms
'\n\n1. First, find out when Justin Bieber was born.\n2. We know that Justin Bieber was born on March 1, 1994.\n3. Next, we need to look up when the Super Bowl was played in that year.\n4. The Super Bowl was played on January 28, 1995.\n5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.'
GPU
If the installation with a BLAS backend was correct, you will see a BLAS = 1 indicator in the model properties.
Two of the most important parameters for use with a GPU are:
- n_gpu_layers - determines how many layers of the model are offloaded to your GPU.
- n_batch - how many tokens are processed in parallel.
Setting these parameters correctly will dramatically improve the evaluation speed (see the wrapper code for more details).
n_gpu_layers = -1 # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
callback_manager=callback_manager,
verbose=True, # Verbose is required to pass to the callback manager
)
llm_chain = prompt | llm
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.invoke({"question": question})
1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.
2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.
3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.
So, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl.
``````output
llama_print_timings: load time = 427.63 ms
llama_print_timings: sample time = 115.85 ms / 164 runs ( 0.71 ms per token, 1415.67 tokens per second)
llama_print_timings: prompt eval time = 427.53 ms / 45 tokens ( 9.50 ms per token, 105.26 tokens per second)
llama_print_timings: eval time = 4526.53 ms / 163 runs ( 27.77 ms per token, 36.01 tokens per second)
llama_print_timings: total time = 5293.77 ms
"\n\n1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.\n\n2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.\n\n3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.\n\nSo, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl."
Metal
If the installation with Metal was correct, you will see a NEON = 1 indicator in the model properties.
Two of the most important GPU parameters are:
- n_gpu_layers - determines how many layers of the model are offloaded to your Metal GPU.
- n_batch - how many tokens are processed in parallel; the default is 8, so set it to a bigger number.
- f16_kv - for some reason, Metal only supports True, otherwise you will get an error such as Asserting on type 0 GGML_ASSERT: .../ggml-metal.m:706: false && "not implemented"
Setting these parameters correctly will dramatically improve the evaluation speed (see the wrapper code for more details).
n_gpu_layers = 1 # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls
callback_manager=callback_manager,
verbose=True, # Verbose is required to pass to the callback manager
)
The console log will show the following to indicate that Metal was enabled properly.
ggml_metal_init: allocating
ggml_metal_init: using MPS
...
You can also check Activity Monitor by watching the GPU usage of the process; the CPU usage will drop dramatically after turning on n_gpu_layers=1.
For the first call to the LLM, performance may be slower due to model compilation on the Metal GPU.
Grammars
We can use grammars to constrain model outputs and sample tokens based on the rules defined in them.
To demonstrate this concept, we've included sample grammar files that will be used in the examples below.
Creating gbnf grammar files can be time-consuming, but if you have a use case where output schemas are important, there are two tools that can help:
- An online grammar generator app that converts TypeScript interface definitions to a gbnf file.
- A Python script for converting a JSON schema to a gbnf file. For example, you can create a pydantic object, generate its JSON schema using its .schema_json() method, and then use the script to convert it to a gbnf file (see the short sketch after this list).
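As a minimal sketch of the pydantic route (the Person model is purely illustrative; the gbnf conversion itself is still done by the script mentioned above):
from typing import List

from pydantic import BaseModel

# Illustrative model; .schema_json() produces the JSON schema that the
# json-schema-to-gbnf script expects as input (pydantic v1-style API).
class Person(BaseModel):
    name: str
    age: int
    interests: List[str]

print(Person.schema_json(indent=2))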
In the first example, supply the path to the specified json.gbnf file in order to produce JSON:
n_gpu_layers = 1 # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls
callback_manager=callback_manager,
verbose=True, # Verbose is required to pass to the callback manager
grammar_path="/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/json.gbnf",
)
%%capture captured --no-stdout
result = llm.invoke("Describe a person in JSON format:")
{
"name": "John Doe",
"age": 34,
"": {
"title": "Software Developer",
"company": "Google"
},
"interests": [
"Sports",
"Music",
"Cooking"
],
"address": {
"street_number": 123,
"street_name": "Oak Street",
"city": "Mountain View",
"state": "California",
"postal_code": 94040
}}
``````output
llama_print_timings: load time = 357.51 ms
llama_print_timings: sample time = 1213.30 ms / 144 runs ( 8.43 ms per token, 118.68 tokens per second)
llama_print_timings: prompt eval time = 356.78 ms / 9 tokens ( 39.64 ms per token, 25.23 tokens per second)
llama_print_timings: eval time = 3947.16 ms / 143 runs ( 27.60 ms per token, 36.23 tokens per second)
llama_print_timings: total time = 5846.21 ms
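Because the grammar constrains generation to valid JSON, the captured string can be parsed directly. A small sketch (the exact keys depend on what the model generated):
import json

# result holds the raw string returned by llm.invoke above
person = json.loads(result)
print(person.get("name"))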
We can also supply list.gbnf to return a list:
n_gpu_layers = 1
n_batch = 512
llm = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls
callback_manager=callback_manager,
verbose=True,
grammar_path="/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/list.gbnf",
)
%%capture captured --no-stdout
result = llm.invoke("List of top-3 my favourite books:")
["The Catcher in the Rye", "Wuthering Heights", "Anna Karenina"]
``````output
llama_print_timings: load time = 322.34 ms
llama_print_timings: sample time = 232.60 ms / 26 runs ( 8.95 ms per token, 111.78 tokens per second)
llama_print_timings: prompt eval time = 321.90 ms / 11 tokens ( 29.26 ms per token, 34.17 tokens per second)
llama_print_timings: eval time = 680.82 ms / 25 runs ( 27.23 ms per token, 36.72 tokens per second)
llama_print_timings: total time = 1295.27 ms