vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving, offering:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
This notebook goes over how to use an LLM with langchain and vLLM.
To use, you should have the vllm python package installed.
%pip install --upgrade --quiet vllm -q
from langchain_community.llms import VLLM
llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)
print(llm.invoke("What is the capital of France ?"))
API Reference: VLLM
INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.00it/s]
What is the capital of France ? The capital of France is Paris.
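Because vLLM batches incoming requests continuously, several prompts can also be sent in a single call through the standard LangChain batch API. The following is a minimal sketch (not part of the original notebook) that reuses the llm object defined above; the prompts are only illustrative:
questions = [
    "What is the capital of France ?",
    "What is the capital of Italy ?",
]
# Each prompt yields one completion string
for answer in llm.batch(questions):
    print(answer)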
Integrate the model in an LLMChain
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "Who was the US president in the year the first Pokemon game was released?"
print(llm_chain.invoke(question))
API Reference: LLMChain | PromptTemplate
Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.34s/it]
1. The first Pokemon game was released in 1996.
2. The president was Bill Clinton.
3. Clinton was president from 1993 to 2001.
4. The answer is Clinton.
Distributed Inference
vLLM supports distributed tensor-parallel inference and serving.
To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
from langchain_community.llms import VLLM
llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # mandatory for hf models
)
llm.invoke("What is the future of AI?")
API Reference: VLLM
Quantization
vLLM supports awq quantization. To enable it, pass quantization to vllm_kwargs.
llm_q = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},
)
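The quantized model can then be queried the same way as the other examples; the prompt below is only an illustration:
print(llm_q.invoke("Explain the benefits of AWQ quantization in one sentence."))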
OpenAI-Compatible Server
vLLM can be deployed as a server that mimics the OpenAI API protocol, which allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API.
This server can be queried in the same format as the OpenAI API.
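Starting the server itself happens outside of LangChain. One common way, shown here as an illustrative command (the model name and port are placeholders you should adapt), is vLLM's built-in OpenAI-compatible API server:
python -m vllm.entrypoints.openai.api_server --model tiiuae/falcon-7b --port 8000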
OpenAI-Compatible Completion
from langchain_community.llms import VLLMOpenAI
llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://127.0.0.1:8000/v1",
    model_name="tiiuae/falcon-7b",
    model_kwargs={"stop": ["."]},
)
print(llm.invoke("Rome is"))
API Reference: VLLMOpenAI
a city that is filled with history, ancient buildings, and art around every corner
LoRA Adapters
LoRA adapters can be used with any vLLM model that implements SupportsLoRA.
from langchain_community.llms import VLLM
from vllm.lora.request import LoRARequest
llm = VLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_new_tokens=300,
    top_k=1,
    top_p=0.90,
    temperature=0.1,
    vllm_kwargs={
        "gpu_memory_utilization": 0.5,
        "enable_lora": True,
        "max_model_len": 350,
    },
)
LoRA_ADAPTER_PATH = "path/to/adapter"
lora_adapter = LoRARequest("lora_adapter", 1, LoRA_ADAPTER_PATH)
print(
    llm.invoke("What are some popular Korean street foods?", lora_request=lora_adapter)
)
API Reference: VLLM