Titan Takeoff
TitanML
TitanML helps businesses build and deploy better, smaller, cheaper, and faster NLP models through our training, compression, and inference optimization platform.
Our inference server, Titan Takeoff, lets you deploy LLMs locally on your own hardware with a single command. Most generative model architectures are supported, such as Falcon, Llama 2, GPT2, T5, and many more. If you run into trouble with a specific model, please let us know at hello@titanml.co.
Example usage
Here are some helpful examples to get you started using the Titan Takeoff Server. Before running these commands, make sure the Takeoff Server has been started in the background. For more information, see the docs page for launching Takeoff.
import time
# Note: importing TitanTakeoffPro instead of TitanTakeoff works as well; both use the same object under the hood
from langchain_community.llms import TitanTakeoff
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate
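Before running the examples, you can sanity-check that a Takeoff server is reachable on the expected port. The snippet below is a minimal sketch using only the Python standard library; the localhost:3000 address is an assumption matching the default setup described above.

import socket

# Hypothetical connectivity check: attempt a TCP connection to the default Takeoff port.
# Adjust the host/port if your server is configured differently.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(2)
    if sock.connect_ex(("localhost", 3000)) == 0:
        print("Takeoff server is reachable on localhost:3000")
    else:
        print("Could not reach Takeoff on localhost:3000 - is the server running?")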
Example 1
Basic usage, assuming Takeoff is running on your machine on its default port (i.e. localhost:3000).
llm = TitanTakeoff()
output = llm.invoke("What is the weather in London in August?")
print(output)
Example 2
Specifying a port and other generation parameters
llm = TitanTakeoff(port=3000)
# A comprehensive list of parameters can be found at https://docs.titanml.co/docs/next/apis/Takeoff%20inference_REST_API/generate#request
output = llm.invoke(
"What is the largest rainforest in the world?",
consumer_group="primary",
min_new_tokens=128,
max_new_tokens=512,
no_repeat_ngram_size=2,
sampling_topk=1,
sampling_topp=1.0,
sampling_temperature=1.0,
repetition_penalty=1.0,
regex_string="",
json_schema=None,
)
print(output)
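The regex_string and json_schema parameters above enable constrained generation. As a hedged sketch only (the accepted schema shape is defined by the Takeoff generate API linked in the comment above, so treat this exact dictionary as an assumption), a JSON Schema could be passed like this:

# Assumed example of structured generation; consult the Takeoff generate API docs
# for the exact schema format your server version expects.
schema = {
    "type": "object",
    "properties": {
        "rainforest": {"type": "string"},
        "continent": {"type": "string"},
    },
    "required": ["rainforest", "continent"],
}
output = llm.invoke(
    "Name the largest rainforest in the world and its continent.",
    json_schema=schema,
)
print(output)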
Example 3
Using generate for multiple inputs
llm = TitanTakeoff()
rich_output = llm.generate(["What is Deep Learning?", "What is Machine Learning?"])
print(rich_output.generations)
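generate returns a standard LangChain LLMResult, so the text of each completion can be pulled out of the nested generations list, for example:

# generations holds one list of Generation objects per input prompt
for prompt_generations in rich_output.generations:
    print(prompt_generations[0].text)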
Example 4
Streaming output
llm = TitanTakeoff(
streaming=True, callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])
)
prompt = "What is the capital of France?"
output = llm.invoke(prompt)
print(output)
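Alternatively, if you prefer to consume tokens directly rather than through a callback handler, the standard Runnable streaming interface can be used. This is a sketch assuming the integration supports token streaming via stream():

# Iterate over tokens as they are produced (assumes streaming support in the integration)
for chunk in llm.stream("What is the capital of France?"):
    print(chunk, end="", flush=True)
print()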
Example 5
Using LCEL
llm = TitanTakeoff()
prompt = PromptTemplate.from_template("Tell me about {topic}")
chain = prompt | llm
output = chain.invoke({"topic": "the universe"})
print(output)
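Because the chain is a standard LCEL Runnable, the usual batch interface is also available, for example:

# Run the same chain over several inputs in one call
outputs = chain.batch([{"topic": "the universe"}, {"topic": "black holes"}])
for o in outputs:
    print(o)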
Example 6
Starting readers using the TitanTakeoff Python wrapper. If you haven't created any readers by launching Takeoff first, or you want to add another, you can do so when you initialize the TitanTakeoff object: just pass a list of the model configs you want to start as the models parameter.
# Model config for the llama model, where you can specify the following parameters:
# model_name (str): The name of the model to use
# device (str): The device to use for inference, cuda or cpu
# consumer_group (str): The consumer group to place the reader into
# tensor_parallel (Optional[int]): The number of gpus you would like your model to be split across
# max_seq_length (int): The maximum sequence length to use for inference, defaults to 512
# max_batch_size (int): The max batch size for continuous batching of requests
llama_model = {
"model_name": "TheBloke/Llama-2-7b-Chat-AWQ",
"device": "cuda",
"consumer_group": "llama",
}
llm = TitanTakeoff(models=[llama_model])
# The model needs time to spin up; how long will depend on the size of the model and your network connection speed
time.sleep(60)
prompt = "What is the capital of France?"
output = llm.invoke(prompt, consumer_group="llama")
print(output)
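Since models accepts a list, more than one reader can be started in the same way. As a sketch (the second model name here is an assumption; pick any model supported by your Takeoff installation), two readers placed in different consumer groups could look like this:

# Hypothetical second reader config; replace model_name with one you actually use
falcon_model = {
    "model_name": "tiiuae/falcon-7b-instruct",
    "device": "cuda",
    "consumer_group": "falcon",
}
llm = TitanTakeoff(models=[llama_model, falcon_model])
time.sleep(60)  # give both readers time to spin up
print(llm.invoke("What is the capital of France?", consumer_group="falcon"))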