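1. Introduction
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system. In short, SGLang simplifies writing LLM programs and improves execution efficiency; it can speed up common LLM tasks by up to 5x.
According to the official Qwen description, the Qwen1.5 series models also support inference acceleration with SGLang.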
2. Terminology
2.1. SGLang
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
The core features of SGLang include:
- A Flexible Front-End Language: This allows for easy programming of LLM applications with multiple chained generation calls, advanced prompting techniques, control flow, multiple modalities, parallelism, and external interaction.
- A High-Performance Runtime with RadixAttention: This feature significantly accelerates the execution of complex LLM programs by automatic KV cache reuse across multiple calls. It also supports other common techniques like continuous batching and tensor parallelism.
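The idea of reusing the KV cache across calls can be pictured with a toy prefix tree. The sketch below is only an illustration of prefix sharing, not SGLang's RadixAttention implementation; the class name PrefixCache and its method are hypothetical, and it counts reusable tokens instead of storing real key/value tensors.

```python
class PrefixCache:
    """Toy prefix tree that remembers token sequences seen so far."""

    def __init__(self):
        self.root = {}  # nested dicts: token -> child node

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        reused = 0
        for tok in tokens:
            if tok in node:
                reused += 1      # this token's KV entries could be reused
            else:
                node[tok] = {}   # first miss: everything after it is new
            node = node[tok]
        return reused


cache = PrefixCache()
system_prompt = ["<sys>", "You", "are", "helpful"]
first = cache.match_and_insert(system_prompt + ["Q1"])   # nothing cached yet
second = cache.match_and_insert(system_prompt + ["Q2"])  # shares the system prompt
print(first, second)
```

Two requests that share the same system prompt only need to compute the part after the shared prefix, which is why multi-call programs with common prefixes benefit most.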
2.2. Qwen1.5
Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Compared with the previously released Qwen, the improvements include:
- 6 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, and 72B;
- Significant performance improvement in human preference for chat models;
- Multilingual support of both base and chat models;
- Stable support of 32K context length for models of all sizes;
- No need for trust_remote_code.
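2.3. Anaconda
Anaconda is a distribution that makes it easy to obtain and manage packages and to manage environments in one place. It bundles conda, Python, and more than 180 scientific packages together with their dependencies.
3. Environment Setup
3.1. Base environment and prerequisites
- OS: CentOS 7
- GPU: Tesla V100-SXM2-32GB, CUDA Version: 12.2
- Download the Qwen1.5-7B-Chat model in advance (place it under /model and rename the directory to qwen1.5-7b-chat)
Hugging Face: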
https://huggingface.co/Qwen/Qwen1.5-7B-Chat/tree/main
ModelScope:
git clone https://www.modelscope.cn/qwen/Qwen1.5-7B-Chat.git
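Before launching the server it can help to confirm that the download landed where the later commands expect it. The following is a minimal sketch; the helper name check_model_dir is hypothetical, and the assumption that a usable checkpoint contains config.json plus *.safetensors or *.bin weight files should be checked against the actual repository contents.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems found in a local model directory (empty if none)."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Assumption: weights are stored as *.safetensors or *.bin files
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("no *.safetensors or *.bin weight files found")
    return problems


if __name__ == "__main__":
    # Path taken from the prerequisites above
    print(check_model_dir("/model/qwen1.5-7b-chat") or "looks OK")
```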
3.2. Installing Anaconda
1. Update system packages
sudo yum upgrade -y
2. Download Anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
3. Install
Default installation:
bash Anaconda3-2022.10-Linux-x86_64.sh
Or use -p to set the install directory to /opt/anaconda3:
bash Anaconda3-2022.10-Linux-x86_64.sh -p /opt/anaconda3
4. Initialize (reload the shell configuration)
source ~/.bashrc
5. Verify the installation
conda --version
6. Configure mirror channels
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes
3.3. Creating a virtual environment
3.3.1. Create a new environment
conda create --name sglang python=3.10
3.3.2. Activate the environment
conda activate sglang
3.4. Installing sglang
3.4.1. Install the core package
Install with pip:
pip install "sglang[all]"
Install from source:
git clone git@github.com:sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
3.4.2. Install dependencies
pip install triton
3.4.3. List installed packages
conda list
or:
pip list
Note: activate the sglang virtual environment before running any of the commands above.
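Since the problems described later in this article both come down to package versions, a quick version report from inside the activated environment can save time. This is a small sketch using the standard library's importlib.metadata; the function name installed_versions and the list of packages checked are illustrative choices, not part of sglang.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages that matter for this setup (illustrative list)
    for name, version in installed_versions(["sglang", "vllm", "triton", "torch"]).items():
        print(f"{name}: {version or 'not installed'}")
```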
4. Deploying the Service
4.1. Starting the sglang server
python -m sglang.launch_server --model-path /model/qwen1.5-7b-chat --host 0.0.0.0 --port 9000 --mem-fraction-static 0.8 --tp 1 --trust-remote-code --max-prefill-num-token 10240 --context-length 10240
Available options:
usage: launch_server.py [-h] --model-path MODEL_PATH [--tokenizer-path TOKENIZER_PATH] [--host HOST] [--port PORT] [--additional-ports [ADDITIONAL_PORTS ...]]
[--load-format {auto,pt,safetensors,npcache,dummy}] [--tokenizer-mode {auto,slow}] [--chat-template CHAT_TEMPLATE] [--trust-remote-code]
[--mem-fraction-static MEM_FRACTION_STATIC]
[--max-prefill-num-token MAX_PREFILL_NUM_TOKEN] [--context-length CONTEXT_LENGTH] [--tp-size TP_SIZE] [--schedule-heuristic SCHEDULE_HEURISTIC]
[--schedule-conservativeness SCHEDULE_CONSERVATIVENESS] [--random-seed RANDOM_SEED] [--attention-reduce-in-fp32] [--stream-interval STREAM_INTERVAL] [--log-level LOG_LEVEL]
[--disable-log-stats] [--log-stats-interval LOG_STATS_INTERVAL] [--disable-radix-cache] [--enable-flashinfer] [--disable-regex-jump-forward] [--disable-disk-cache]
[--api-key API_KEY]
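To get a feel for values such as --mem-fraction-static and --context-length, it helps to estimate how much memory the KV cache needs per token. The arithmetic below is a rough sketch under stated assumptions: 32 layers, 32 key/value heads, head dimension 128, and 2-byte (fp16) values are assumed figures for Qwen1.5-7B that should be confirmed in the model's config.json, and the 10 GiB budget is purely hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(budget_bytes, per_token_bytes):
    """How many tokens of KV cache fit into a given memory budget."""
    return budget_bytes // per_token_bytes


# Assumed model shape (verify against config.json)
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_value=2)
# Hypothetical budget left for the KV cache after loading the weights
budget = 10 * 2**30
print(per_token, tokens_that_fit(budget, per_token))
```

Under these assumptions each token costs 0.5 MiB, so the number of tokens the server can hold scales directly with whatever memory remains after the weights are loaded.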
Comparison with the vllm service
4.2. Startup result
watch -n 1 nvidia-smi
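Besides watching GPU memory, startup can be confirmed by polling the server over HTTP. The sketch below assumes the /get_model_info endpoint that appears in the server log later in this article; the helper names http_status and wait_until_ready, the retry count, and the delay are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=3):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # connection refused, timeout, DNS failure, ...


def wait_until_ready(url, fetch=http_status, retries=60, delay=2.0):
    """Poll url until it returns 200; give up after the given number of retries."""
    for _ in range(retries):
        if fetch(url) == 200:
            return True
        time.sleep(delay)
    return False


# Example call against the server started above:
# wait_until_ready("http://127.0.0.1:9000/get_model_info")
```

The fetch parameter exists so the polling logic can be exercised without a running server.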
5. Testing
5.1. Streaming example
# -*- coding: utf-8 -*-
import json
import logging
import sys
import traceback

import requests
from requests.adapters import HTTPAdapter

####################### Logging configuration #######################
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s]: %(message)s',  # log record format
    datefmt='%Y-%m-%d %H:%M:%S'  # timestamp format
)
# Create a logger
formatter = logging.Formatter('%(asctime)s [%(levelname)s]: %(message)s')  # log record format
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
# File handler that writes log records to a file (the log directory must already exist)
if sys.platform == "linux":
    file_handler = logging.FileHandler('/data/logs/app.log')
else:
    file_handler = logging.FileHandler('E:\\logs\\app.log')
file_handler.setFormatter(formatter)
# Optional console handler that prints log records to the console
# console_handler = logging.StreamHandler()
# console_handler.setFormatter(formatter)
# Attach the handler to the logger
logger.addHandler(file_handler)

DEFAULT_IP = '127.0.0.1'
DEFAULT_PORT = 9000
DEFAULT_MAX_TOKENS = 10240
DEFAULT_CONNECT_TIMEOUT = 3
DEFAULT_REQUEST_TIMEOUT = 60
DEFAULT_MAX_RETRIES = 0
DEFAULT_POOLSIZE = 100


class Model:
    def __init__(self):
        self.headers = {"User-Agent": "Test Client"}
        # Reuse one session so connections are pooled across requests
        self.s = requests.Session()
        self.s.mount('http://', HTTPAdapter(pool_connections=DEFAULT_POOLSIZE, pool_maxsize=DEFAULT_POOLSIZE, max_retries=DEFAULT_MAX_RETRIES))
        self.s.mount('https://', HTTPAdapter(pool_connections=DEFAULT_POOLSIZE, pool_maxsize=DEFAULT_POOLSIZE, max_retries=DEFAULT_MAX_RETRIES))

    def chat(self, message, history=None, system=None, config=None, stream=True):
        if config is None:
            config = {'temperature': 0.45, 'top_p': 0.9, 'presence_penalty': 1.2, 'max_new_tokens': DEFAULT_MAX_TOKENS}
        try:
            # Build the prompt in the ChatML layout used by Qwen chat models
            prompt_str = ''
            if system is not None:
                prompt_str = prompt_str + '<|im_start|>system\n' + system + '<|im_end|>\n'
            if history is not None:
                for his in history:
                    q, v = his
                    prompt_str = prompt_str + '<|im_start|>user\n' + q + '<|im_end|>\n<|im_start|>assistant\n' + v + '<|im_end|>\n'
            prompt_str = prompt_str + '<|im_start|>user\n' + message + '<|im_end|>\n<|im_start|>assistant\n'
            # The original non-streaming branch also sent "stream": True; pass the flag through instead
            pload = {"text": prompt_str, "stream": stream, "stop": ["<|im_end|>", "<|im_start|>"]}
            merge_pload = {**pload, 'sampling_params': config}
            logger.info(f'merge_pload: {merge_pload}')
            response = self.s.post(f"http://{DEFAULT_IP}:{DEFAULT_PORT}/generate", headers=self.headers, json=merge_pload,
                                   stream=stream, timeout=(DEFAULT_CONNECT_TIMEOUT, DEFAULT_REQUEST_TIMEOUT))
            if not stream:
                # Non-streaming: the whole answer is expected in a single JSON body
                yield response.json()["text"].strip()
                return
            prev = 0
            for chunk in response.iter_lines(decode_unicode=False):
                if chunk:
                    chunk = chunk.decode("utf-8")
                    # print(chunk)
                    if chunk.startswith("data:"):
                        if chunk == "data: [DONE]":
                            break
                        data = json.loads(chunk[5:].strip("\n"))
                        # Each event is treated as the full text so far; yield only the new suffix
                        output = data["text"].strip()
                        if len(output) > 0:
                            result = output[prev:]
                            prev = len(output)
                            yield result
        except Exception:
            traceback.print_exc()
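

if __name__ == '__main__':
    model = Model()
    message = 'What are the local specialties where I live?'
    system = 'You are an AI assistant who is good at helping people answer questions'
    history = [('Hello', 'Hello! I am glad to help. Is there anything you would like to ask?'),
               ('I live in Guangzhou. What about you?', 'I am an AI assistant and have no physical home, but I can answer questions and provide information. How can I help you?')]
    config = {'temperature': 0.45, 'top_p': 0.9, 'presence_penalty': 1.2, 'max_new_tokens': 10240}
    gen = model.chat(message=message, history=history, system=system, config=config, stream=True)
    results = []
    for value in gen:
        results.append(value)
    # Avoid shadowing the built-in str
    answer = ''.join(results)
    print(answer)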
Execution output:
Server log:
sglang log:
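The prompt assembly inside chat() can also be pulled out into a small pure function, which makes the ChatML layout easy to inspect and test without a running server. This is a sketch of the same concatenation logic; the function name build_chatml_prompt is hypothetical.

```python
def build_chatml_prompt(message, history=None, system=None):
    """Assemble a ChatML prompt from an optional system message, past turns, and a new message."""
    parts = []
    if system is not None:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    for question, answer in history or []:
        parts.append(f"<|im_start|>user\n{question}<|im_end|>\n")
        parts.append(f"<|im_start|>assistant\n{answer}<|im_end|>\n")
    # Leave the assistant turn open so the model continues from here
    parts.append(f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(build_chatml_prompt("Hi", history=[("Hello", "Hello!")], system="You are helpful"))
```

Because the stop list in the request contains <|im_end|> and <|im_start|>, generation ends as soon as the model closes the open assistant turn.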
6. Additional Notes
Problem 1: running python -m sglang.launch_server ... fails
The error stack trace is:
from vllm.model_executor.input_metadata import InputMetadata
ModuleNotFoundError: No module named 'vllm.model_executor.input_metadata'
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
router init state: Traceback (most recent call last):
File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
model_client = ModelRpcClient(server_args, port_args)
File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 619, in __init__
self.model_server.exposed_init_model(0, server_args, port_args)
File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 70, in exposed_init_model
self.model_runner = ModelRunner(
File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 287, in __init__
self.load_model()
File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 296, in load_model
model_class = get_model_cls_by_arch_name(architectures)
File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 49, in get_model_cls_by_arch_name
model_arch_name_to_cls = import_model_classes()
File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 42, in import_model_classes
module = importlib.import_module(name)
File "/opt/anaconda3/envs/sglang/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/models/gemma.py", line 12, in <module>
from vllm.model_executor.input_metadata import InputMetadata
ModuleNotFoundError: No module named 'vllm.model_executor.input_metadata'
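Cause: the installed vllm version is incompatible; it does not contain the vllm.model_executor.input_metadata module that sglang imports.
Solution: reinstall vllm.
Steps:
1) Check the current vllm version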
pip list | grep vllm
Output:
vllm 0.4.0.post1
2) Uninstall the current vllm version
pip uninstall vllm
Output:
Found existing installation: vllm 0.4.0.post1
Uninstalling vllm-0.4.0.post1:
Would remove:
/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/core/*
/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/lora/*
/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/spec_decode/*
/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/tokenization/*
/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/worker/*
/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/vllm-0.4.0.post1.dist-info/*
/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/vllm/*
Proceed (Y/n)? y
3) Reinstall vllm from commit d65fac2738f0287a41955b45df76a2d5a919bff6 (this is the key step)
Download or clone the vllm source at that commit:
https://github.com/vllm-project/vllm/tree/d65fac2738f0287a41955b45df76a2d5a919bff6
cd vllm
pip install .
If the installation still fails, for example:
ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects
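A blunt workaround is to take the files under vllm/vllm from the downloaded source at that commit and overwrite the files in the virtual environment's vllm directory.
That is, copy the contents of the downloaded vllm/vllm directory into the following directory: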
/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/vllm
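After reinstalling, one way to confirm the fix before restarting the server is to check whether the module named in the traceback can be found. The sketch below uses importlib.util.find_spec from the standard library; the helper name module_available is hypothetical. Note that find_spec imports the parent package, so this must run inside the sglang environment.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the module can be located, False if it or a parent is missing."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or is not a package
        return False


if __name__ == "__main__":
    # Module name taken from the ModuleNotFoundError above
    print(module_available("vllm.model_executor.input_metadata"))
```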
Problem 2: running python -m sglang.launch_server ... fails
The error stack trace is:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: load weight begin.
Rank 0: load weight end.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: max_total_num_token=21819, max_prefill_num_token=32768, context_len=32768,
disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: Started server process [28355]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
INFO: 127.0.0.1:42524 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 8. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
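Cause: the installed triton release does not work on this GPU; on the NVIDIA V100 the triton nightly build is needed.
Solution: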
# For NVIDIA V100, please install the nightly version.
pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly
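Steps:
1) Check the current triton version
pip list | grep triton
2) Uninstall the current triton version
pip uninstall triton
3) Download and install the triton nightly wheel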
wget https://aiinfra.pkgs.visualstudio.com/2692857e-05ef-43b4-ba9c-ccf1c22c437c/_packaging/07c94329-d4c3-4ad4-9e6b-f904a60032ec/pypi/download/triton-nightly/2.1.post20240108192258/triton_nightly-2.1.0.post20240108192258-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=bff4e8c78c5f4ed888ddc3152fc09b1693f7669d5d3faa8709653c6037298cdd
pip install triton_nightly-2.1.0.post20240108192258-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
From: https://blog.csdn.net/qq839019311/article/details/137498993