
Open-Source Model Deployment in Practice: Inference Acceleration with qwen1.5-7b-chat and sglang, Done Right (Part 1)


1. Preface

    SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system. In short, SGLang simplifies writing LLM programs and improves their execution efficiency, accelerating common LLM tasks by up to 5x.

    Looking at Qwen's official description: in short, the Qwen1.5 model series likewise supports inference acceleration with SGLang.

2. Terminology

2.1. SGLang

    SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.

The core features of SGLang include:

  • A Flexible Front-End Language: This allows for easy programming of LLM applications with multiple chained generation calls, advanced prompting techniques, control flow, multiple modalities, parallelism, and external interaction (see the sketch after this list).
  • A High-Performance Runtime with RadixAttention: This feature significantly accelerates the execution of complex LLM programs by automatic KV cache reuse across multiple calls. It also supports other common techniques like continuous batching and tensor parallelism.
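
To make the front-end language concrete, below is a minimal sketch of an SGLang program. It assumes the @sgl.function / sgl.gen API and the RuntimeEndpoint backend described in the sglang README; the function name qa and the prompt text are illustrative:

# Minimal SGLang front-end sketch (assumes the @sgl.function / sgl.gen API
# from the sglang README; adapt to the version you install in section 3.4).
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128, temperature=0.45))

# Point the front end at a running SRT server (started as in section 4.1).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://127.0.0.1:9000"))

state = qa.run(question="What does RadixAttention reuse across calls?")
print(state["answer"])

Because the runtime keeps a radix tree over KV caches, repeated calls that share a prompt prefix (the system message here) can skip recomputing it.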

2.2. Qwen1.5

    Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Compared with the previously released Qwen, the improvements include:

  • 6 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, and 72B;
  • Significant performance improvement in human preference for chat models;
  • Multilingual support of both base and chat models;
  • Stable support of 32K context length for models of all sizes;
  • No need of trust_remote_code (illustrated in the sketch below).
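
A quick, hedged illustration of the last point: a Qwen1.5 checkpoint loads with the stock transformers classes and no trust_remote_code flag. The sketch assumes transformers >= 4.37 (the minimum version Qwen1.5 requires), the accelerate package for device_map, and the local model path used in section 3.1:

# Minimal sketch: Qwen1.5 loads without trust_remote_code.
# Assumes transformers >= 4.37, accelerate installed, and the model
# downloaded to /model/qwen1.5-7b-chat as in section 3.1.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/model/qwen1.5-7b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "/model/qwen1.5-7b-chat",
    torch_dtype="auto",
    device_map="auto",
)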

2.3. Anaconda

    Anaconda (official website) is a distribution that makes it easy to obtain and manage packages, and to manage environments in a unified way. It bundles conda, Python, and more than 180 scientific packages together with their dependencies.


3. Environment Setup

3.1. Base Environment and Prerequisites

  1.  Operating system: CentOS 7
  2.  GPU: Tesla V100-SXM2-32GB, CUDA Version: 12.2
  3.  Download the qwen1.5-7b-chat model in advance (place it under the /model directory and rename it to qwen1.5-7b-chat)

Hugging Face:

https://huggingface.co/Qwen/Qwen1.5-7B-Chat/tree/main

ModelScope:

git clone https://www.modelscope.cn/qwen/Qwen1.5-7B-Chat.git

         

3.2. Installing Anaconda

        1. Update system packages

              sudo yum upgrade -y

        2. Download Anaconda

              wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh

        3. Install

              Default installation:

              bash Anaconda3-2022.10-Linux-x86_64.sh

              Or pass -p to install into /opt/anaconda3:

              bash Anaconda3-2022.10-Linux-x86_64.sh -p /opt/anaconda3

        4. Initialize the shell

              source ~/.bashrc

        5. Verify the installation

              conda --version

        6. Configure mirror channels

              conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
              conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
              conda config --set show_channel_urls yes

3.3. Creating a Virtual Environment

  3.3.1. Create a new environment

            conda create --name sglang python=3.10

  3.3.2. Activate the environment

            conda activate sglang

3.4. Installing sglang

  3.4.1. Install the core package

    Install with pip:

pip install "sglang[all]"


    Install from source:

git clone git@github.com:sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

  3.4.2. Install dependencies

pip install triton

  3.4.3. List installed packages

 conda list (or pip list)

    Note: run the commands above inside the sglang virtual environment created earlier (conda activate sglang).

4. Deploying the Service

4.1. Start the sglang server

       python -m sglang.launch_server --model-path /model/qwen1.5-7b-chat --host 0.0.0.0 --port 9000 --mem-fraction-static 0.8 --tp 1 --trust-remote-code --max-prefill-num-token 10240 --context-length 10240
    

        Available arguments:

usage: launch_server.py [-h] --model-path MODEL_PATH [--tokenizer-path TOKENIZER_PATH] [--host HOST] [--port PORT] [--additional-ports [ADDITIONAL_PORTS ...]]
                        [--load-format {auto,pt,safetensors,npcache,dummy}] [--tokenizer-mode {auto,slow}] [--chat-template CHAT_TEMPLATE] [--trust-remote-code]
                        [--mem-fraction-static MEM_FRACTION_STATIC] [--max-prefill-num-token MAX_PREFILL_NUM_TOKEN] [--context-length CONTEXT_LENGTH] [--tp-size TP_SIZE]
                        [--schedule-heuristic SCHEDULE_HEURISTIC] [--schedule-conservativeness SCHEDULE_CONSERVATIVENESS] [--random-seed RANDOM_SEED] [--attention-reduce-in-fp32]
                        [--stream-interval STREAM_INTERVAL] [--log-level LOG_LEVEL] [--disable-log-stats] [--log-stats-interval LOG_STATS_INTERVAL] [--disable-radix-cache]
                        [--enable-flashinfer] [--disable-regex-jump-forward] [--disable-disk-cache] [--api-key API_KEY]

    Compare with the corresponding vllm service.

4.2. Startup results

Monitor GPU memory usage while the model loads:

watch -n 1 nvidia-smi
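
Beyond GPU memory, you can also confirm the server is actually serving requests. The startup log quoted in section 6 (Issue 2) shows a /get_model_info endpoint, so a minimal liveness check (a hedged sketch; the endpoint name is taken from that log) is:

# Minimal liveness check; the /get_model_info endpoint appears in the
# server startup log quoted in section 6 (Issue 2).
import requests

resp = requests.get("http://127.0.0.1:9000/get_model_info", timeout=5)
print(resp.status_code, resp.json())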


5. Testing

5.1. Streaming example

# -*- coding: utf-8 -*-
import sys
import traceback

import requests
import json
import logging

####################### Logging setup #######################
from requests.adapters import HTTPAdapter

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s]: %(message)s',  # log message format
    datefmt='%Y-%m-%d %H:%M:%S'  # timestamp format
)

# Create a logger
formatter = logging.Formatter('%(asctime)s [%(levelname)s]: %(message)s')  # log message format
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

if sys.platform == "linux":
    # File handler: write logs to a file
    file_handler = logging.FileHandler('/data/logs/app.log')
else:
    # File handler: write logs to a file (Windows path)
    file_handler = logging.FileHandler('E:\\logs\\app.log')

file_handler.setFormatter(formatter)
# Optional console handler: write logs to the console
# console_handler = logging.StreamHandler()
# console_handler.setFormatter(formatter)

# Attach the handler to the logger
logger.addHandler(file_handler)

DEFAULT_IP='127.0.0.1'
DEFAULT_PORT=9000
DEFAULT_MAX_TOKENS=10240
DEFAULT_CONNECT_TIMEOUT=3
DEFAULT_REQUEST_TIMEOUT=60
DEFAULT_MAX_RETRIES=0
DEFAULT_POOLSIZE=100

class Model:
    def __init__(self):
        self.headers = {"User-Agent": "Test Client"}
        self.s = requests.Session()
        self.s.mount('http://', HTTPAdapter(pool_connections=DEFAULT_POOLSIZE, pool_maxsize=DEFAULT_POOLSIZE, max_retries=DEFAULT_MAX_RETRIES))
        self.s.mount('https://', HTTPAdapter(pool_connections=DEFAULT_POOLSIZE, pool_maxsize=DEFAULT_POOLSIZE, max_retries=DEFAULT_MAX_RETRIES))

    def chat(self, message, history=None, system=None, config=None, stream=True):

        if config is None:
            config = {'temperature': 0.45, 'top_p': 0.9, 'presence_penalty': 1.2, 'max_new_tokens': DEFAULT_MAX_TOKENS}

        try:
            prompt_str = ''

            if system is not None:
                prompt_str = prompt_str + '<|im_start|>system\n' + system + '<|im_end|>\n'

            if history is not None:
                for his in history:
                    q, v = his
                    prompt_str = prompt_str + '<|im_start|>user\n' + q + '<|im_end|>\n<|im_start|>assistant\n' + v + '<|im_end|>\n'

            prompt_str = prompt_str + '<|im_start|>user\n' + message + '<|im_end|>\n<|im_start|>assistant\n'

            if stream:
                pload = {"text": prompt_str, "stream": True, "stop": ["<|im_end|>", "<|im_start|>"]}
            else:
                # Non-streaming request: the server returns the full completion in one response
                pload = {"text": prompt_str, "stream": False, "stop": ["<|im_end|>", "<|im_start|>"]}
              

            sampling_params = {'sampling_params': config}
            merge_pload = {**pload, **sampling_params}

            logger.info(f'merge_pload: {merge_pload}')

            response = self.s.post(f"http://{DEFAULT_IP}:{DEFAULT_PORT}/generate", headers=self.headers, json=merge_pload,
                                     stream=stream, timeout=(DEFAULT_CONNECT_TIMEOUT, DEFAULT_REQUEST_TIMEOUT))

            prev = 0  # length of text already yielded; used to emit only the new delta

            for chunk in response.iter_lines(decode_unicode=False):
                if chunk:
                    chunk = chunk.decode("utf-8")
                    # print(chunk)
                    if chunk.startswith("data:"):
                        if chunk == "data: [DONE]":
                            break

                        data = json.loads(chunk[5:].strip("\n"))
                        output = data["text"].strip()

                        if len(output)>0:

                            result = output[prev:]
                            prev = len(output)

                            yield result

        except Exception:
            traceback.print_exc()


if __name__ == '__main__':
    model = Model()

    message = '我家有什么特产?'
    system = '你是一个人工智能助手,很擅长帮助人类回答问题'
    history = [('你好','你好!很高兴能为你提供帮助。有什么问题可以问我吗?'),('我家在广州,你呢?','我是一个人工智能助手,没有具体的居住地。不过我可以帮助你解答问题和提供信息。有什么我可以帮你的吗?')]
    config = {'temperature': 0.45, 'top_p': 0.9, 'presence_penalty': 1.2, 'max_new_tokens': 10240}
    gen = model.chat(message=message, history=history, system=system, config=config, stream=True)
    results = []
    for value in gen:
        results.append(value)
    text = ''.join(results)
    print(text)

Execution output:

Server log:

sglang log:
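
For completeness, a non-streaming call to the same /generate endpoint can look like the sketch below. It is hedged: the payload shape simply mirrors the streaming client above, and it assumes the non-streaming response is a single JSON object whose "text" field holds the full completion, as in the streaming chunks:

# Non-streaming sketch against the same /generate endpoint. The payload
# shape mirrors the streaming client above; assumes the response is one
# JSON object with the completion under "text".
import requests

payload = {
    "text": "<|im_start|>user\n你好<|im_end|>\n<|im_start|>assistant\n",
    "stream": False,
    "stop": ["<|im_end|>", "<|im_start|>"],
    "sampling_params": {"temperature": 0.45, "top_p": 0.9, "max_new_tokens": 256},
}

resp = requests.post("http://127.0.0.1:9000/generate", json=payload, timeout=60)
print(resp.json()["text"])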

6. Additional Notes

Issue 1: running python -m sglang.launch_server ... fails

The error stack trace:

from vllm.model_executor.input_metadata import InputMetadata
    ModuleNotFoundError: No module named 'vllm.model_executor.input_metadata'
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    router init state: Traceback (most recent call last):
      File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
        model_client = ModelRpcClient(server_args, port_args)
      File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 619, in __init__
        self.model_server.exposed_init_model(0, server_args, port_args)
      File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 70, in exposed_init_model
        self.model_runner = ModelRunner(
      File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 287, in __init__
        self.load_model()
      File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 296, in load_model
        model_class = get_model_cls_by_arch_name(architectures)
      File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 49, in get_model_cls_by_arch_name
        model_arch_name_to_cls = import_model_classes()
      File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 42, in import_model_classes
        module = importlib.import_module(name)
      File "/opt/anaconda3/envs/sglang/lib/python3.10/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
      File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
      File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 883, in exec_module
      File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
      File "/opt/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/models/gemma.py", line 12, in <module>
        from vllm.model_executor.input_metadata import InputMetadata
    ModuleNotFoundError: No module named 'vllm.model_executor.input_metadata'


Cause: the vllm dependency is installed incorrectly (wrong version).

Solution: reinstall vllm.

Steps:
    1) Check the current vllm version

pip list | grep vllm

    Output:
    vllm                      0.4.0.post1

    2) Uninstall the current vllm

pip uninstall vllm

    Output:
    Found existing installation: vllm 0.4.0.post1
    Uninstalling vllm-0.4.0.post1:
      Would remove:
        /opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/core/*
        /opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/lora/*
        /opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/spec_decode/*
        /opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/tokenization/*
        /opt/anaconda3/envs/sglang/lib/python3.10/site-packages/tests/worker/*
        /opt/anaconda3/envs/sglang/lib/python3.10/site-packages/vllm-0.4.0.post1.dist-info/*
        /opt/anaconda3/envs/sglang/lib/python3.10/site-packages/vllm/*
    Proceed (Y/n)? y
    
    3) Reinstall vllm; it must be at commit d65fac2738f0287a41955b45df76a2d5a919bff6 (important).
    Download or clone vllm at that commit:

# snapshot page: https://github.com/vllm-project/vllm/tree/d65fac2738f0287a41955b45df76a2d5a919bff6
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout d65fac2738f0287a41955b45df76a2d5a919bff6
pip install .


    If the installation still fails, for example with:
    ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects

    a crude but effective workaround is to copy the files under vllm/vllm from the checked-out commit
    directly over the vllm package inside the virtual environment, i.e. overwrite the files in:
    /opt/anaconda3/envs/sglang/lib/python3.10/site-packages/vllm

Issue 2: running python -m sglang.launch_server ... fails

The error stack trace:

    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    Rank 0: load weight begin.
    Rank 0: load weight end.
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    Rank 0: max_total_num_token=21819, max_prefill_num_token=32768, context_len=32768, 
    disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    INFO:     Started server process [28355]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
    INFO:     127.0.0.1:42524 - "GET /get_model_info HTTP/1.1" 200 OK
    new fill batch. #seq: 1. #cached_token: 0. #new_token: 8. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
    python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

    Cause: the installed triton build is incompatible with this GPU.
    Solution:
        # For NVIDIA V100, please install the nightly version.
        pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly

Steps:
    1) Check the current triton version

pip list | grep triton

    2) Uninstall the current triton

pip uninstall triton

    3) Download the triton nightly wheel and install it

wget https://aiinfra.pkgs.visualstudio.com/2692857e-05ef-43b4-ba9c-ccf1c22c437c/_packaging/07c94329-d4c3-4ad4-9e6b-f904a60032ec/pypi/download/triton-nightly/2.1.post20240108192258/triton_nightly-2.1.0.post20240108192258-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=bff4e8c78c5f4ed888ddc3152fc09b1693f7669d5d3faa8709653c6037298cdd

pip install triton_nightly-2.1.0.post20240108192258-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

From: https://blog.csdn.net/qq839019311/article/details/137498993

    近日,有消息称苹果公司正式进入人工智能竞争,开发了一种名为ReALM(ReferenCE ResolutionAsLanguageModeling)的大型语言模型,这对OpenAI及其广受欢迎的ChatGPT构成了直接挑战。据悉,ReALM旨在增强Siri的功能,特别是在理解对话上下文方面的能力,此举标志着苹果在人工智能技术上的......