1. Overview
Deploying large models with vLLM. Official site: https://vllm.ai GitHub: https://github.com/vllm-project/vllm
vLLM is a fast and easy-to-use library for large language model (LLM) inference and serving.
It has the following features (a minimal usage sketch follows the list):
- Fast: high serving throughput. With three parallel output completions per request, vLLM delivers 8.5x-15x the throughput of HuggingFace Transformers (HF) and 3.3x-3.5x the throughput of HuggingFace Text Generation Inference (TGI).
- Optimized CUDA kernels
- Flexible and easy to use:
  - Seamless integration with popular Hugging Face models.
  - High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more.
  - Tensor parallelism support for distributed inference.
  - Streaming output support.
  - OpenAI API-compatible server.
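To make the "easy to use" point concrete, here is a minimal offline-inference sketch of my own (not from the original post); it assumes vLLM is installed and uses the small facebook/opt-125m model purely for illustration:

from vllm import LLM, SamplingParams

# Load a small model once (opt-125m is used here only as a tiny example)
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a batch of one prompt
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)  # first sampled completion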
Supported models
vLLM seamlessly supports many Hugging Face models across different architectures, including Aquila, Baichuan, BLOOM, Falcon, GPT-2, GPT BigCode, GPT-J, GPT-NeoX, InternLM, LLaMA, Mistral, MPT, OPT, Qwen, and more. (https://vllm.readthedocs.io/en/latest/models/supported_models.html)
At the moment, glm3 and llama3 each already provide their own OpenAI-style service, so let's see what vLLM does differently.
2. Initial Experiment
Installation:
pip install vllm
Download the model weights (via ModelScope):
import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer
import os

# Download the Llama-3-8B-Instruct weights from ModelScope to local disk
model_dir = snapshot_download('LLM-Research/Meta-Llama-3-8B-Instruct',
                              cache_dir='/root/autodl-tmp',
                              revision='master')
Run the code above. When it finishes, model_dir points to the local path (/root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct), which is the path passed to --model below.
Launching the server:
python -m vllm.entrypoints.openai.api_server --model /root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct --trust-remote-code --port 6006
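Once the server is up, a quick sanity check is to list the served models. A small sketch of my own using requests; it assumes the server above is listening locally on port 6006 and uses the GET /v1/models endpoint visible in api_server.py further down:

import requests

# List the models served by the OpenAI-compatible server (sanity check)
resp = requests.get("http://localhost:6006/v1/models")
print(resp.json())  # should include /root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct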
Resource usage:
Trying to call it via Postman (the equivalent curl request is shown below):
curl http://localhost:6006/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct",
        "max_tokens": 60,
        "messages": [
          { "role": "user", "content": "你是谁?" }
        ]
      }'
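Because the server is OpenAI-compatible, the same request can also be made with the official openai Python client. A sketch of my own (assuming openai>=1.0 is installed and the server above is running; api_key is a placeholder, since the server was started without --api-key):

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:6006/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct",
    max_tokens=60,
    messages=[{"role": "user", "content": "你是谁?"}],
)
print(resp.choices[0].message.content)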
Neither the way the result comes back nor the speed here is the best:
By contrast, the model's own built-in service uses less GPU memory.
Its single-request test code (below) can be run directly and blends in well with existing code.
import requests
import json


def get_completion(prompt):
    # Call the model's built-in demo service: it accepts {"prompt": ...} at the
    # root path and returns the generated text in the "response" field.
    headers = {'Content-Type': 'application/json'}
    data = {"prompt": prompt}
    response = requests.post(url='http://127.0.0.1:6006', headers=headers,
                             data=json.dumps(data))
    return response.json()['response']


if __name__ == '__main__':
    print(get_completion('1+1=?'))
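For comparison, here is a sketch of my own (not from the original post) of how the same helper could target vLLM's OpenAI-style endpoint instead; it assumes the server from the previous step on port 6006, and only the URL, payload, and response field change:

import requests
import json


def get_completion_vllm(prompt):
    # Same helper, but against vLLM's /v1/chat/completions endpoint
    headers = {'Content-Type': 'application/json'}
    data = {
        "model": "/root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct",
        "max_tokens": 60,
        "messages": [{"role": "user", "content": prompt}],
    }
    response = requests.post(url='http://127.0.0.1:6006/v1/chat/completions',
                             headers=headers, data=json.dumps(data))
    # OpenAI-style response: the text lives in choices[0].message.content
    return response.json()['choices'][0]['message']['content']


if __name__ == '__main__':
    print(get_completion_vllm('1+1=?'))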
3. Dual-GPU Experiment (brief)
Distributed inference
vLLM supports distributed tensor-parallel inference and serving, using Ray to manage the distributed runtime. Install Ray with:
pip install ray
For the distributed-inference experiment, to run multi-GPU serving, pass the --tensor-parallel-size argument when starting the server.
For example, to run the API server on 2 GPUs:
python -m vllm.entrypoints.openai.api_server --model /root/autodl-tmp/Yi-6B-Chat --dtype auto --api-key token-agiclass --trust-remote-code --port 6006 --tensor-parallel-size 2
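The same setting exists for offline use as well. A sketch of my own (assuming two GPUs are visible and Ray is installed), reusing the Yi-6B-Chat path from the command above:

from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs with tensor parallelism
llm = LLM(model="/root/autodl-tmp/Yi-6B-Chat",
          trust_remote_code=True,
          tensor_parallel_size=2)

outputs = llm.generate(["你好"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)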
Multi-GPU serving is clearly a key capability, but I do not yet have a strong enough reason to dig into it.
4. Summary
From an initial read of the relevant code, vLLM takes a similar approach to the OpenAI-style endpoints; but, possibly because it is built for parallelism, it is considerably heavier and shows some incompatibilities.
My main conclusion for now is still to write applications on top of the existing stack. The crucial point is to understand the underlying principles, so that you can handle whatever comes up; the ability to explore those principles is the core skill.
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py
import asyncio
import importlib
import inspect
import os
from contextlib import asynccontextmanager
from http import HTTPStatus

import fastapi
import uvicorn
from fastapi import Request
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, Response, StreamingResponse
from prometheus_client import make_asgi_app

import vllm
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              ChatCompletionResponse,
                                              CompletionRequest, ErrorResponse)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
from vllm.logger import init_logger
from vllm.usage.usage_lib import UsageContext

TIMEOUT_KEEP_ALIVE = 5  # seconds

openai_serving_chat: OpenAIServingChat
openai_serving_completion: OpenAIServingCompletion

logger = init_logger(__name__)


@asynccontextmanager
async def lifespan(app: fastapi.FastAPI):

    async def _force_log():
        while True:
            await asyncio.sleep(10)
            await engine.do_log_stats()

    if not engine_args.disable_log_stats:
        asyncio.create_task(_force_log())

    yield


app = fastapi.FastAPI(lifespan=lifespan)


def parse_args():
    parser = make_arg_parser()
    return parser.parse_args()


# Add prometheus asgi middleware to route /metrics requests
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(_, exc):
    err = openai_serving_chat.create_error_response(message=str(exc))
    return JSONResponse(err.model_dump(), status_code=HTTPStatus.BAD_REQUEST)


@app.get("/health")
async def health() -> Response:
    """Health check."""
    await openai_serving_chat.engine.check_health()
    return Response(status_code=200)


@app.get("/v1/models")
async def show_available_models():
    models = await openai_serving_chat.show_available_models()
    return JSONResponse(content=models.model_dump())


@app.get("/version")
async def show_version():
    ver = {"version": vllm.__version__}
    return JSONResponse(content=ver)


@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest,
                                 raw_request: Request):
    generator = await openai_serving_chat.create_chat_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        assert isinstance(generator, ChatCompletionResponse)
        return JSONResponse(content=generator.model_dump())


@app.post("/v1/completions")
async def create_completion(request: CompletionRequest, raw_request: Request):
    generator = await openai_serving_completion.create_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        return JSONResponse(content=generator.model_dump())


if __name__ == "__main__":
    args = parse_args()

    app.add_middleware(
        CORSMiddleware,
        allow_origins=args.allowed_origins,
        allow_credentials=args.allow_credentials,
        allow_methods=args.allowed_methods,
        allow_headers=args.allowed_headers,
    )

    if token := os.environ.get("VLLM_API_KEY") or args.api_key:

        @app.middleware("http")
        async def authentication(request: Request, call_next):
            root_path = "" if args.root_path is None else args.root_path
            if not request.url.path.startswith(f"{root_path}/v1"):
                return await call_next(request)
            if request.headers.get("Authorization") != "Bearer " + token:
                return JSONResponse(content={"error": "Unauthorized"},
                                    status_code=401)
            return await call_next(request)

    for middleware in args.middleware:
        module_path, object_name = middleware.rsplit(".", 1)
        imported = getattr(importlib.import_module(module_path), object_name)
        if inspect.isclass(imported):
            app.add_middleware(imported)
        elif inspect.iscoroutinefunction(imported):
            app.middleware("http")(imported)
        else:
            raise ValueError(f"Invalid middleware {middleware}. "
                             f"Must be a function or a class.")

    logger.info(f"vLLM API server version {vllm.__version__}")
    logger.info(f"args: {args}")

    if args.served_model_name is not None:
        served_model_names = args.served_model_name
    else:
        served_model_names = [args.model]

    engine_args = AsyncEngineArgs.from_cli_args(args)
    engine = AsyncLLMEngine.from_engine_args(
        engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
    openai_serving_chat = OpenAIServingChat(engine, served_model_names,
                                            args.response_role,
                                            args.lora_modules,
                                            args.chat_template)
    openai_serving_completion = OpenAIServingCompletion(
        engine, served_model_names, args.lora_modules)

    app.root_path = args.root_path
    uvicorn.run(app,
                host=args.host,
                port=args.port,
                log_level=args.uvicorn_log_level,
                timeout_keep_alive=TIMEOUT_KEEP_ALIVE,
                ssl_keyfile=args.ssl_keyfile,
                ssl_certfile=args.ssl_certfile,
                ssl_ca_certs=args.ssl_ca_certs,
                ssl_cert_reqs=args.ssl_cert_reqs)