LMDeploy
https://lmdeploy.readthedocs.io/en/latest/index.html
LMDeploy has the following core features:
Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features like persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.
Effective Quantization: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.
Effortless Distribution Server: Leveraging the request distribution service, LMDeploy makes it easy and efficient to deploy multi-model services across multiple machines and GPUs.
Interactive Inference Mode: By caching the k/v of attention during multi-round dialogue processes, the engine remembers dialogue history, thus avoiding repetitive processing of historical sessions.
Excellent Compatibility: LMDeploy supports using KV cache quantization, AWQ and automatic prefix caching simultaneously (a minimal usage sketch follows this list).
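The features above map onto a small set of engine options. Here is a minimal offline-inference sketch; the 4-bit AWQ model path, tp value and generation parameters are placeholder assumptions, not recommendations from the post:

```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig

# Engine options covering the features listed above (values are illustrative):
engine_config = TurbomindEngineConfig(
    model_format='awq',          # 4-bit AWQ weight-only quantization
    quant_policy=8,              # KV cache quantization (8-bit; 4-bit also supported)
    enable_prefix_caching=True,  # automatic prefix caching
    tp=2,                        # tensor parallelism across 2 GPUs (placeholder)
    cache_max_entry_count=0.8,   # fraction of free GPU memory reserved for the KV cache
)

# Placeholder model path; any AWQ-quantized chat model supported by LMDeploy works here.
pipe = pipeline('internlm/internlm2_5-7b-chat-4bit', backend_config=engine_config)

gen_config = GenerationConfig(max_new_tokens=256, temperature=0.7)
responses = pipe(['Explain continuous batching in one paragraph.'], gen_config=gen_config)
print(responses[0].text)
```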
Vs. other inference backends
https://bentoml.com/blog/benchmarking-llm-inference-backends
https://cloud.tencent.com/developer/article/2428575
Deployment reference
https://zhuanlan.zhihu.com/p/678685048
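For serving, LMDeploy exposes an OpenAI-compatible API server (typically started with something like `lmdeploy serve api_server <model> --server-port 23333 --tp 2`). A client-side sketch using the standard openai package follows; the base_url, port and prompt are placeholder assumptions:

```python
# Client-side sketch. Assumes an LMDeploy api_server is already running locally
# and exposing an OpenAI-compatible endpoint; base_url and prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url='http://localhost:23333/v1', api_key='none')

# Ask the server which model it is hosting, then send a chat completion request.
model_name = client.models.list().data[0].id
resp = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Summarize LMDeploy in two sentences.'}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```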