Background
TensorRT-LLM is NVIDIA's official inference-acceleration framework for large language models; at the moment it ships tuned support only for certain GPU models. The recently released Chat with RTX also runs its local inference on top of TensorRT-LLM.
TensorRT-LLM supports techniques such as PagedAttention, FlashAttention, and SafeTensors, and some community benchmarks claim its throughput exceeds vLLM's.
Prerequisites
- GPU: NVIDIA A800
- QWen7B pretrained model
Getting Started
Converting the Weights
First, the QWen model weights must be converted into the serialized .engine format that TensorRT runs.
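In outline, the conversion (run inside the environment built in the next section, using the Qwen example shipped with the repository) looks roughly like the following. The exact script names and flags vary between TensorRT-LLM releases (older releases use a single examples/qwen/build.py instead of the two-step flow), and the model paths here are placeholders, so treat this as an illustrative sketch rather than exact commands:

```shell
# Illustrative sketch -- script names and flags differ across TensorRT-LLM
# releases; consult examples/qwen/README.md in your checked-out version.

# 1. Convert the Hugging Face checkpoint into a TensorRT-LLM checkpoint.
python3 examples/qwen/convert_checkpoint.py \
    --model_dir /models/Qwen-7B \
    --dtype float16 \
    --output_dir /models/qwen-7b-ckpt   # paths are assumptions

# 2. Compile the checkpoint into serialized engine files.
trtllm-build \
    --checkpoint_dir /models/qwen-7b-ckpt \
    --output_dir /models/qwen-7b-engine
```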
Building the Environment
Clone the official TensorRT-LLM repository: https://github.com/NVIDIA/TensorRT-LLM.git
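Note that the repository pulls its dependencies in as git submodules (the 3rdparty directory copied by the Dockerfile below comes from them), so it should be cloned recursively, for example:

```shell
# Clone with submodules; 3rdparty/ is populated from git submodules and
# stays empty without --recursive.
git clone --recursive https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

# If the repository was already cloned without --recursive:
git submodule update --init --recursive
```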
Then edit TensorRT-LLM/docker/Dockerfile.multi so that it reads as follows:
```dockerfile
# Multi-stage Dockerfile
ARG BASE_IMAGE=nvcr.io/nvidia/pytorch
ARG BASE_TAG=23.10-py3

FROM ${BASE_IMAGE}:${BASE_TAG} as base

# https://www.gnu.org/software/bash/manual/html_node/Bash-Startup-Files.html
# The default values come from `nvcr.io/nvidia/pytorch`
ENV BASH_ENV=${BASH_ENV:-/etc/bash.bashrc}
ENV ENV=${ENV:-/etc/shinit_v2}
SHELL ["/bin/bash", "-c"]

FROM base as devel

COPY docker/common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh

COPY cmake-3.24.4-linux-x86_64.tar.gz /tmp
COPY docker/common/install_cmake.sh install_cmake.sh
RUN bash ./install_cmake.sh && rm install_cmake.sh

COPY docker/common/install_ccache.sh install_ccache.sh
RUN bash ./install_ccache.sh && rm install_ccache.sh

# Download & install internal TRT release
ARG TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
COPY docker/common/install_tensorrt.sh install_tensorrt.sh
RUN bash ./install_tensorrt.sh \
    --TRT_VER=${TRT_VER} \
    --CUDA_VER=${CUDA_VER} \
    --CUDNN_VER=${CUDNN_VER} \
    --NCCL_VER=${NCCL_VER} \
    --CUBLAS_VER=${CUBLAS_VER} && \
    rm install_tensorrt.sh

# Install latest Polygraphy
COPY docker/common/install_polygraphy.sh install_polygraphy.sh
RUN bash ./install_polygraphy.sh && rm install_polygraphy.sh

# Install mpi4py
COPY docker/common/install_mpi4py.sh install_mpi4py.sh
RUN bash ./install_mpi4py.sh && rm install_mpi4py.sh

# Install PyTorch
ARG TORCH_INSTALL_TYPE="skip"
COPY docker/common/install_pytorch.sh install_pytorch.sh
RUN bash ./install_pytorch.sh $TORCH_INSTALL_TYPE && rm install_pytorch.sh

FROM devel as wheel
WORKDIR /src/tensorrt_llm
COPY benchmarks benchmarks
COPY cpp cpp
COPY scripts scripts
COPY tensorrt_llm tensorrt_llm
COPY 3rdparty 3rdparty
COPY setup.py requirements.txt requirements-dev.txt ./

RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

ARG BUILD_WHEEL_ARGS="--clean --trt_root /usr/local/tensorrt"
RUN python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}

FROM devel as release

WORKDIR /app/tensorrt_llm
COPY --from=wheel /src/tensorrt_llm/build/tensorrt_llm*.whl .
COPY --from=wheel /src/tensorrt_llm/cpp/include/ include/
RUN pip install tensorrt_llm*.whl --extra-index-url https://pypi.nvidia.com && \
    rm tensorrt_llm*.whl
COPY README.md ./
COPY examples examples
```
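With the Dockerfile in place, the image can be built and a container started roughly as follows. Note that the build context must contain the cmake-3.24.4-linux-x86_64.tar.gz tarball that the Dockerfile COPYs, and that the release stage depends on build arguments such as TRT_VER, which the repository's docker/Makefile normally supplies (via make -C docker release_build); passing them by hand as below is a sketch, and the image tag and mount paths are placeholders:

```shell
# Build the release stage of the multi-stage Dockerfile. The required
# --build-arg values (TRT_VER etc.) come from docker/Makefile in the repo;
# the versions shown here are placeholders, not verified values.
docker build --target release \
    -f docker/Dockerfile.multi \
    --build-arg TRT_VER=<trt-version> \
    -t tensorrt-llm:local .

# Start a container with GPU access, mounting a host directory that holds
# the QWen weights (mount path is an assumption for illustration).
docker run --gpus all -it --rm \
    -v /data/models:/models \
    tensorrt-llm:local bash
```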
From: https://www.cnblogs.com/zhouwenyang/p/18023854