
LLM Practice: Chinese-Llama-2-7b-4bit Inference on Colab's Free T4 GPU


I. Environment Setup

1. Open Colab, create a blank notebook, and under [Change runtime type] select the T4 GPU with 15 GB of VRAM; the snippet below can be used to confirm the GPU is actually attached.
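A quick sanity check, as a minimal sketch (Colab's GPU runtimes ship with torch preinstalled and expose nvidia-smi on the command line):

!nvidia-smi

import torch
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # should report a Tesla T4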

2. Install the required Python packages with pip:

!pip install --upgrade accelerate
!pip install bitsandbytes transformers_stream_generator
!pip install transformers
!pip install sentencepiece
!pip install torch

Note: after installing accelerate you must restart the notebook runtime, otherwise the following error is raised:

ImportError: Using low_cpu_mem_usage=True or a device_map requires Accelerate: pip install accelerate

Note: the code from reference [1] does not run as-is.
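Restarting can be done from the menu (Runtime > Restart runtime). Alternatively, killing the kernel process from a cell also works, since Colab brings the runtime back up automatically; this is an optional sketch, not part of the original post:

import os
# Kill the current kernel process; Colab restarts the runtime automatically.
# Only run this after the pip installs above have finished.
os.kill(os.getpid(), 9)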

II. Model Inference

Run the model-loading code:

import accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

# Pretrained model to load
model_path = "LinkSoul/Chinese-Llama-2-7b-4bit"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map='auto'
)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)        

instruction = """[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

            If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"""

Downloading the model takes a little while:

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Downloading (…)model.bin.index.json: 100%
26.8k/26.8k [00:00<00:00, 1.13MB/s]
Downloading shards: 0%
0/2 [00:00<?, ?it/s]
Downloading (…)l-00001-of-00002.bin: 100%
9.97G/9.98G [04:58<00:00, 38.5MB/s]
Downloading (…)l-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading (…)neration_config.json: 100%
132/132 [00:00<00:00, 4.37kB/s]
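Once the shards are downloaded and loaded, it can be worth checking how much memory the 4-bit model actually occupies, to confirm it fits comfortably within the T4's 15 GB. A small optional check, using the model object loaded above:

# Approximate size of the quantized model, in GB
print(f"model footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")
# GPU memory currently allocated by PyTorch
print(f"cuda allocated:  {torch.cuda.memory_allocated() / 1024**3:.2f} GB")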

Demo 1

prompt = instruction.format("What is the meaning of life")
generate_ids = model.generate(tokenizer(prompt, return_tensors='pt').input_ids.cuda(), max_new_tokens=4096, streamer=streamer)

Output:

/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1421: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py:224: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed.
  warnings.warn(f'Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed.')
  
The meaning of life is a philosophical question that has been debated for centuries. There is no one definitive answer, as different people and cultures may have different beliefs and values. Some people believe that the meaning of life is to seek happiness, while others believe that it is to fulfill a higher purpose or to serve a greater good. Ultimately, the meaning of life is a personal and subjective question that each individual must answer for themselves.
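The Linear4bit warning above appears because the 4-bit compute dtype defaults to float32 while the inputs are float16. It can usually be silenced, and inference sped up somewhat, by loading the model through a BitsAndBytesConfig with bnb_4bit_compute_dtype set to float16, instead of passing load_in_4bit directly. This is a hedged variant of the loading code above, not something the original post does:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantization settings: 4-bit weights, float16 compute to match the inputs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)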

Demo 2

prompt = instruction.format("如何做个不拖延的人")
generate_ids = model.generate(tokenizer(prompt, return_tensors='pt').input_ids.cuda(), max_new_tokens=4096, streamer=streamer)

Output:

答案:不拖延的人是一个很好的目标,但是要成为一个不拖延的人并不容易。以下是一些建议,可以帮助你成为一个不拖延的人:

1. 制定计划:制定一个详细的计划,包括每天要完成的任务和时间表。这样可以帮助你更好地组织时间,并避免拖延。
2. 设定目标:设定个明确的目标,并制定一个实现这个目标的计划。这样可以帮助你更好地了解自己的目标,并更有动力地去完成任务。
3. 克服拖延的心理延的心理是一个常见的问题,但是可以通过一些方法克服。例如,你可以尝试使用一些技巧来克服拖延,如分解任务、使用时间管理工具等。
4. 坚持自己的计划:坚持自己的计划是非常重要的。如果你经常拖延,那么你需要坚持自己的计划,并尽可能地按照计划去完成任务
5. 寻求帮助
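The streamer prints tokens as they are generated, but the return value of model.generate still holds the full sequence, so the answer can also be recovered afterwards with the tokenizer. A small hypothetical helper wrapping the two demos might look like this (the chat() name and its structure are illustrative, assuming the model, tokenizer, and instruction objects defined above):

def chat(question, max_new_tokens=512):
    # Fill the system/instruction template defined earlier
    prompt = instruction.format(question)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    generate_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and decode only the newly generated part
    return tokenizer.decode(
        generate_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )

print(chat("如何做个不拖延的人"))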

III. References

[1] Llama-2-7b-4bit inference: https://www.bilibili.com/read/cv25258378/
[2] Original Kaggle notebook: https://www.kaggle.com/code/tiansztianszs/chinese-llama-2-7b-4bit/notebook

From: https://www.cnblogs.com/justLittleStar/p/17745291.html
