Llama Model Family: Fine-Tuning the Pretrained Llama 3 Language Model with Supervised Fine-Tuning (SFT), Part 5: Inference with the Trained Model

from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

args = dict(
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",            # load the saved LoRA adapters
  template="llama3",                     # same as the one used in training
  finetuning_type="lora",                  # same as the one used in training
  quantization_bit=4,                    # load 4-bit quantized model
  use_unsloth=True,                     # use UnslothAI's LoRA optimization for 2x faster generation
)
chat_model = ChatModel(args)

messages = []
print("Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.")
while True:
  query = input("\nUser: ")
  if query.strip() == "exit":
    break
  if query.strip() == "clear":
    messages = []
    torch_gc()
    print("History has been removed.")
    continue

  messages.append({"role": "user", "content": query})
  print("Assistant: ", end="", flush=True)

  response = ""
  for new_text in chat_model.stream_chat(messages):
    print(new_text, end="", flush=True)
    response += new_text
  print()
  messages.append({"role": "assistant", "content": response})

torch_gc()

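Importing the required modules:
- ChatModel is imported from llamafactory.chat and is used to create the chat model.
- torch_gc is imported from llamafactory.extras.misc and is used to free memory that is no longer in use, by running garbage collection and releasing PyTorch's cached GPU memory.
Changing to the LLaMA-Factory directory:
- %cd /content/LLaMA-Factory/ is an IPython magic command (available in notebooks such as Colab) that switches to the directory holding the LLaMA-Factory code and related files, so that the relative adapter path llama3_lora used below resolves against it.
Setting the model parameters:
- args is a dictionary holding the model configuration parameters.
- model_name_or_path gives the model name or path; here it is the bnb-4bit-quantized Llama-3-8B-Instruct model, unsloth/llama-3-8b-Instruct-bnb-4bit.
- adapter_name_or_path gives the name or path of the saved LoRA adapter.
- template specifies the prompt template, which must be the same one used during training.
- finetuning_type specifies the fine-tuning type, here LoRA, and must also match the one used during training.
- quantization_bit sets the number of quantization bits, here 4.
- use_unsloth controls whether to use UnslothAI's LoRA optimization to speed up generation.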
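To make the role of the template parameter more concrete, here is a minimal sketch of how a message list could be rendered into a Llama 3 style prompt. It follows Meta's published Llama 3 chat format. The function name render_llama3_prompt is hypothetical, and LLaMA-Factory's own llama3 template may differ in details such as system prompt handling.

```python
# Hypothetical helper for illustration only; not part of LLaMA-Factory.
BOS = "<|begin_of_text|>"
EOT = "<|eot_id|>"


def render_llama3_prompt(messages):
    """Render chat messages into a Llama 3 style prompt string."""
    parts = [BOS]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}{EOT}"
        )
    # An open assistant header asks the model to write the next reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_llama3_prompt([{"role": "user", "content": "Hello"}]))
```

If inference used a different template from training, the model would receive a prompt layout it was not fine-tuned on, which is why the two settings have to agree.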
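Creating the chat model instance:
- chat_model = ChatModel(args) creates a ChatModel instance with the specified parameters.
Initializing the message list:
- messages is an empty list that stores the user and assistant messages.
Printing the welcome message and entering the loop:
- Inside the loop, the program prompts the user for input.
- If the user enters exit, the loop ends and the program finishes.
- If the user enters clear, the program empties the message list, calls torch_gc() to free memory, and prints a notice that the history has been removed.
Handling user input:
- The user's message is appended to the messages list.
- The program then calls chat_model.stream_chat(messages) to generate the assistant's response.
- The response is printed piece by piece as it is generated, and the full text is then appended to the messages list.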
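The turn handling described above can be tried without loading a model. The sketch below replaces chat_model.stream_chat with a stand-in generator; both fake_stream_chat and run_turn are hypothetical names introduced only for this illustration.

```python
def fake_stream_chat(messages):
    """Stand-in for a streaming model: yields the reply in small chunks."""
    reply = f"You said: {messages[-1]['content']}"
    for i in range(0, len(reply), 4):
        yield reply[i:i + 4]


def run_turn(messages, query, stream_fn):
    """Append the user query, stream the reply, and record it in the history."""
    messages.append({"role": "user", "content": query})
    response = ""
    for new_text in stream_fn(messages):
        print(new_text, end="", flush=True)  # show text as soon as it arrives
        response += new_text
    print()
    messages.append({"role": "assistant", "content": response})
    return response


if __name__ == "__main__":
    history = []
    run_turn(history, "hello", fake_stream_chat)
    run_turn(history, "how are you?", fake_stream_chat)
    print(len(history))  # two user messages and two assistant messages
```

Because the whole history is passed on every turn, the model sees earlier exchanges as context, and clearing the list is what resets the conversation.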
Calling torch_gc() after the loop ends:
- The program calls torch_gc() once more to free any memory that is no longer in use.
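As a rough illustration of this kind of cleanup, the sketch below shows what a helper in the spirit of torch_gc might do. This is an assumption for illustration, not LLaMA-Factory's actual implementation. The name free_memory is hypothetical, and the PyTorch import is optional so the sketch also runs where PyTorch is not installed.

```python
import gc


def free_memory():
    """Collect unreachable Python objects, then release cached GPU memory if possible."""
    collected = gc.collect()
    try:
        import torch  # optional: the sketch still runs without PyTorch installed
    except ImportError:
        return collected
    if torch.cuda.is_available():
        # Returns cached, unused blocks to the GPU driver; live tensors are untouched.
        torch.cuda.empty_cache()
    return collected


if __name__ == "__main__":
    print(f"unreachable objects collected: {free_memory()}")
```

Running the Python collector first matters because tensors that are still referenced by unreachable objects cannot have their GPU memory released until those objects are gone.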

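This code demonstrates how to build a simple chat application on top of the fine-tuned Llama 3 model, using 4-bit quantization and the Unsloth LoRA optimization to reduce memory use and speed up generation.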

The log provided on the official site is:

/content/LLaMA-Factory
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
[INFO|tokenization_utils_base.py:2087] 2024-05-18 14:30:42,715 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/2950abc9d0b34ddd43fd52bbf0d7dca82807ce96/tokenizer.json
[INFO|tokenization_utils_base.py:2087] 2024-05-18 14:30:42,716 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2087] 2024-05-18 14:30:42,718 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/2950abc9d0b34ddd43fd52bbf0d7dca82807ce96/special_tokens_map.json
[INFO|tokenization_utils_base.py:2087] 2024-05-18 14:30:42,719 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/2950abc9d0b34ddd43fd52bbf0d7dca82807ce96/tokenizer_config.json
[WARNING|logging.py:314] 2024-05-18 14:30:43,133 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
05/18/2024 14:30:43 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
INFO:llamafactory.data.template:Replace eos token: <|eot_id|>
[INFO|configuration_utils.py:726] 2024-05-18 14:30:43,246 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/2950abc9d0b34ddd43fd52bbf0d7dca82807ce96/config.json
[INFO|configuration_utils.py:789] 2024-05-18 14:30:43,249 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.2",
  "use_cache": true,
  "vocab_size": 128256
}

05/18/2024 14:30:43 - INFO - llamafactory.model.utils.quantization - Loading ?-bit BITSANDBYTES-quantized model.
INFO:llamafactory.model.utils.quantization:Loading ?-bit BITSANDBYTES-quantized model.
05/18/2024 14:30:43 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
INFO:llamafactory.model.patcher:Using KV cache for faster generation.
05/18/2024 14:30:43 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
INFO:llamafactory.model.adapter:Upcasting trainable params to float32.
05/18/2024 14:30:43 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
INFO:llamafactory.model.adapter:Fine-tuning method: LoRA
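The LlamaConfig printed in the log above is enough to work out the model's parameter count and a rough weight-memory estimate. The sketch below is a back-of-the-envelope calculation under stated assumptions: it counts only the base model weights (no LoRA adapter), and the 4-bit figure ignores quantization constants and any layers left unquantized (embeddings typically stay in higher precision), so real memory use will be higher. The function name count_llama_params is hypothetical.

```python
def count_llama_params(cfg):
    """Count the weights of a Llama-style decoder from its config values."""
    hidden = cfg["hidden_size"]
    head_dim = hidden // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]  # grouped-query attention

    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o and k, v projections
    mlp = 3 * hidden * cfg["intermediate_size"]  # gate, up and down projections
    norms = 2 * hidden  # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * hidden
    # With untied embeddings the output head has its own matrix.
    lm_head = 0 if cfg["tie_word_embeddings"] else embeddings
    final_norm = hidden
    return cfg["num_hidden_layers"] * per_layer + embeddings + lm_head + final_norm


# Values copied from the LlamaConfig in the log.
config = {
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "tie_word_embeddings": False,
    "vocab_size": 128256,
}

total = count_llama_params(config)
print(f"parameters:    {total:,}")
print(f"bf16 weights:  {total * 2 / 2**30:.1f} GiB")    # 2 bytes per weight
print(f"4-bit weights: {total * 0.5 / 2**30:.1f} GiB")  # 0.5 bytes per weight
```

The count comes to 8,030,261,248 parameters, which is where the "8B" in the model name comes from, and the two memory lines show why 4-bit loading is attractive on a single Colab GPU.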


From: https://blog.csdn.net/duan_zhihua/article/details/139173930
