GLM-4v-9B Source Code Analysis (Part 4)
GLM-4-9B Chat dialogue model fine-tuning
In this demo, you will experience how to fine-tune the GLM-4-9B-Chat open-source model (for the GLM-4V-9B visual understanding model, only the solutions listed in the table below are supported). Please follow the steps in this document strictly to avoid unnecessary errors.
Hardware check
The data in this document were tested in the following hardware environment. The actual operating environment requirements and the GPU memory used at runtime may differ slightly; please refer to your actual environment. The resource usage of fine-tuning is set according to the configuration files in the configs folder.
Test hardware information:
- OS: Ubuntu 22.04
- Memory: 512GB
- Python: 3.10.12 / 3.12.3 (currently, you need to install nltk from the git source code if you use Python 3.12.3)
- CUDA Version: 12.3
- GPU Driver: 535.104.05
- GPU: NVIDIA A100-SXM4-80GB * 8
| Fine-tuning Model | Fine-tuning solution | GPU memory usage | Weight save point size |
|---|---|---|---|
| GLM-4-9B-Chat | lora (PEFT) | 22G | 17M |
| GLM-4-9B-Chat | p-tuning v2 (PEFT) | 21G | 121M |
| GLM-4-9B-Chat | SFT (Zero3 method) | 80G (each GPU, needs 8 GPUs) | 20G |
| GLM-4V-9B | lora (PEFT), includes EVA2CLIPModel | 75G | 37M |
| GLM-4V-9B | SFT | Not supported in this code | 28G |
GLM-4V-9B fine-tuning does not work properly with deepspeed; the official fine-tuning script only implements the most basic fine-tuning solution, and further optimizations are left for developers to explore on their own.
Before starting fine-tuning, please install the dependencies in basic_demo and clone the latest model repository (Hugging Face) first. You also need to install the dependencies in this directory:
pip install -r requirements.txt
NOTE: Some code in NLTK 3.8.1 might not yet be compatible with Python 3.12. For adaptation methods in such cases, please refer to issue #38.
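If you are on Python 3.12, one possible way to install nltk from the git source code (an assumed command based on the note above; check issue #38 for the recommended fix) is:
pip install git+https://github.com/nltk/nltk.git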
Multi-round dialogue format
The multi-round dialogue fine-tuning example uses the GLM-4 dialogue format convention, adding a different loss_mask to each role so that the loss for all of the assistant's replies in one conversation is computed in a single pass (a minimal masking sketch follows the format rules below).
For data files, the sample uses the following format:
[
  {
    "messages": [
      {
        "role": "system",
        "content": "<system prompt text>",
        "tools": [
          {
            "name": "<tool name>",
            "args": {
              "<arg name>": "<arg value>"
            }
          }
          // Add more tools if needed
        ]
      },
      {
        "role": "user",
        "content": "<user prompt text>"
      },
      {
        "role": "assistant",
        "content": "<assistant response text>"
      },
      // If Tool Using
      {
        "role": "user",
        "content": "<user prompt text>"
      },
      {
        "role": "assistant",
        "content": "<assistant response text>"
      },
      {
        "role": "observation",
        "content": "<observation prompt text>"
      },
      {
        "role": "assistant",
        "content": "<assistant response observation>"
      },
      // Multi_turns
      {
        "role": "user",
        "content": "<user prompt text>"
      },
      {
        "role": "assistant",
        "content": "<assistant response text>"
      }
    ]
  }
]
This is a sample without tools:
{
  "messages": [
    {
      "role": "user",
      "content": "类型#裤*材质#牛仔布*风格#性感"
    },
    {
      "role": "assistant",
      "content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质,其柔然的手感和细腻的质地,在穿着舒适的同时,透露着清纯甜美的个性气质。除此之外,流畅的裤身剪裁将性感的腿部曲线彰显的淋漓尽致,不失为一款随性出街的必备单品。"
    }
  ]
}
This is a sample with tools:
{
  "messages": [
    {
      "role": "system",
      "content": "",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_recommended_books",
            "description": "Get recommended books based on user's interests",
            "parameters": {
              "type": "object",
              "properties": {
                "interests": {
                  "type": "array",
                  "items": {
                    "type": "string"
                  },
                  "description": "The interests to recommend books for"
                }
              },
              "required": [
                "interests"
              ]
            }
          }
        }
      ]
    },
    {
      "role": "user",
      "content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."
    },
    {
      "role": "assistant",
      "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"
    },
    {
      "role": "observation",
      "content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"
    },
    {
      "role": "assistant",
      "content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."
    }
  ]
}
This is a sample with VQA Task:
{
  "messages": [
    {
      "role": "user",
      "content": "图片中的动物是什么?",
      "image": "/root/images/0001.jpg"
    },
    {
      "role": "assistant",
      "content": "图片中有一只猫。"
    },
    {
      "role": "user",
      "content": "图片中的猫在做什么?"
    },
    {
      "role": "assistant",
      "content": "这只猫坐在或站在桌子上,桌上有很多食物。"
    }
  ]
}
- The `system` role is optional, but if it exists, it must appear before the `user` role, and the `system` role can only appear once in a complete conversation (whether it is a single-round or a multi-round conversation).
- The `tools` field is optional, but if it exists, it must appear after the `system` role, and the `tools` field can only appear once in a complete conversation (whether it is a single-round or a multi-round conversation). When the `tools` field exists, the `system` role must exist and its `content` field must be empty.
- GLM-4V-9B does not support the `tools` field or the `system` field. The `image` field must be placed in the first message and must contain the absolute path of the image.
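To make the loss_mask convention above concrete, here is a minimal, hypothetical sketch (not the repository's finetune.py) of per-role loss masking: only tokens produced by the assistant keep their labels, while system / user / observation tokens are set to -100 so that PyTorch's cross-entropy loss ignores them.

def build_labels(messages, tokenizer, ignore_index=-100):
    """Hypothetical sketch: the real finetune.py also inserts GLM-4 role and
    special tokens; only the loss-masking idea itself is shown here."""
    input_ids, labels = [], []
    for msg in messages:
        ids = tokenizer(msg["content"], add_special_tokens=False)["input_ids"]
        input_ids += ids
        if msg["role"] == "assistant":
            labels += ids                          # assistant replies are supervised
        else:
            labels += [ignore_index] * len(ids)    # -100 is ignored by CrossEntropyLoss
    return input_ids, labels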
Configuration file
The fine-tuning configuration files are located in the config directory and include the following files:
- ds_zereo_2 / ds_zereo_3.json: deepspeed configuration files.
- lora.yaml / ptuning_v2.yaml / sft.yaml: configuration files for the different fine-tuning modes, covering model parameters, optimizer parameters, training parameters, etc. Some important parameters are explained as follows (a short sketch of how the peft_config values map onto PEFT follows this list):
- data_config section
- train_file: File path of training dataset.
- val_file: File path of validation dataset.
- test_file: File path of test dataset.
- num_proc: Number of processes to use when loading data.
- max_input_length: Maximum length of input sequence.
- max_output_length: Maximum length of output sequence.
- training_args section
- output_dir: Directory for saving model and other outputs.
- max_steps: Maximum number of training steps.
- per_device_train_batch_size: Training batch size per device (such as GPU).
- dataloader_num_workers: Number of worker threads to use when loading data.
- remove_unused_columns: Whether to remove unused columns in data.
- save_strategy: Model saving strategy (for example, how many steps to save).
- save_steps: How many steps to save the model.
- log_level: Log level (such as info).
- logging_strategy: logging strategy.
- logging_steps: how many steps to log at.
- per_device_eval_batch_size: per-device evaluation batch size.
- evaluation_strategy: evaluation strategy (e.g. how many steps to evaluate at).
- eval_steps: how many steps to evaluate at.
- predict_with_generate: whether to use generation mode for prediction.
- generation_config section
- max_new_tokens: maximum number of new tokens to generate.
- peft_config section
- peft_type: type of parameter tuning to use (supports LORA and PREFIX_TUNING).
- task_type: task type, here is causal language model (don't change).
- Lora parameters:
- r: rank of LoRA.
- lora_alpha: scaling factor of LoRA.
- lora_dropout: dropout probability to use in LoRA layer.
- P-TuningV2 parameters:
- num_virtual_tokens: the number of virtual tokens.
- num_attention_heads: 2: the number of attention heads of P-TuningV2 (do not change).
- token_dim: 256: the token dimension of P-TuningV2 (do not change).
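For reference, the values in the peft_config section map directly onto PEFT's LoraConfig; a minimal sketch with placeholder values (not the ones shipped in configs/lora.yaml) looks like this:

from peft import LoraConfig, TaskType

# Placeholder values for illustration only; read the real ones from configs/lora.yaml.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # task_type: causal language model (do not change)
    r=8,                           # r: rank of LoRA
    lora_alpha=32,                 # lora_alpha: scaling factor of LoRA
    lora_dropout=0.1,              # lora_dropout: dropout probability in the LoRA layers
)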
Start fine-tuning
Execute a single-machine multi-card / multi-machine multi-card run with the following commands, which use deepspeed as the acceleration solution; you need to install deepspeed first.
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yaml # For Chat Fine-tune
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
Execute a single-machine single-card run with the following commands.
python finetune.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yaml # For Chat Fine-tune
python finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
Fine-tune from a saved point
If you train as described above, each fine-tuning run will start from scratch. If you want to continue fine-tuning from a partially trained model, you can add a fourth parameter, which can be passed in two ways:
- yes: automatically resume training from the last saved checkpoint.
- XX: a checkpoint number, for example 600, to resume training from Checkpoint 600.
For example, this is an example command to continue fine-tuning from the last saved point:
python finetune.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yaml yes
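Similarly, to resume from a specific checkpoint, pass its number as the fourth parameter, for example:
python finetune.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yaml 600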
Use the fine-tuned model
Verify the fine-tuned model in inference.py
You can use the fine-tuned model in finetune_demo/inference.py, and you can easily test it with just one line of code.
python inference.py your_finetune_path
In this way, the answer you get is the fine-tuned answer.
Use the fine-tuned model in other demos in this repository or external repositories
You can use our LoRA and fully fine-tuned models in any demo. This requires you to modify the code yourself according to the following tutorial.
- Replace the way the demo reads the model with the way the model is read in finetune_demo/inference.py.
Please note that for LoRA and P-TuningV2, we did not merge the trained models, but recorded the fine-tuned path in adapter_config.json. If the location of your original model changes, you should modify the path of base_model_name_or_path in adapter_config.json.
# Excerpt from finetune_demo/inference.py; ModelType, TokenizerType and _resolve_path
# are helpers defined earlier in that file.
from pathlib import Path
from typing import Union

from peft import AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model_and_tokenizer(
        model_dir: Union[str, Path], trust_remote_code: bool = True
) -> tuple[ModelType, TokenizerType]:
    model_dir = _resolve_path(model_dir)
    if (model_dir / 'adapter_config.json').exists():
        # Adapter checkpoint (LoRA / P-Tuning v2): load through PEFT and take the
        # tokenizer from the base model recorded in adapter_config.json.
        model = AutoPeftModelForCausalLM.from_pretrained(
            model_dir, trust_remote_code=trust_remote_code, device_map='auto'
        )
        tokenizer_dir = model.peft_config['default'].base_model_name_or_path
    else:
        # Full checkpoint: model and tokenizer live in the same directory.
        model = AutoModelForCausalLM.from_pretrained(
            model_dir, trust_remote_code=trust_remote_code, device_map='auto'
        )
        tokenizer_dir = model_dir
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_dir, trust_remote_code=trust_remote_code
    )
    return model, tokenizer
- Read the fine-tuned model. Please note that you should use the location of the fine-tuned model. For example, if your model location is /path/to/finetune_adapter_model and the original model address is path/to/base_model, you should use /path/to/finetune_adapter_model as model_dir.
- After completing the above operations, you can use the fine-tuned model normally; other calling methods remain unchanged (a minimal usage sketch follows this list).
- This fine-tuning script has not been tested on long texts of 128K or 1M tokens. Fine-tuning on long texts requires GPU devices with larger memory and more efficient fine-tuning solutions, which developers need to handle on their own.
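Putting the pieces together, here is a minimal usage sketch (with a hypothetical adapter path) that assumes the load_model_and_tokenizer helper shown above:

# Hypothetical path; point it at your own fine-tuned adapter directory.
model, tokenizer = load_model_and_tokenizer('/path/to/finetune_adapter_model')
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "类型#裤*材质#牛仔布*风格#性感"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))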
Reference
@inproceedings{liu2022p,
  title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks},
  author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
  pages={61--68},
  year={2022}
}

@misc{tang2023toolalpaca,
  title={ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases},
  author={Qiaoyu Tang and Ziliang Deng and Hongyu Lin and Xianpei Han and Qiao Liang and Le Sun},
  year={2023},
  eprint={2306.05301},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
.\chatglm4-finetune\intel_device_demo\itrex\itrex_cli_demo.py
"""
该脚本创建一个命令行接口(CLI)演示,使用 transformers 后端,适用于 glm-4-9b 模型,结合 Intel® Extension for Transformers
"""
# Import the operating-system module
import os
# Read the 'MODEL_PATH' environment variable, falling back to 'THUDM/glm-4-9b-chat'
MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/glm-4-9b-chat')

# Import PyTorch
import torch
# Import the Thread class from the threading module
from threading import Thread
# Import AutoModelForCausalLM from intel_extension_for_transformers
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
# Import the required classes from transformers
from transformers import TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria, AutoTokenizer


# Define the stopping-criteria class, inheriting from StoppingCriteria
class StopOnTokens(StoppingCriteria):
    # Override __call__ to check whether generation should stop
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Token IDs that should stop generation
        stop_ids = [151329, 151336, 151338]
        # Iterate over the stop IDs
        for stop_id in stop_ids:
            # If the last generated token is a stop ID, return True
            if input_ids[0][-1] == stop_id:
                return True
        # No stop ID matched, so keep generating
        return False


# Initialize the model and tokenizer
def initialize_model_and_tokenizer():
    # Load the tokenizer from the pretrained model path, trusting remote code
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    # Load the causal language model on CPU, trusting remote code, in 4-bit mode
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="cpu",  # Run inference on the Intel CPU
        trust_remote_code=True,
        load_in_4bit=True
    )
    # Return the loaded tokenizer and model
    return tokenizer, model


# Read one line of user input
def get_user_input():
    # Prompt the user and return the entered text
    return input("\nUser: ")


# Main entry point
def main():
    # Initialize the model and tokenizer
    tokenizer, model = initialize_model_and_tokenizer()
    # Conversation history
    history = []
    # Maximum number of new tokens to generate
    max_length = 100
    # Top-p sampling parameter
    top_p = 0.9
    # Temperature parameter
    temperature = 0.8
    # Instantiate the stopping criteria
    stop = StopOnTokens()

    # Print the welcome message
    print("Welcome to the CLI chat. Type your messages below.")
    # Loop until the user chooses to exit
    while True:
        # Get the user's input
        user_input = get_user_input()
        # Check for an exit command
        if user_input.lower() in ["exit", "quit"]:
            break
        # Append the user input to the history with an empty model response
        history.append([user_input, ""])

        # Message list holding the user/assistant conversation
        messages = []
        # Walk the history and collect user and assistant messages
        for idx, (user_msg, model_msg) in enumerate(history):
            # For the latest user message with no model reply yet, add it and stop
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            # Add the user message if present
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            # Add the model message if present
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})

        # Apply the chat template and return the model input tensor
        model_inputs = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,  # Add the generation prompt
            tokenize=True,               # Tokenize the content
            return_tensors="pt"          # Return PyTorch tensors
        )
        # Create a streamer for token-by-token output
        streamer = TextIteratorStreamer(
            tokenizer=tokenizer,       # Tokenizer to use
            timeout=60,                # 60-second timeout
            skip_prompt=True,          # Skip the prompt
            skip_special_tokens=True   # Skip special tokens
        )
        # Generation parameters
        generate_kwargs = {
            "input_ids": model_inputs,                          # Model input tensor
            "streamer": streamer,                               # Streamer to use
            "max_new_tokens": max_length,                       # Maximum number of new tokens
            "do_sample": True,                                  # Enable sampling
            "top_p": top_p,                                     # Nucleus-sampling threshold
            "temperature": temperature,                         # Randomness of generation
            "stopping_criteria": StoppingCriteriaList([stop]),  # Stopping criteria
            "repetition_penalty": 1.2,                          # Repetition penalty
            "eos_token_id": model.config.eos_token_id,          # End-of-sequence token ID
        }
        # Run generation in a separate thread
        t = Thread(target=model.generate, kwargs=generate_kwargs)
        # Start the thread
        t.start()

        # Print the assistant prompt and keep the cursor on the same line
        print("Assistant:", end="", flush=True)
        # Read newly generated tokens from the streamer and print them
        for new_token in streamer:
            if new_token:
                print(new_token, end="", flush=True)  # Print the new token
                history[-1][1] += new_token           # Append it to the latest model message

        # Strip leading/trailing whitespace from the latest model message
        history[-1][1] = history[-1][1].strip()


# Run main() when the script is executed directly
if __name__ == "__main__":
    # Call the main function
    main()
Using Intel® Extension for Transformers to run inference on the GLM-4-9B-Chat model
This example shows how to run inference on the GLM-4-9B-Chat model with Intel® Extension for Transformers.
Device and dependency check
Relevant inference test data
The data in this document were tested in the following hardware environment. The actual operating environment requirements and the memory used at runtime differ slightly; please refer to your actual environment.
Test hardware information:
- OS: Ubuntu 22.04 (this tutorial must be run in a Linux environment)
- Memory: 512GB
- Python: 3.10.12
- CPU: Intel(R) Xeon(R) Platinum 8358 CPU / 12th Gen Intel i5-12400
Install dependencies
Before starting inference, please first install the dependencies in basic_demo, and you also need to install the dependencies in this directory:
pip install -r requirements.txt
Run model inference
python itrex_cli_demo.py
If this is your first inference run, the model weights are converted once; the converted weights are stored in the runtime_outputs folder, which consumes about 60G of disk space.
After the conversion, the folder contains two files:
- ne_chatglm2_f32.bin, 52G (if you do not run inference in FP32, you can delete this file)
- ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin, 8.1G
If this is not your first inference run, this step is skipped and the conversation starts directly. The inference output looks like this:
Welcome to the CLI chat. Type your messages below.
User: 你好
AVX:1 AVX2:1 AVX512F:1 AVX512BW:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.900, continuous_batching: 0, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model_file_loader: loading model from runtime_outs/ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 151552
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 0
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 0
load_ne_hparams 5.hparams.n_layer = 40
load_ne_hparams 6.hparams.n_rot = 0
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 131072
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 2
load_ne_hparams 15.hparams.ffn_hidden_size = 13696
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000000
load_ne_hparams 21.hparams.freq_base = 5000000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 151329
load_ne_vocab 28.vocab.pad_token_id = 151329
load_ne_vocab 29.vocab.sep_token_id = -1
init: hparams.n_vocab = 151552
init: hparams.n_embd = 4096
init: hparams.n_mult = 0
init: hparams.n_head = 32
init: hparams.n_layer = 40
init: hparams.n_rot = 0
init: hparams.ffn_hidden_size = 13696
init: n_parts = 1
load: ctx size = 16528.38 MB
load: layers[0].ffn_fusion = 1
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 26768.38 MB (+ memory per state)
.............................................................................................
model_init_from_file: support_bestla_kv = 1
kv_cache_init: run_mha_reordered = 1
model_init_from_file: kv self size = 690.00 MB
Assistant:
你好