书生大模型第四期 | 基础岛 task5 XTuner 微调个人小助手认知任务(包含swanlab可视化与模型上传modelscope)

标签：task5 模型 modelscope internlm2 xtuner dict type e3 7b

环境配置与数据准备

步骤 0.使用 conda 先构建一个 Python-3.10 的虚拟环境

cd ~
#git clone 本repo
git clone https://github.com/InternLM/Tutorial.git -b camp4
mkdir -p /root/finetune && cd /root/finetune
conda create -n xtuner-env python=3.10 -y
conda activate xtuner-env

附：如果conda创建虚拟环境报错 CondaVerificationError如下，需要参考 QA文档https://aicarrier.feishu.cn/wiki/AElrwH08WiCG9HkG5iacjlOJnoc 将容器重置

CondaVerificationError: The package for ncurses located at /root/.conda/pkgs/ncurses-6.4-h6a678d5_0 appears to be corrupted. The path 'share/terminfo/z/ztx' specified in the package manifest cannot be found. 
CondaVerificationError: The package for ncurses located at /root/.conda/pkgs/ncurses-6.4-h6a678d5_0 appears to be corrupted. The path 'share/terminfo/z/ztx-1-a' specified in the package manifest cannot be found. 
CondaVerificationError: The package for ncurses located at /root/.conda/pkgs/ncurses-6.4-h6a678d5_0 appears to be corrupted. The path 'share/terminfo/z/ztx11' specified in the package manifest cannot be found.

步骤 1. 安装 XTuner

此处推荐源码安装，更多的安装方法请回到前面看 XTuner 文档

git clone https://github.com/InternLM/xtuner.git
cd /root/finetune/xtuner

pip install  -e '.[all]'
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.39.0

![[Pasted image 20241125141734.png]]

![[Pasted image 20241125141757.png]]

附：如果需要安装可视化，可以安装swanlab

conda activate xtuner-env
pip install swanlab

验证安装

为了验证 XTuner 是否安装正确，我们将使用命令打印配置文件。

打印配置文件： 在命令行中使用 xtuner list-cfg 验证是否能打印配置文件列表。

xtuner list-cfg

输出没有报错则为此结果

xtuner list-cfg ==========================CONFIGS=========================== baichuan2_13b_base_full_custom_pretrain_e1 baichuan2_13b_base_qlora_alpaca_e3 baichuan2_13b_base_qlora_alpaca_enzh_e3 baichuan2_13b_base_qlora_alpaca_enzh_oasst1_e3 ... internlm2_1_8b_full_alpaca_e3 internlm2_1_8b_full_custom_pretrain_e1 internlm2_1_8b_qlora_alpaca_e3 internlm2_20b_full_custom_pretrain_e1 internlm2_20b_full_finetune_custom_dataset_e1 internlm2_20b_qlora_alpaca_e3 internlm2_20b_qlora_arxiv_gentitle_e3 internlm2_20b_qlora_code_alpaca_e3 internlm2_20b_qlora_colorist_e5 internlm2_20b_qlora_lawyer_e3 internlm2_20b_qlora_msagent_react_e3_gpu8 internlm2_20b_qlora_oasst1_512_e3 internlm2_20b_qlora_oasst1_e3 internlm2_20b_qlora_sql_e3 internlm2_5_chat_20b_alpaca_e3 internlm2_5_chat_20b_qlora_alpaca_e3 internlm2_5_chat_7b_full_finetune_custom_dataset_e1 internlm2_5_chat_7b_qlora_alpaca_e3 internlm2_5_chat_7b_qlora_oasst1_e3 internlm2_7b_full_custom_pretrain_e1 internlm2_7b_full_finetune_custom_dataset_e1 internlm2_7b_full_finetune_custom_dataset_e1_sequence_parallel_4 internlm2_7b_qlora_alpaca_e3 internlm2_7b_qlora_arxiv_gentitle_e3 internlm2_7b_qlora_code_alpaca_e3 internlm2_7b_qlora_colorist_e5 internlm2_7b_qlora_json_e3 internlm2_7b_qlora_lawyer_e3 internlm2_7b_qlora_msagent_react_e3_gpu8 internlm2_7b_qlora_oasst1_512_e3 internlm2_7b_qlora_oasst1_e3 internlm2_7b_qlora_sql_e3 ...

输出内容为 XTuner 支持微调的模型

修改提供的数据

步骤 0. 创建一个新的文件夹用于存储微调数据

mkdir -p /root/finetune/data && cd /root/finetune/data
cp -r /root/Tutorial/data/assistant_Tuner.jsonl  /root/finetune/data

步骤 1. 编辑&执行脚本

在/root/finetune/data/中编辑change_script.py，这个脚本主要是替换个人昵称

import json
import argparse
from tqdm import tqdm

def process_line(line, old_text, new_text):
    # 解析 JSON 行
    data = json.loads(line)
    
    # 递归函数来处理嵌套的字典和列表
    def replace_text(obj):
        if isinstance(obj, dict):
            return {k: replace_text(v) for k, v in obj.items()}
        elif isinstance(obj, list):
            return [replace_text(item) for item in obj]
        elif isinstance(obj, str):
            return obj.replace(old_text, new_text)
        else:
            return obj
    
    # 处理整个 JSON 对象
    processed_data = replace_text(data)
    
    # 将处理后的对象转回 JSON 字符串
    return json.dumps(processed_data, ensure_ascii=False)

def main(input_file, output_file, old_text, new_text):
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        
        # 计算总行数用于进度条
        total_lines = sum(1 for _ in infile)
        infile.seek(0)  # 重置文件指针到开头
        
        # 使用 tqdm 创建进度条
        for line in tqdm(infile, total=total_lines, desc="Processing"):
            processed_line = process_line(line.strip(), old_text, new_text)
            outfile.write(processed_line + '\n')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Replace text in a JSONL file.")
    parser.add_argument("input_file", help="Input JSONL file to process")
    parser.add_argument("output_file", help="Output file for processed JSONL")
    parser.add_argument("--old_text", default="尖米", help="Text to be replaced")
    parser.add_argument("--new_text", default="dblab2", help="Text to replace with")
    args = parser.parse_args()

    main(args.input_file, args.output_file, args.old_text, args.new_text)

执行脚本

# usage：python change_script.py {input_file.jsonl} {output_file.jsonl}

cd ~/finetune/data
python change_script.py ./assistant_Tuner.jsonl ./assistant_Tuner_change.jsonl

assistant_Tuner_change.jsonl 是修改后符合 XTuner 格式的训练数据。

此时 data 文件夹下应该有如下结构

├── assistant_Tuner.jsonl
├── assistant_Tuner_change.jsonl
└── change_script.py

步骤 3. 查看数据

cat assistant_Tuner_change.jsonl | head -n 4

![[Pasted image 20241125143906.png]]

这个数据集一共1500条

训练启动

步骤 0. 复制模型

在InternStudio开发机中的已经提供了微调模型，可以直接软链接即可。

本模型位于/root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat

mkdir /root/finetune/models
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat /root/finetune/models/internlm2_5-7b-chat

步骤 1. 修改 Config

获取官方写好的 config

# cd {path/to/finetune}
cd /root/finetune
mkdir ./config
cd config
xtuner copy-cfg internlm2_5_chat_7b_qlora_alpaca_e3 ./

配置修改为下面

# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/finetune/models/internlm2_5-7b-chat'
use_varlen_attn = False

# Data
alpaca_en_path = '/root/finetune/data/assistant_Tuner_change.jsonl'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True

# parallel
sequence_parallel_size = 1

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 500
save_total_limit = 2  # Maximum checkpoints to keep (-1 means unlimited)

# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
    '请介绍一下你自己', 'Please introduce yourself'
]

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=dict(
            type=BitsAndBytesConfig,
            load_in_4bit=True,
            load_in_8bit=False,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

sampler = SequenceParallelSampler \
    if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_en,
    sampler=dict(type=sampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed evrionment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
from mmengine.visualization import Visualizer
from swanlab.integration.mmengine import SwanlabVisBackend

visualizer = dict(
  type=Visualizer,
  vis_backends=[dict(
        type=SwanlabVisBackend,
        init_kwargs=dict(project='InternLM-camp4', experiment_name='XTuner 微调个人小助手认知任务',description='学习率=',mode='local',logdir='/root/finetune/swanlog'),
    )])

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)

步骤 2. 启动微调

如果没有tmux，需要apt install tmux

tmux new -s train_session
conda activate xtuner-env
cd /root/finetune
xtuner train ./config/internlm2_5_chat_7b_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero2 --work-dir ./work_dirs/assistTuner

观察实验

swanlab watch "./swanlog" -h 0.0.0.0 -p 9091

通过vscode进行端口转发
![[Pasted image 20241125162455.png]]

打开浏览器访问http://127.0.0.1:9091 观察训练中的数据，可以看到loss总体再下降
![[Pasted image 20241125162520.png]]

训练显存占用
![[Pasted image 20241125162750.png]]

训练完成
![[Pasted image 20241125172450.png]]

通过观察日志检验训练成果

没有训练之前：
![[Pasted image 20241125195921.png]]

多轮训练之后：
![[Pasted image 20241125195948.png]] 在这里插入图片描述

步骤 3. 权重转换

模型转换的本质其实就是将原本使用 Pytorch 训练出来的模型权重文件转换为目前通用的 HuggingFace 格式文件，那么我们可以通过以下命令来实现一键转换。

我们可以使用 xtuner convert pth_to_hf 命令来进行模型格式转换。

xtuner convert pth_to_hf 命令用于进行模型格式转换。该命令需要三个参数：CONFIG 表示微调的配置文件， PATH_TO_PTH_MODEL 表示微调的模型权重文件路径，即要转换的模型权重， SAVE_PATH_TO_HF_MODEL 表示转换后的 HuggingFace 格式文件的保存路径。

除此之外，我们其实还可以在转换的命令中添加几个额外的参数，包括：

参数名	解释
–fp32	代表以fp32的精度开启，假如不输入则默认为fp16
–max-shard-size {GB}	代表每个权重文件最大的大小（默认为2GB）
执行模型转换

cd /root/finetune/work_dirs/assistTuner

conda activate xtuner-env

# 先获取最后保存的一个pth文件
pth_file=`ls -t /root/finetune/work_dirs/assistTuner/*.pth | head -n 1 | sed 's/:$//'`
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert pth_to_hf ./internlm2_5_chat_7b_qlora_alpaca_e3_copy.py ${pth_file} ./hf

# log
-----------------------------------------------------
Load Checkpoints: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:58<00:00, 58.14s/it]
Detected checkpoint of type zero stage 2, world_size: 1
/root/finetune/xtuner/xtuner/utils/zero_to_any_dtype.py:113: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(file, map_location=device)
Parsing checkpoint created by deepspeed==0.15.4
Reconstructed state dict with 322 params 157179904 elements
Load State Dict: 100%|███████████████████████████████████████████████████████████████████| 322/322 [00:02<00:00, 129.95it/s]
11/25 20:08:11 - mmengine - INFO - Load PTH model from /root/finetune/work_dirs/assistTuner/iter_870.pth
11/25 20:08:12 - mmengine - INFO - Convert LLM to float16
11/25 20:08:12 - mmengine - INFO - Saving adapter to ./hf
11/25 20:08:16 - mmengine - INFO - All done!

hf目录结构如下：

├── hf
│   ├── README.md
│   ├── adapter_config.json
│   ├── adapter_model.bin
│   └── xtuner_config.py

步骤 4. 模型合并

对于 LoRA 或者 QLoRA 微调出来的模型其实并不是一个完整的模型，而是一个额外的层（Adapter），训练完的这个层最终还是要与原模型进行合并才能被正常的使用。

对于全量微调的模型（full）其实是不需要进行整合这一步的，因为全量微调修改的是原模型的权重而非微调一个新的 Adapter ，因此是不需要进行模型整合的。

在 XTuner 中提供了一键合并的命令 xtuner convert merge，在使用前我们需要准备好三个路径，包括原模型的路径、训练好的 Adapter 层的（模型格式转换后的）路径以及最终保存的路径。

xtuner convert merge命令用于合并模型。该命令需要三个参数：LLM 表示原模型路径，ADAPTER 表示 Adapter 层的路径， SAVE_PATH 表示合并后的模型最终的保存路径。

在模型合并这一步还有其他很多的可选参数，包括：

参数名	解释
–max-shard-size {GB}	代表每个权重文件最大的大小（默认为2GB）
–device {device_name}	这里指的就是device的名称，可选择的有cuda、cpu和auto，默认为cuda即使用gpu进行运算
–is-clip	这个参数主要用于确定模型是不是CLIP模型，假如是的话就要加上，不是就不需要添加
执行命令：

cd /root/finetune/work_dirs/assistTuner
conda activate xtuner-env

export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert merge /root/finetune/models/internlm2_5-7b-chat ./hf ./merged --max-shard-size 2GB

# log
----------------------------
Warning: The cache directory for DeepSpeed Triton autotune, /root/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████| 8/8 [00:25<00:00,  3.15s/it]
Saving to ./merged...
All done!

模型 WebUI 对话

微调完成后，我们可以再次运行 xtuner_streamlit_demo.py 脚本来观察微调后的对话效果，不过在运行之前，我们需要将脚本中的模型路径修改为微调后的模型的路径。

cd ~/Tutorial/tools/L1_XTuner_code

# 直接修改脚本文件第18行
- model_name_or_path = "Shanghai_AI_Laboratory/internlm2_5-7b-chat"
+ model_name_or_path = "/root/finetune/work_dirs/assistTuner/merged"

然后，我们可以直接启动应用。

conda activate xtuner-env

streamlit run /root/Tutorial/tools/L1_XTuner_code/xtuner_streamlit_demo.py

用vscode转端口，访问浏览器http://localhost:8501
![[Pasted image 20241125203221.png]]

![[Pasted image 20241125203238.png]]

部署到modelscope

1.创建模型库

填写相关信息：模型信息，模型描述等，详细参考
https://modelscope.cn/docs/models/intro
![[Pasted image 20241125204030.png]]

2.git上传模型

如果没有安装git lfs需要预先安装

下载并安装 Git LFS：
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
sudo apt-get install git-lfs
设置 Git LFS：安装完 Git LFS 后，你需要运行一次 git lfs install 来设置它：git lfs install

然后git clone你的模型

git lfs install
git clone https://oauth2:YOUR-GIT-TOKEN@www.modelscope.cn/user/ my-test-model.git
# 简单的获取方式：选择自己的模型下载会给到命令链接

# log 
-----------
Cloning into 'internlm2_5_chat_7b_dblab2self'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), 1.89 KiB | 2.00 KiB/s, done.

移动模型并上传

mv merged/* internlm2_5_chat_7b_dblab2self/
cd internlm2_5_chat_7b_dblab2self

git add . # 需要等待一段时间，因为权重占空间非常大，要看进度可以到 .git/lfs/object目录下看
git commit -m 'dblab2_model'
git push

# log
-----------------------
Uploading LFS objects: 100% (10/10), 16 GB | 60 MB/s, done.                                                                                                        
Enumerating objects: 24, done.
Counting objects: 100% (24/24), done.
Delta compression using up to 128 threads
Compressing objects: 100% (22/22), done.
Writing objects: 100% (22/22), 1.41 MiB | 1.72 MiB/s, done.
Total 22 (delta 1), reused 0 (delta 0)
To https://www.modelscope.cn/learnml0/internlm2_5_chat_7b_dblab2self.git
   0a7445d..83899cc  master -> master

在这里插入图片描述
访问模型文件，可以看到模型权重等文件都上传成功了

如有疑问可以回顾L0G3000 https://github.com/InternLM/Tutorial/blob/camp4/docs/L0/git

标签：task5,模型,modelscope,internlm2,xtuner,dict,type,e3,7b
From： https://blog.csdn.net/jenken1209/article/details/144059218